
Overview

Quality scoring evaluates member output against your criteria and automatically retries until the result meets your standards. It is ideal for AI-generated content that needs to be consistent and accurate.

Basic Scoring with Retry

name: generate-quality-content
description: Generate content with automatic quality retry

flow:
  - member: generate-content
    type: Think
    config:
      provider: openai
      model: gpt-4o
      routing: cloudflare-gateway
      temperature: 0.7
    input:
      topic: ${input.topic}
    scoring:
      evaluator: validate
      evaluatorConfig:
        type: judge
        model: gpt-4o
      criteria:
        accuracy: "Content is factually accurate"
        completeness: "All key points are covered"
        clarity: "Writing is clear and easy to understand"
        grammar: "Perfect grammar and spelling"
      thresholds:
        minimum: 0.8
      onFailure: retry
      retryLimit: 5
      requireImprovement: true  # Each retry must score higher

output:
  content: ${generate-content.output.text}
  quality:
    score: ${generate-content.scoring.score}
    passed: ${generate-content.scoring.passed}
    attempts: ${generate-content.scoring.attempts}
    feedback: ${generate-content.scoring.feedback}

See Scoring Concept for a detailed explanation.

Progressive Improvement

Each retry must score higher than the previous attempt:
name: progressive-improvement
description: Content quality improves with each retry

flow:
  - member: generate-article
    type: Think
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
    scoring:
      evaluator: validate
      thresholds:
        minimum: 0.85
      requireImprovement: true
      minImprovement: 0.05  # Each retry must improve the score by at least 0.05
      onFailure: retry
      retryLimit: 5

# Example progression:
# Attempt 1: 0.70 ❌ (below threshold, retry)
# Attempt 2: 0.75 ✅ (improved, but still below threshold, retry)
# Attempt 3: 0.73 ❌ (no improvement, attempt rejected, retry)
# Attempt 4: 0.80 ✅ (improved, but still below threshold, retry)
# Attempt 5: 0.87 ✅ (improved and passed the threshold)

output:
  article: ${generate-article.output}
  finalScore: ${generate-article.scoring.score}
  attempts: ${generate-article.scoring.attempts}
  improvement: ${generate-article.scoring.score - generate-article.scoring.initialScore}
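
Conceptually, requireImprovement and minImprovement gate each retry against the best score seen so far. The sketch below illustrates that loop in TypeScript; the generate and evaluate callbacks and the option names are assumptions for illustration, not Conductor's internal API.

// Illustrative sketch of progressive-improvement retry logic (not Conductor's internal code).
interface ScoringOptions {
  minimum: number;            // thresholds.minimum
  retryLimit: number;
  requireImprovement: boolean;
  minImprovement: number;     // e.g. 0.05
}

async function retryWithImprovement(
  generate: () => Promise<string>,
  evaluate: (output: string) => Promise<number>,
  opts: ScoringOptions
): Promise<{ output: string; score: number; attempts: number; passed: boolean }> {
  let best = { output: '', score: 0 };
  for (let attempt = 1; attempt <= opts.retryLimit; attempt++) {
    const output = await generate();
    const score = await evaluate(output);

    // Attempts that fail to improve on the best score are rejected but still count.
    if (opts.requireImprovement && attempt > 1 && score < best.score + opts.minImprovement) {
      continue;
    }
    if (score > best.score) best = { output, score };

    // Stop as soon as the threshold is met.
    if (best.score >= opts.minimum) {
      return { ...best, attempts: attempt, passed: true };
    }
  }
  return { ...best, attempts: opts.retryLimit, passed: false };
}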

Rule-Based Scoring

Fast, deterministic validation:
name: validate-structured-output
description: Rule-based validation for structured data

flow:
  - member: extract-company-data
    type: Think
    config:
      provider: openai
      model: gpt-4o
      responseFormat:
        type: json_object
    scoring:
      evaluator: validate
      evaluatorConfig:
        type: rule
      criteria:
        hasName:
          rule: "output.name != null && output.name.length > 0"
          weight: 0.3
        hasIndustry:
          rule: "output.industry != null"
          weight: 0.2
        hasEmployees:
          rule: "output.employees > 0"
          weight: 0.2
        hasFounded:
          rule: "output.founded >= 1800 && output.founded <= 2024"
          weight: 0.3
      thresholds:
        minimum: 0.8
      onFailure: retry
      retryLimit: 3

output:
  companyData: ${extract-company-data.output}
  validationScore: ${extract-company-data.scoring.score}
  breakdown: ${extract-company-data.scoring.breakdown}
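
Rule scores are weighted sums: each passing rule contributes its weight, and the total is compared against thresholds.minimum. The sketch below shows that calculation with the criteria from this example; the scoreRules helper and its shape are illustrative assumptions, not Conductor's internal API.

// Minimal sketch of weighted rule scoring (illustrative only).
type Rule = { check: (output: any) => boolean; weight: number };

function scoreRules(output: any, rules: Record<string, Rule>) {
  const breakdown: Record<string, number> = {};
  let score = 0;
  for (const [name, rule] of Object.entries(rules)) {
    const passed = rule.check(output);
    breakdown[name] = passed ? rule.weight : 0;
    if (passed) score += rule.weight;
  }
  return { score, breakdown };
}

// Usage mirroring the criteria above:
const result = scoreRules(
  { name: 'Acme', industry: 'Software', employees: 120, founded: 2015 },
  {
    hasName:      { check: o => o.name != null && o.name.length > 0,     weight: 0.3 },
    hasIndustry:  { check: o => o.industry != null,                      weight: 0.2 },
    hasEmployees: { check: o => o.employees > 0,                         weight: 0.2 },
    hasFounded:   { check: o => o.founded >= 1800 && o.founded <= 2024,  weight: 0.3 },
  }
);
// result.score === 1.0 here; failing one rule drops it to 0.7 or 0.8, below/at the threshold.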

LLM Judge Scoring

AI-powered quality assessment:
name: content-quality-judge
description: AI evaluates content quality

flow:
  - member: write-blog-post
    type: Think
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
    scoring:
      evaluator: validate
      evaluatorConfig:
        type: judge
        model: claude-3-5-sonnet-20241022
        systemPrompt: |
          You are an expert content quality evaluator.
          Evaluate blog posts on multiple dimensions.
          Be strict but fair.
      criteria:
        engagement: "Content is engaging and maintains reader interest"
        structure: "Clear structure with logical flow"
        depth: "Sufficient depth and detail on the topic"
        originality: "Original insights, not generic content"
        actionability: "Provides actionable takeaways"
      thresholds:
        minimum: 0.85
      onFailure: retry
      retryLimit: 5
      requireImprovement: true

output:
  blogPost: ${write-blog-post.output}
  quality: ${write-blog-post.scoring}

NLP-Based Scoring

Score outputs on natural-language metrics such as sentiment, readability, and tone:
name: sentiment-and-readability
description: Validate sentiment and readability

flow:
  - member: generate-customer-response
    type: Think
    config:
      provider: openai
      model: gpt-4o-mini
    scoring:
      evaluator: validate
      evaluatorConfig:
        type: nlp
      criteria:
        sentiment:
          target: "positive"
          weight: 0.4
        readability:
          target: "easy"
          weight: 0.3
        tone:
          target: "professional"
          weight: 0.3
      thresholds:
        minimum: 0.8
      onFailure: retry
      retryLimit: 3

output:
  response: ${generate-customer-response.output}
  sentimentScore: ${generate-customer-response.scoring.breakdown.sentiment}
  readabilityScore: ${generate-customer-response.scoring.breakdown.readability}

Embedding Similarity

Compare against reference examples:
name: style-matching
description: Match reference style

flow:
  - member: generate-description
    type: Think
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
    scoring:
      evaluator: validate
      evaluatorConfig:
        type: embedding
        referenceExamples:
          - "Acme Corp revolutionizes enterprise software with AI-powered automation that reduces manual work by 80%. Founded in 2020, they serve over 500 Fortune 1000 companies."
          - "TechStart delivers cutting-edge cloud solutions for modern businesses. Their platform processes 10 billion transactions daily across 50 countries."
        similarityThreshold: 0.75
      thresholds:
        minimum: 0.8
      onFailure: retry
      retryLimit: 4

output:
  description: ${generate-description.output}
  similarityScore: ${generate-description.scoring.score}
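
An embedding evaluator typically embeds the output and each reference example, then scores by cosine similarity to the closest reference. The sketch below illustrates that idea; the embed helper and the max-similarity aggregation are assumptions, and the actual evaluator may aggregate differently.

// Rough sketch of embedding-similarity scoring (illustrative only).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function scoreBySimilarity(
  output: string,
  references: string[],
  embed: (text: string) => Promise<number[]>  // hypothetical embedding helper
): Promise<number> {
  const outputVec = await embed(output);
  const refVecs = await Promise.all(references.map(embed));
  // Use the closest reference example as the score.
  return Math.max(...refVecs.map(ref => cosineSimilarity(outputVec, ref)));
}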

Multi-Stage Validation

Different quality checks at each stage:
name: multi-stage-content-creation
description: Progressive quality gates

flow:
  # Stage 1: Generate outline (lenient)
  - member: generate-outline
    type: Think
    scoring:
      evaluator: validate
      evaluatorConfig:
        type: rule
      criteria:
        hasSections:
          rule: "output.sections.length >= 3"
          weight: 0.5
        hasIntro:
          rule: "output.sections.includes('Introduction')"
          weight: 0.25
        hasConclusion:
          rule: "output.sections.includes('Conclusion')"
          weight: 0.25
      thresholds:
        minimum: 0.7
      onFailure: retry
      retryLimit: 3

  # Stage 2: Generate content (moderate)
  - member: generate-content
    type: Think
    input:
      outline: ${generate-outline.output}
    scoring:
      evaluator: validate
      evaluatorConfig:
        type: judge
        model: gpt-4o
      criteria:
        completeness: "All outline sections are developed"
        coherence: "Content flows logically"
        quality: "Writing is professional"
      thresholds:
        minimum: 0.8
      onFailure: retry
      retryLimit: 5
      requireImprovement: true

  # Stage 3: Polish content (strict)
  - member: polish-content
    type: Think
    input:
      content: ${generate-content.output}
    scoring:
      evaluator: validate
      evaluatorConfig:
        type: judge
        model: claude-3-5-sonnet-20241022
      criteria:
        grammar: "Perfect grammar and spelling"
        style: "Consistent professional style"
        engagement: "Highly engaging"
        accuracy: "Factually accurate"
      thresholds:
        minimum: 0.9
      onFailure: retry
      retryLimit: 5
      requireImprovement: true

output:
  finalContent: ${polish-content.output}
  quality:
    outline: ${generate-outline.scoring.score}
    content: ${generate-content.scoring.score}
    polished: ${polish-content.scoring.score}
  totalAttempts: ${generate-outline.scoring.attempts + generate-content.scoring.attempts + polish-content.scoring.attempts}

Failure Handling Strategies

Continue on Low Quality

scoring:
  onFailure: continue  # Log but don't fail workflow

Abort on Low Quality

scoring:
  onFailure: abort  # Stop execution immediately

Retry with Backoff

scoring:
  onFailure: retry
  retryLimit: 5
  backoffStrategy: exponential  # 1s, 2s, 4s, 8s, 16s
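
With an exponential strategy, the wait before each retry roughly doubles. The helper below is an illustrative sketch of that schedule, assuming a 1-second base delay.

// Illustrative backoff schedule: 1s, 2s, 4s, 8s, 16s for attempts 1-5.
function backoffDelayMs(attempt: number, baseMs = 1000): number {
  return baseMs * 2 ** (attempt - 1);
}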

Cost Optimization

Fast Pre-Check with Rules

flow:
  # Fast rule-based pre-check
  - member: generate-content
    scoring:
      evaluatorConfig:
        type: rule
      criteria:
        minLength:
          rule: "output.text.length >= 100"
          weight: 1.0
      thresholds:
        minimum: 1.0
      onFailure: retry
      retryLimit: 2

  # Only run expensive AI judge if rules pass
  - member: final-quality-check
    input:
      content: ${generate-content.output}
    scoring:
      evaluatorConfig:
        type: judge
        model: gpt-4o
      thresholds:
        minimum: 0.85
      onFailure: retry
      retryLimit: 3

Cache Evaluations

- member: generate-content
  cache:
    ttl: 3600  # Cache generated content for one hour
  scoring:
    # Scoring results are also cached
    evaluator: validate

Adjust Temperature

config:
  temperature: 0.3  # Lower = more deterministic = fewer retries needed

Testing Scoring

import { describe, it, expect } from 'vitest';
import { TestConductor } from '@ensemble-edge/conductor/testing';

describe('scoring and retry', () => {
  it('should pass on high quality', async () => {
    const conductor = await TestConductor.create({
      mocks: {
        ai: {
          responses: {
            'generate-content': {
              text: 'High quality content',
              score: 0.95
            }
          }
        }
      }
    });

    const result = await conductor.executeEnsemble('generate-quality-content', {
      topic: 'AI'
    });

    expect(result).toBeSuccessful();
    expect(result.output.quality.score).toBeGreaterThan(0.8);
    expect(result.output.quality.passed).toBe(true);
    expect(result.output.quality.attempts).toBe(1);
  });

  it('should retry on low quality', async () => {
    let attempts = 0;

    const conductor = await TestConductor.create({
      mocks: {
        ai: {
          handler: async () => {
            attempts++;
            return {
              text: 'Content',
              score: attempts === 1 ? 0.6 : 0.85  // Low then high
            };
          }
        }
      }
    });

    const result = await conductor.executeEnsemble('generate-quality-content', {
      topic: 'AI'
    });

    expect(result).toBeSuccessful();
    expect(result.output.quality.attempts).toBe(2);
    expect(result.output.quality.passed).toBe(true);
  });

  it('should require progressive improvement', async () => {
    let attempts = 0;

    const conductor = await TestConductor.create({
      mocks: {
        ai: {
          handler: async () => {
            attempts++;
            // Scores: 0.7, 0.75, 0.73 (rejected), 0.80, 0.87
            const scores = [0.7, 0.75, 0.73, 0.80, 0.87];
            return {
              text: 'Content',
              score: scores[attempts - 1]
            };
          }
        }
      }
    });

    const result = await conductor.executeEnsemble('progressive-improvement', {
      topic: 'AI'
    });

    expect(result).toBeSuccessful();
    expect(result.output.finalScore).toBeGreaterThanOrEqual(0.85);
    expect(result.output.attempts).toBeLessThanOrEqual(5);
  });
});

Monitoring Scoring Metrics

const result = await executor.executeEnsemble('generate-quality-content', input);

// Track scoring effectiveness
console.log('Quality Metrics:', {
  score: result.output.quality.score,
  passed: result.output.quality.passed,
  attempts: result.output.quality.attempts,
  avgScore: result.metrics?.scoringMetrics?.avgScore,
  successRate: result.metrics?.scoringMetrics?.successRate,
  costPerSuccess: result.metrics?.scoringMetrics?.avgCost
});

// Alert if quality is consistently low
if (result.output.quality.attempts > 3) {
  console.warn('High retry count - consider adjusting thresholds or prompts');
}

Best Practices

  1. Start lenient, tighten gradually - Begin with a 0.6-0.7 threshold and raise it based on data
  2. Use multiple evaluators - Run fast rule checks first, then an AI judge
  3. Weight criteria appropriately - Give critical criteria higher weights
  4. Set reasonable retry limits - 3-5 retries balance quality and cost
  5. Monitor metrics - Track success rates, attempts, and costs
  6. Cache evaluations - Reduce redundant AI judge calls
  7. Require improvement - Each retry should score higher than the last
  8. Test thoroughly - Verify scoring behaves as expected