Overview

The Validate member provides quality scoring with multiple evaluator types (rule-based, LLM judge, NLP, embedding similarity) and automatic retry until quality thresholds are met. This member is the foundation of Conductor’s scoring system, enabling production-grade quality control for AI-generated content. See Scoring Guide for comprehensive patterns and best practices.

Quick Example

name: validate-content
description: Generate and validate content quality

flow:
  - member: generate-content
    type: Think
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
    scoring:
      evaluator: validate
      evaluatorConfig:
        type: judge
        model: gpt-4o-mini
      criteria:
        accuracy: "Content is factually accurate"
        clarity: "Writing is clear and understandable"
        completeness: "All key points are covered"
      thresholds:
        minimum: 0.8
      onFailure: retry
      retryLimit: 5
      requireImprovement: true

output:
  content: ${generate-content.output.text}
  quality: ${generate-content.scoring}

Evaluator Types

Rule-Based Evaluator

Fast, deterministic validation for structured data:
scoring:
  evaluator: validate
  evaluatorConfig:
    type: rule
  criteria:
    hasTitle:
      rule: "output.title != null && output.title.length > 0"
      weight: 0.3
    hasBody:
      rule: "output.body != null && output.body.length >= 100"
      weight: 0.4
    hasAuthor:
      rule: "output.author != null"
      weight: 0.3
  thresholds:
    minimum: 0.8
When to use:
  • Structured data validation
  • Fast validation (< 1ms)
  • Objective criteria
  • Zero cost
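
To make the aggregation concrete, here is a minimal TypeScript sketch of how weighted rule scoring combines per-criterion results into a single 0-1 score (the helper and its types are illustrative, not Conductor's actual internals):

// Illustrative weighted-rule aggregation — not Conductor's actual implementation.
interface RuleCriterion {
  rule: (output: Record<string, unknown>) => boolean;  // predicate over member output
  weight: number;                                      // weights should sum to 1
}

function evaluateRules(
  output: Record<string, unknown>,
  criteria: Record<string, RuleCriterion>
): { score: number; breakdown: Record<string, number> } {
  const breakdown: Record<string, number> = {};
  let score = 0;
  for (const [name, { rule, weight }] of Object.entries(criteria)) {
    const passed = rule(output) ? 1 : 0;
    breakdown[name] = passed;
    score += passed * weight;  // each passing rule contributes its weight
  }
  return { score, breakdown };
}

// With the config above: if hasTitle (0.3) and hasBody (0.4) pass but hasAuthor
// fails, the score is 0.7 — below the 0.8 minimum, so the check fails.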

LLM Judge Evaluator

AI-powered subjective quality assessment:
scoring:
  evaluator: validate
  evaluatorConfig:
    type: judge
    model: claude-3-5-sonnet-20241022
    systemPrompt: |
      You are an expert content evaluator.
      Score content on accuracy, clarity, and engagement.
  criteria:
    accuracy: "Content is factually accurate"
    clarity: "Writing is clear and easy to understand"
    engagement: "Content maintains reader interest"
  thresholds:
    minimum: 0.85
  onFailure: retry
  retryLimit: 5
When to use:
  • Subjective quality assessment
  • Natural language evaluation
  • Complex multi-dimensional criteria
  • Human-like judgment needed
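
Internally, a judge evaluator turns the criteria map into a scoring prompt for the model. The sketch below shows one plausible prompt shape; Conductor's exact prompt format is not documented here, so treat this as an assumption:

// Hypothetical judge prompt builder — the actual prompt Conductor sends may differ.
function buildJudgePrompt(content: string, criteria: Record<string, string>): string {
  const criteriaList = Object.entries(criteria)
    .map(([name, description]) => `- ${name}: ${description}`)
    .join('\n');
  return [
    'Score the content below from 0 to 1 on each criterion.',
    'Return JSON: { "<criterion>": <score>, ..., "feedback": "<string>" }',
    '',
    'Criteria:',
    criteriaList,
    '',
    'Content:',
    content,
  ].join('\n');
}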

NLP Evaluator

Natural language processing metrics:
scoring:
  evaluator: validate
  evaluatorConfig:
    type: nlp
  criteria:
    sentiment:
      target: "positive"
      weight: 0.4
    readability:
      target: "easy"
      weight: 0.3
    tone:
      target: "professional"
      weight: 0.3
  thresholds:
    minimum: 0.8
When to use:
  • Sentiment analysis
  • Readability scoring
  • Tone matching
  • Language-specific metrics
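
As a rough illustration of the kind of metric behind a criterion like readability, here is a simple sentence-length heuristic; Conductor's NLP evaluator may use different, more sophisticated metrics:

// Illustrative readability proxy based on average sentence length — not
// Conductor's actual NLP implementation.
function readabilityScore(text: string): number {
  const sentences = text.split(/[.!?]+/).filter(s => s.trim().length > 0);
  const words = text.split(/\s+/).filter(Boolean);
  const avgWordsPerSentence = words.length / Math.max(sentences.length, 1);
  // Map ~10 words/sentence to 1.0 ("easy") and ~30 to 0.0 ("hard"), clamped.
  return Math.min(1, Math.max(0, (30 - avgWordsPerSentence) / 20));
}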

Embedding Similarity

Semantic similarity comparison:
scoring:
  evaluator: validate
  evaluatorConfig:
    type: embedding
    referenceExamples:
      - "High-quality example 1"
      - "High-quality example 2"
    similarityThreshold: 0.8
  thresholds:
    minimum: 0.8
When to use:
  • Style matching
  • Consistency checking
  • Example-based validation
  • Semantic similarity
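
Conceptually, the output is embedded and compared against each reference example. The sketch below uses cosine similarity and takes the best match; whether Conductor uses the max or the mean over references is an assumption, as is the injected embed function:

// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score output text against reference examples; `embed` is any function that
// returns an embedding vector for a string (e.g. an embeddings API client).
async function embeddingScore(
  output: string,
  referenceExamples: string[],
  embed: (text: string) => Promise<number[]>
): Promise<number> {
  const outputVec = await embed(output);
  const refVecs = await Promise.all(referenceExamples.map(embed));
  return Math.max(...refVecs.map(ref => cosineSimilarity(outputVec, ref)));
}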

Configuration

Scoring Configuration

scoring:
  evaluator: validate           # Use validate member
  evaluatorConfig:
    type: rule | judge | nlp | embedding
    model: string              # For LLM judge
    systemPrompt: string       # For LLM judge
    referenceExamples: array   # For embedding
  criteria:
    criterion1: "description" | { rule, weight }
    criterion2: "description" | { rule, weight }
  thresholds:
    minimum: number            # Required minimum score
  onFailure: retry | continue | abort
  retryLimit: number
  requireImprovement: boolean
  minImprovement: number
  backoffStrategy: exponential | linear | fixed
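
The three backoff strategies differ only in how the delay grows between retries. A sketch of the expected behavior (the 1s base delay is an assumption, not a documented default):

// Illustrative retry delays for each backoffStrategy value.
function retryDelayMs(
  strategy: 'exponential' | 'linear' | 'fixed',
  attempt: number,   // 1-based retry attempt
  baseMs = 1000      // assumed base delay
): number {
  switch (strategy) {
    case 'exponential': return baseMs * 2 ** (attempt - 1); // 1s, 2s, 4s, 8s, ...
    case 'linear':      return baseMs * attempt;            // 1s, 2s, 3s, 4s, ...
    case 'fixed':       return baseMs;                      // 1s every time
  }
}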

Output Format

output:
  score: number                # Overall score (0-1)
  passed: boolean             # Whether thresholds met
  attempts: number            # Number of attempts
  breakdown: object           # Score per criterion
  feedback: string            # Detailed feedback
  improvements: array         # Suggestions for improvement
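
For instance, a run that passed on its second attempt might produce a result like this (all values are illustrative):

score: 0.88
passed: true
attempts: 2
breakdown:
  accuracy: 0.90
  clarity: 0.85
  completeness: 0.89
feedback: "Clear and accurate; the conclusion could cover edge cases in more depth."
improvements:
  - "Expand the discussion of edge cases"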

Common Patterns

Content Generation with Validation

name: generate-quality-content
description: Generate content with automatic quality retry

flow:
  - member: generate-article
    type: Think
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
    scoring:
      evaluator: validate
      evaluatorConfig:
        type: judge
        model: gpt-4o-mini
      criteria:
        engagement: "Highly engaging and interesting"
        structure: "Well-structured with clear flow"
        accuracy: "Factually accurate"
        seo: "SEO-optimized with keywords"
      thresholds:
        minimum: 0.85
      onFailure: retry
      retryLimit: 5
      requireImprovement: true

output:
  article: ${generate-article.output.text}
  quality: ${generate-article.scoring}

Data Extraction Validation

name: extract-and-validate
description: Extract structured data and validate completeness

flow:
  - member: extract-company-data
    type: Think
    config:
      provider: openai
      model: gpt-4o
      responseFormat:
        type: json_object
    scoring:
      evaluator: validate
      evaluatorConfig:
        type: rule
      criteria:
        hasName:
          rule: "output.name != null"
          weight: 0.3
        hasIndustry:
          rule: "output.industry != null"
          weight: 0.2
        hasEmployees:
          rule: "output.employees > 0"
          weight: 0.2
        hasLocation:
          rule: "output.location != null"
          weight: 0.3
      thresholds:
        minimum: 0.8
      onFailure: retry
      retryLimit: 3

output:
  data: ${extract-company-data.output}
  validation: ${extract-company-data.scoring}

Progressive Improvement

name: progressive-quality
description: Each retry must improve on previous

flow:
  - member: generate-content
    type: Think
    scoring:
      evaluator: validate
      evaluatorConfig:
        type: judge
        model: claude-3-5-sonnet-20241022
      thresholds:
        minimum: 0.85
      requireImprovement: true
      minImprovement: 0.05  # Must improve by 5%
      retryLimit: 5

# Example progression (✅ = accepted improvement, ❌ = below threshold or regressed):
# Attempt 1: 0.70 ❌ (below threshold, retry)
# Attempt 2: 0.75 ✅ (improved, still below threshold, retry)
# Attempt 3: 0.73 ❌ (regressed vs 0.75, rejected, retry)
# Attempt 4: 0.80 ✅ (improved, still below threshold, retry)
# Attempt 5: 0.87 ✅ (improved and passed the 0.85 minimum)

output:
  content: ${generate-content.output}
  finalScore: ${generate-content.scoring.score}
  attempts: ${generate-content.scoring.attempts}
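
The acceptance rule implied by that trace compares each attempt against the best score seen so far (assumed semantics; Conductor's actual internals may differ):

// Assumed semantics of requireImprovement / minImprovement — illustrative only.
function isAcceptedAttempt(
  score: number,
  bestSoFar: number,
  minImprovement: number
): boolean {
  return score >= bestSoFar + minImprovement;
}

// Replaying the trace above (minImprovement 0.05):
// isAcceptedAttempt(0.75, 0.70, 0.05) -> true   (new best: 0.75)
// isAcceptedAttempt(0.73, 0.75, 0.05) -> false  (rejected, best stays 0.75)
// isAcceptedAttempt(0.80, 0.75, 0.05) -> true   (new best: 0.80)
// isAcceptedAttempt(0.87, 0.80, 0.05) -> true   (and 0.87 >= 0.85, so it passes)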

Multi-Stage Validation

name: layered-validation
description: Fast checks first, expensive checks later

flow:
  # Layer 1: Fast rule check (free, instant)
  - member: generate-content
    type: Think
    scoring:
      evaluator: validate
      evaluatorConfig:
        type: rule
      criteria:
        minLength:
          rule: "output.text.length >= 100"
          weight: 1.0
      thresholds:
        minimum: 1.0
      onFailure: retry
      retryLimit: 2

  # Layer 2: AI judge (expensive, ~1s)
  - member: validate-quality
    input:
      content: ${generate-content.output}
    scoring:
      evaluator: validate
      evaluatorConfig:
        type: judge
        model: gpt-4o-mini
      criteria:
        quality: "High quality content"
        accuracy: "Factually accurate"
      thresholds:
        minimum: 0.85
      onFailure: retry
      retryLimit: 3

output:
  content: ${generate-content.output.text}
  ruleScore: ${generate-content.scoring.score}
  qualityScore: ${validate-quality.scoring.score}

Failure Handling

Retry (Default)

scoring:
  onFailure: retry  # Keep trying until pass or max retries
  retryLimit: 3

Continue

scoring:
  onFailure: continue  # Log failure but continue workflow
Use for non-critical quality checks.

Abort

scoring:
  onFailure: abort  # Stop execution immediately
Use for critical quality requirements.

Cost Optimization

Use Cheaper Models for Judging

# ✅ Good - use mini for scoring
scoring:
  evaluatorConfig:
    type: judge
    model: gpt-4o-mini  # 97% cheaper

# ❌ Expensive - flagship for scoring
scoring:
  evaluatorConfig:
    type: judge
    model: gpt-4o

Layer Validations

# Fast rule check first
scoring:
  evaluatorConfig:
    type: rule
  onFailure: retry
  retryLimit: 2

# Only run expensive AI judge if rules pass

Set Reasonable Retry Limits

# ✅ Good - balanced
retryLimit: 3  # 3-5 balances quality and cost

# ❌ Too many - expensive
retryLimit: 20

Testing

import { describe, it, expect } from 'vitest';
import { TestConductor } from '@ensemble-edge/conductor/testing';

describe('validate member', () => {
  it('should pass high-quality content', async () => {
    const conductor = await TestConductor.create({
      mocks: {
        ai: {
          responses: {
            'generate-article': {
              text: 'High-quality content',
              score: 0.95
            }
          }
        }
      }
    });

    const result = await conductor.executeEnsemble('generate-quality-content', {
      topic: 'AI'
    });

    expect(result).toBeSuccessful();
    expect(result.output.quality.score).toBeGreaterThan(0.8);
    expect(result.output.quality.passed).toBe(true);
    expect(result.output.quality.attempts).toBe(1);
  });

  it('should retry low-quality content', async () => {
    let attempts = 0;

    const conductor = await TestConductor.create({
      mocks: {
        ai: {
          handler: async () => {
            attempts++;
            return {
              text: 'Content',
              score: attempts === 1 ? 0.6 : 0.9  // Low then high
            };
          }
        }
      }
    });

    const result = await conductor.executeEnsemble('generate-quality-content', {
      topic: 'AI'
    });

    expect(result).toBeSuccessful();
    expect(result.output.quality.attempts).toBe(2);
    expect(result.output.quality.passed).toBe(true);
  });

  it('should require progressive improvement', async () => {
    let attempts = 0;

    const conductor = await TestConductor.create({
      mocks: {
        ai: {
          handler: async () => {
            attempts++;
            // Scores: 0.7, 0.75, 0.73 (rejected), 0.80, 0.87
            const scores = [0.7, 0.75, 0.73, 0.80, 0.87];
            return {
              text: 'Content',
              score: scores[attempts - 1]
            };
          }
        }
      }
    });

    const result = await conductor.executeEnsemble('progressive-quality', {
      topic: 'AI'
    });

    expect(result).toBeSuccessful();
    expect(result.output.finalScore).toBeGreaterThanOrEqual(0.85);
    expect(result.output.attempts).toBeLessThanOrEqual(5);
  });
});

Best Practices

  1. Start lenient, tighten gradually - Begin with 0.6-0.7, increase based on data
  2. Use multiple evaluators - Layer fast rules, then AI judge
  3. Weight criteria appropriately - Critical features get higher weight
  4. Set reasonable retry limits - 3-5 retries balances quality and cost
  5. Require improvement - Each retry should score higher
  6. Monitor metrics - Track success rates and costs
  7. Cache evaluations - Reduce redundant AI judge calls (see the sketch after this list)
  8. Test thoroughly - Verify scoring works as expected
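
On caching (practice 7): judge calls dominate scoring cost, so caching evaluations keyed by content avoids re-scoring identical output. A minimal in-memory sketch, where judgeScore stands in for whatever function calls the AI judge:

// Illustrative content-keyed evaluation cache — not a built-in Conductor feature.
import { createHash } from 'node:crypto';

const evaluationCache = new Map<string, number>();

async function cachedScore(
  content: string,
  judgeScore: (content: string) => Promise<number>
): Promise<number> {
  const key = createHash('sha256').update(content).digest('hex');
  const cached = evaluationCache.get(key);
  if (cached !== undefined) return cached; // skip the AI judge call entirely
  const score = await judgeScore(content);
  evaluationCache.set(key, score);
  return score;
}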