Overview

This guide provides practical patterns for implementing quality scoring in your workflows: when to use each evaluator type, how to set appropriate thresholds, and how to optimize for cost and performance. See Scoring Concept for the foundational concepts.

When to Use Scoring

✅ Good Use Cases

  • AI-generated content - Validate quality, tone, accuracy
  • Data extraction - Ensure all required fields present
  • Content moderation - Check for inappropriate content
  • Translation quality - Validate translation accuracy
  • Summarization - Verify completeness and clarity

❌ Not Needed

  • Simple calculations - Deterministic operations don’t need scoring
  • Data retrieval - Database queries are either successful or not
  • API calls - HTTP status codes indicate success/failure
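
For example, in a flow that mixes deterministic and AI-generated steps, scoring usually belongs only on the AI step. The sketch below is illustrative: the member names (fetch-record, summarize-record) are placeholders, and the keys mirror the examples later in this guide.

flow:
  # Deterministic database lookup - no scoring needed
  - member: fetch-record

  # AI-generated summary - validate before continuing
  - member: summarize-record
    input:
      record: ${fetch-record.output}
    scoring:
      evaluator: validate
      evaluatorConfig:
        type: rule
      criteria:
        hasSummary:
          rule: "output.summary != null && output.summary.length >= 50"
      thresholds:
        minimum: 1.0
      onFailure: retry
      retryLimit: 2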

Choosing an Evaluator

Rule-Based Evaluator

When to use:
  • Fast validation needed (< 1ms)
  • Clear, objective criteria
  • Structured data with known schema
  • Cost is a concern
Example:
scoring:
  evaluator: validate
  evaluatorConfig:
    type: rule
  criteria:
    hasTitle:
      rule: "output.title != null && output.title.length > 0"
      weight: 0.3
    hasBody:
      rule: "output.body.length >= 100"
      weight: 0.4
    hasAuthor:
      rule: "output.author != null"
      weight: 0.3

LLM Judge Evaluator

When to use:
  • Subjective quality assessment
  • Natural language evaluation
  • Complex multi-dimensional criteria
  • Human-like judgment needed
Example:
scoring:
  evaluator: validate
  evaluatorConfig:
    type: judge
    model: gpt-4o
    systemPrompt: |
      You are an expert content evaluator.
      Score content on accuracy, clarity, and engagement.
  criteria:
    accuracy: "Content is factually accurate"
    clarity: "Writing is clear and easy to understand"
    engagement: "Content maintains reader interest"

NLP Evaluator

When to use:
  • Sentiment analysis
  • Readability scoring
  • Keyword presence checking
  • Language-specific metrics
Example:
scoring:
  evaluator: validate
  evaluatorConfig:
    type: nlp
  criteria:
    sentiment:
      target: "positive"
      weight: 0.4
    readability:
      target: "easy"
      weight: 0.3
    keywords:
      required: ["innovation", "growth"]
      weight: 0.3

Embedding Similarity

When to use:
  • Style matching
  • Consistency checking
  • Example-based validation
  • Semantic similarity
Example:
scoring:
  evaluator: validate
  evaluatorConfig:
    type: embedding
    referenceExamples:
      - "High-quality example 1"
      - "High-quality example 2"
    similarityThreshold: 0.8

Setting Thresholds

Start Lenient, Tighten Gradually

# Week 1: Observe
thresholds:
  minimum: 0.6

# Week 2: After collecting data
thresholds:
  minimum: 0.7

# Week 3: Final target
thresholds:
  minimum: 0.8

Different Thresholds by Importance

# Critical content - strict
- member: generate-legal-doc
  scoring:
    thresholds:
      minimum: 0.95

# Marketing content - moderate
- member: generate-blog-post
  scoring:
    thresholds:
      minimum: 0.8

# Internal notes - lenient
- member: generate-summary
  scoring:
    thresholds:
      minimum: 0.6

Weighting Criteria

Equal Weights

criteria:
  accuracy: "Must be accurate"
  clarity: "Must be clear"
  completeness: "Must be complete"
# All get 0.33 weight automatically

Custom Weights

criteria:
  critical_field:
    rule: "output.required != null"
    weight: 0.6  # 60% of score

  nice_to_have:
    rule: "output.optional != null"
    weight: 0.2  # 20% of score

  formatting:
    rule: "output.format === 'json'"
    weight: 0.2  # 20% of score
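
Assuming the overall score is the weighted sum of the individual criterion results (as the percentage comments above suggest), a quick calculation helps confirm the weights match your priorities: if critical_field passes (1.0), nice_to_have fails (0.0), and formatting passes (1.0), the score is 0.6 × 1.0 + 0.2 × 0.0 + 0.2 × 1.0 = 0.8, so missing only the optional field still clears a 0.8 threshold, while missing the critical field alone (0.2 + 0.2 = 0.4) does not.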

Retry Strategies

Exponential Backoff

scoring:
  onFailure: retry
  retryLimit: 5
  backoffStrategy: exponential
# Delays: 1s, 2s, 4s, 8s, 16s
Best for:
  • AI provider rate limits
  • Non-deterministic failures
  • Production workloads

Linear Backoff

scoring:
  onFailure: retry
  retryLimit: 3
  backoffStrategy: linear
# Delays: 2s, 4s, 6s
Best for:
  • Predictable retry timing
  • Simple scenarios

No Backoff (Fixed)

scoring:
  onFailure: retry
  retryLimit: 3
  backoffStrategy: fixed
# Delays: 5s, 5s, 5s
Best for:
  • Testing
  • Simple scenarios where a constant retry delay is fine

Progressive Improvement

Require each retry to score higher:
scoring:
  requireImprovement: true
  minImprovement: 0.05  # Each retry must improve the score by at least 0.05
  thresholds:
    minimum: 0.8
  retryLimit: 5
Example:
Attempt 1: 0.65 ❌ Below threshold, retry
Attempt 2: 0.70 ✅ Improved, still below threshold, retry
Attempt 3: 0.68 ❌ No improvement, attempt rejected, retry
Attempt 4: 0.75 ✅ Improved, retry
Attempt 5: 0.82 ✅ Passed!

Failure Handling

Retry (Default)

scoring:
  onFailure: retry  # Keep trying until pass or max retries
  retryLimit: 3

Continue

scoring:
  onFailure: continue  # Log failure but continue workflow
Use for non-critical quality checks.

Abort

scoring:
  onFailure: abort  # Stop execution immediately
Use for critical quality requirements.

Layered Validation

Fast checks first, expensive checks later:
flow:
  # Layer 1: Fast rule check (free, instant)
  - member: generate-content
    scoring:
      evaluatorConfig:
        type: rule
      criteria:
        minLength:
          rule: "output.text.length >= 50"
      thresholds:
        minimum: 1.0
      onFailure: retry
      retryLimit: 2

  # Layer 2: AI judge (expensive, ~1s)
  - member: validate-quality
    input:
      content: ${generate-content.output}
    scoring:
      evaluatorConfig:
        type: judge
        model: gpt-4o
      thresholds:
        minimum: 0.85
      onFailure: retry
      retryLimit: 3

Cost Optimization

Use Cheaper Models for Judging

# ✅ Good - use mini for scoring
scoring:
  evaluatorConfig:
    type: judge
    model: gpt-4o-mini  # 97% cheaper

# ❌ Expensive - flagship for scoring
scoring:
  evaluatorConfig:
    type: judge
    model: gpt-4o

Cache Evaluations

- member: generate-content
  cache:
    ttl: 3600  # Cache content and scores
  scoring:
    evaluator: validate

Lower Temperature for Consistency

config:
  temperature: 0.2  # More deterministic = better cache hit rate

Set Reasonable Retry Limits

# ✅ Good - balanced
retryLimit: 3  # 3 to 5 attempts is usually enough

# ❌ Too many - expensive
retryLimit: 20

Real-World Patterns

Blog Post Generation

- member: generate-blog-post
  scoring:
    evaluatorConfig:
      type: judge
      model: claude-3-5-sonnet-20241022
    criteria:
      engagement: "Highly engaging and interesting"
      structure: "Well-structured with clear flow"
      accuracy: "Factually accurate"
      seo: "SEO-optimized with keywords"
    thresholds:
      minimum: 0.85
    onFailure: retry
    retryLimit: 5
    requireImprovement: true

Data Extraction

- member: extract-company-info
  scoring:
    evaluatorConfig:
      type: rule
    criteria:
      hasName:
        rule: "output.name != null"
        weight: 0.3
      hasIndustry:
        rule: "output.industry != null"
        weight: 0.2
      hasEmployees:
        rule: "output.employees > 0"
        weight: 0.2
      hasLocation:
        rule: "output.location != null"
        weight: 0.3
    thresholds:
      minimum: 0.8
    onFailure: retry
    retryLimit: 3

Customer Response

- member: generate-response
  scoring:
    evaluatorConfig:
      type: nlp
    criteria:
      sentiment:
        target: "positive"
        weight: 0.4
      tone:
        target: "professional"
        weight: 0.3
      readability:
        target: "easy"
        weight: 0.3
    thresholds:
      minimum: 0.8
    onFailure: retry
    retryLimit: 3

Monitoring and Debugging

Track Scoring Metrics

const result = await executor.executeEnsemble(ensemble, input);

console.log('Scoring Metrics:', {
  score: result.output.quality.score,
  passed: result.output.quality.passed,
  attempts: result.output.quality.attempts,
  feedback: result.output.quality.feedback,
  breakdown: result.output.quality.breakdown
});

// Alert if retry counts run consistently high
if (result.output.quality.attempts > 3) {
  console.warn('High retry count - review thresholds or prompts');
}

Log Scoring Details

output:
  result: ${generate-content.output}
  quality:
    score: ${generate-content.scoring.score}
    passed: ${generate-content.scoring.passed}
    attempts: ${generate-content.scoring.attempts}
    breakdown: ${generate-content.scoring.breakdown}
    feedback: ${generate-content.scoring.feedback}

Testing Scoring

import { describe, it, expect } from 'vitest';
import { TestConductor } from '@ensemble-edge/conductor/testing';

describe('scoring validation', () => {
  // Placeholder input for the example ensemble under test
  const input = { topic: 'Quality scoring' };

  it('should pass high-quality content', async () => {
    const conductor = await TestConductor.create({
      mocks: {
        ai: {
          responses: {
            'generate-content': {
              text: 'High-quality content',
              score: 0.95
            }
          }
        }
      }
    });

    const result = await conductor.executeEnsemble('with-scoring', input);

    expect(result).toBeSuccessful();
    expect(result.output.quality.score).toBeGreaterThan(0.8);
    expect(result.output.quality.attempts).toBe(1);
  });

  it('should retry low-quality content', async () => {
    let attempts = 0;

    const conductor = await TestConductor.create({
      mocks: {
        ai: {
          handler: async () => {
            attempts++;
            return {
              text: 'Content',
              score: attempts === 1 ? 0.6 : 0.9
            };
          }
        }
      }
    });

    const result = await conductor.executeEnsemble('with-scoring', input);

    expect(result).toBeSuccessful();
    expect(result.output.quality.attempts).toBe(2);
  });
});

Best Practices

  1. Start with rule-based - Fast and free for objective criteria
  2. Use AI judges sparingly - Only when subjective assessment needed
  3. Set lenient initial thresholds - Tighten based on data
  4. Require improvement - Each retry should score higher
  5. Layer validations - Fast checks first, expensive later
  6. Cache aggressively - Reduce redundant evaluations
  7. Monitor metrics - Track success rates and costs
  8. Test thoroughly - Verify scoring works as expected
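
Putting several of these practices together, a minimal sketch of a cost-conscious pipeline might look like the following. The member names (draft-article, review-article) are placeholders and the exact values are illustrative; the keys mirror the examples earlier in this guide.

flow:
  # Fast, free rule check on the raw output, with caching
  - member: draft-article
    cache:
      ttl: 3600
    scoring:
      evaluatorConfig:
        type: rule
      criteria:
        minLength:
          rule: "output.text.length >= 200"
      thresholds:
        minimum: 1.0
      onFailure: retry
      retryLimit: 2

  # Cheaper judge model, lenient starting threshold, improvement required on retries
  - member: review-article
    input:
      content: ${draft-article.output}
    scoring:
      evaluatorConfig:
        type: judge
        model: gpt-4o-mini
      criteria:
        accuracy: "Content is factually accurate"
        clarity: "Writing is clear and easy to understand"
      thresholds:
        minimum: 0.7
      onFailure: retry
      retryLimit: 3
      requireImprovement: true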