## Overview
Quality scoring validates generated output against your criteria and automatically retries until the result meets your standards. It is a good fit for AI-generated content that needs consistency and accuracy.

## Basic Scoring with Retry
```yaml
name: generate-quality-content
description: Generate content with automatic quality retry

flow:
  - member: generate-content
    type: Think
    config:
      provider: openai
      model: gpt-4o
      routing: cloudflare-gateway
      temperature: 0.7
    input:
      topic: ${input.topic}
    scoring:
      evaluator: validate
      evaluatorConfig:
        type: judge
        model: gpt-4o
        criteria:
          accuracy: "Content is factually accurate"
          completeness: "All key points are covered"
          clarity: "Writing is clear and easy to understand"
          grammar: "Perfect grammar and spelling"
      thresholds:
        minimum: 0.8
      onFailure: retry
      retryLimit: 5
      requireImprovement: true  # Each retry must score higher

output:
  content: ${generate-content.output.text}
  quality:
    score: ${generate-content.scoring.score}
    passed: ${generate-content.scoring.passed}
    attempts: ${generate-content.scoring.attempts}
    feedback: ${generate-content.scoring.feedback}
```
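Once deployed, you can invoke the ensemble and read the scoring fields it exposes. A minimal sketch, assuming the `executor` handle used in the Monitoring section below; the topic value is illustrative:

```typescript
// Execute the ensemble and inspect the quality fields from its output block.
const result = await executor.executeEnsemble('generate-quality-content', {
  topic: 'Edge computing trends' // illustrative input
});

console.log(result.output.quality.score);    // final score, e.g. 0.91
console.log(result.output.quality.attempts); // how many tries it took to pass
```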
## Progressive Improvement

Each retry must score higher than the previous:

```yaml
name: progressive-improvement
description: Content quality improves with each retry

flow:
  - member: generate-article
    type: Think
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
    scoring:
      evaluator: validate
      thresholds:
        minimum: 0.85
      requireImprovement: true
      minImprovement: 0.05  # Each retry must improve the score by at least 0.05
      onFailure: retry
      retryLimit: 5
      # Example progression:
      # Attempt 1: 0.70 ❌ (below threshold, retry)
      # Attempt 2: 0.75 ✅ (improved, but still below threshold, retry)
      # Attempt 3: 0.73 ❌ (score decreased, attempt rejected, retry)
      # Attempt 4: 0.80 ✅ (improved, but still below threshold, retry)
      # Attempt 5: 0.87 ✅ (improved and passed!)

output:
  article: ${generate-article.output}
  finalScore: ${generate-article.scoring.score}
  attempts: ${generate-article.scoring.attempts}
  improvement: ${generate-article.scoring.score - generate-article.scoring.initialScore}
```
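The acceptance rule this configuration implies can be sketched as follows. This is an illustration of the documented behavior, not Conductor's internals:

```typescript
// A retry counts as an improvement only if it beats the best previous
// score by at least minImprovement; otherwise it is rejected, as in
// attempt 3 of the progression above.
function acceptsAttempt(
  score: number,
  bestSoFar: number | undefined,
  minImprovement: number
): boolean {
  if (bestSoFar === undefined) return true; // first attempt sets the baseline
  return score >= bestSoFar + minImprovement;
}

acceptsAttempt(0.75, 0.70, 0.05); // true  (attempt 2: improved)
acceptsAttempt(0.73, 0.75, 0.05); // false (attempt 3: decreased, rejected)
```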
## Rule-Based Scoring

Fast, deterministic validation:

```yaml
name: validate-structured-output
description: Rule-based validation for structured data

flow:
  - member: extract-company-data
    type: Think
    config:
      provider: openai
      model: gpt-4o
      responseFormat:
        type: json_object
    scoring:
      evaluator: validate
      evaluatorConfig:
        type: rule
        criteria:
          hasName:
            rule: "output.name != null && output.name.length > 0"
            weight: 0.3
          hasIndustry:
            rule: "output.industry != null"
            weight: 0.2
          hasEmployees:
            rule: "output.employees > 0"
            weight: 0.2
          hasFounded:
            rule: "output.founded >= 1800 && output.founded <= 2024"
            weight: 0.3
      thresholds:
        minimum: 0.8
      onFailure: retry
      retryLimit: 3

output:
  companyData: ${extract-company-data.output}
  validationScore: ${extract-company-data.scoring.score}
  breakdown: ${extract-company-data.scoring.breakdown}
```
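With a rule evaluator, the overall score is presumably the sum of the weights of the rules that pass, which is why the weights above sum to 1.0. A sketch of that arithmetic (the aggregation shown is an assumption inferred from the weights, not the library's source):

```typescript
// Each passing rule contributes its weight; failing rules contribute 0.
type RuleResult = { weight: number; passed: boolean };

const weightedScore = (results: RuleResult[]): number =>
  results.reduce((sum, r) => sum + (r.passed ? r.weight : 0), 0);

// Failing only hasIndustry (weight 0.2) still yields
// 0.3 + 0.2 + 0.3 = 0.8, exactly meeting the 0.8 minimum.
weightedScore([
  { weight: 0.3, passed: true },  // hasName
  { weight: 0.2, passed: false }, // hasIndustry
  { weight: 0.2, passed: true },  // hasEmployees
  { weight: 0.3, passed: true }   // hasFounded
]); // 0.8
```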
## LLM Judge Scoring

AI-powered quality assessment:

```yaml
name: content-quality-judge
description: AI evaluates content quality

flow:
  - member: write-blog-post
    type: Think
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
    scoring:
      evaluator: validate
      evaluatorConfig:
        type: judge
        model: claude-3-5-sonnet-20241022
        systemPrompt: |
          You are an expert content quality evaluator.
          Evaluate blog posts on multiple dimensions.
          Be strict but fair.
        criteria:
          engagement: "Content is engaging and maintains reader interest"
          structure: "Clear structure with logical flow"
          depth: "Sufficient depth and detail on the topic"
          originality: "Original insights, not generic content"
          actionability: "Provides actionable takeaways"
      thresholds:
        minimum: 0.85
      onFailure: retry
      retryLimit: 5
      requireImprovement: true

output:
  blogPost: ${write-blog-post.output}
  quality: ${write-blog-post.scoring}
```
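The judge scores each criterion and rolls the results into a single value compared against `thresholds.minimum`. The shape below is an assumed illustration (per-criterion scores averaged), not the documented response format:

```typescript
// Assumed judge result shape: each criterion scored 0-1, overall = mean.
interface JudgeResult {
  criteria: Record<string, number>;
  score: number;    // aggregate checked against thresholds.minimum
  feedback: string; // surfaced as ${member.scoring.feedback}
}

const sample: JudgeResult = {
  criteria: {
    engagement: 0.90,
    structure: 0.85,
    depth: 0.80,
    originality: 0.82,
    actionability: 0.88
  },
  score: 0.85, // mean of the five criterion scores above
  feedback: 'Strong structure; deepen the examples in section two.'
};
```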
## NLP-Based Scoring

Natural language metrics:

```yaml
name: sentiment-and-readability
description: Validate sentiment and readability

flow:
  - member: generate-customer-response
    type: Think
    config:
      provider: openai
      model: gpt-4o-mini
    scoring:
      evaluator: validate
      evaluatorConfig:
        type: nlp
        criteria:
          sentiment:
            target: "positive"
            weight: 0.4
          readability:
            target: "easy"
            weight: 0.3
          tone:
            target: "professional"
            weight: 0.3
      thresholds:
        minimum: 0.8
      onFailure: retry
      retryLimit: 3

output:
  response: ${generate-customer-response.output}
  sentimentScore: ${generate-customer-response.scoring.breakdown.sentiment}
  readabilityScore: ${generate-customer-response.scoring.breakdown.readability}
```
## Embedding Similarity

Compare against reference examples:

```yaml
name: style-matching
description: Match reference style

flow:
  - member: generate-description
    type: Think
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
    scoring:
      evaluator: validate
      evaluatorConfig:
        type: embedding
        referenceExamples:
          - "Acme Corp revolutionizes enterprise software with AI-powered automation that reduces manual work by 80%. Founded in 2020, they serve over 500 Fortune 1000 companies."
          - "TechStart delivers cutting-edge cloud solutions for modern businesses. Their platform processes 10 billion transactions daily across 50 countries."
        similarityThreshold: 0.75
      thresholds:
        minimum: 0.8
      onFailure: retry
      retryLimit: 4

output:
  description: ${generate-description.output}
  similarityScore: ${generate-description.scoring.score}
```
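Conceptually, the embedding evaluator embeds the generated text, compares it with each reference example, and keeps the best match. A minimal cosine-similarity sketch with placeholder vectors (real embeddings come from an embedding model):

```typescript
// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Tiny stand-in vectors; production embeddings have hundreds of dimensions.
const outputEmbedding = [0.12, 0.87, 0.33];
const referenceEmbeddings = [
  [0.10, 0.90, 0.30],
  [0.50, 0.20, 0.70]
];

// Score against the closest reference, then compare to similarityThreshold.
const similarity = Math.max(
  ...referenceEmbeddings.map(ref => cosineSimilarity(outputEmbedding, ref))
);
```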
## Multi-Stage Validation

Different quality checks at each stage:

```yaml
name: multi-stage-content-creation
description: Progressive quality gates

flow:
  # Stage 1: Generate outline (lenient)
  - member: generate-outline
    type: Think
    scoring:
      evaluator: validate
      evaluatorConfig:
        type: rule
        criteria:
          hasSections:
            rule: "output.sections.length >= 3"
            weight: 0.5
          hasIntro:
            rule: "output.sections.includes('Introduction')"
            weight: 0.25
          hasConclusion:
            rule: "output.sections.includes('Conclusion')"
            weight: 0.25
      thresholds:
        minimum: 0.7
      onFailure: retry
      retryLimit: 3

  # Stage 2: Generate content (moderate)
  - member: generate-content
    type: Think
    input:
      outline: ${generate-outline.output}
    scoring:
      evaluator: validate
      evaluatorConfig:
        type: judge
        model: gpt-4o
        criteria:
          completeness: "All outline sections are developed"
          coherence: "Content flows logically"
          quality: "Writing is professional"
      thresholds:
        minimum: 0.8
      onFailure: retry
      retryLimit: 5
      requireImprovement: true

  # Stage 3: Polish content (strict)
  - member: polish-content
    type: Think
    input:
      content: ${generate-content.output}
    scoring:
      evaluator: validate
      evaluatorConfig:
        type: judge
        model: claude-3-5-sonnet-20241022
        criteria:
          grammar: "Perfect grammar and spelling"
          style: "Consistent professional style"
          engagement: "Highly engaging"
          accuracy: "Factually accurate"
      thresholds:
        minimum: 0.9
      onFailure: retry
      retryLimit: 5
      requireImprovement: true

output:
  finalContent: ${polish-content.output}
  quality:
    outline: ${generate-outline.scoring.score}
    content: ${generate-content.scoring.score}
    polished: ${polish-content.scoring.score}
  totalAttempts: ${generate-outline.scoring.attempts + generate-content.scoring.attempts + polish-content.scoring.attempts}
```
## Failure Handling Strategies
### Continue on Low Quality

```yaml
scoring:
  onFailure: continue  # Log but don't fail the workflow
```
### Abort on Low Quality

```yaml
scoring:
  onFailure: abort  # Stop execution immediately
```
### Retry with Backoff

```yaml
scoring:
  onFailure: retry
  retryLimit: 5
  backoffStrategy: exponential  # 1s, 2s, 4s, 8s, 16s
```
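The schedule in the comment works out to a simple doubling from a 1-second base:

```typescript
// Delay before retry N: 1s, 2s, 4s, 8s, 16s for attempts 1-5.
const backoffMs = (attempt: number) => 1000 * 2 ** (attempt - 1);
```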
## Cost Optimization
### Fast Pre-Check with Rules

```yaml
flow:
  # Fast rule-based pre-check
  - member: generate-content
    scoring:
      evaluatorConfig:
        type: rule
        criteria:
          minLength:
            rule: "output.text.length >= 100"
            weight: 1.0
      thresholds:
        minimum: 1.0
      onFailure: retry
      retryLimit: 2

  # Only run the expensive AI judge if the rules pass
  - member: final-quality-check
    input:
      content: ${generate-content.output}
    scoring:
      evaluatorConfig:
        type: judge
        model: gpt-4o
      thresholds:
        minimum: 0.85
      onFailure: retry
      retryLimit: 3
```
### Cache Evaluations

```yaml
- member: generate-content
  cache:
    ttl: 3600  # Cache generated content for one hour
  scoring:
    # Scoring results are also cached
    evaluator: validate
```
### Adjust Temperature

```yaml
config:
  temperature: 0.3  # Lower = more deterministic output = fewer retries needed
```
## Testing Scoring

```typescript
import { describe, it, expect } from 'vitest';
import { TestConductor } from '@ensemble-edge/conductor/testing';

describe('scoring and retry', () => {
  it('should pass on high quality', async () => {
    const conductor = await TestConductor.create({
      mocks: {
        ai: {
          responses: {
            'generate-content': {
              text: 'High quality content',
              score: 0.95
            }
          }
        }
      }
    });

    const result = await conductor.executeEnsemble('generate-quality-content', {
      topic: 'AI'
    });

    expect(result).toBeSuccessful();
    expect(result.output.quality.score).toBeGreaterThan(0.8);
    expect(result.output.quality.passed).toBe(true);
    expect(result.output.quality.attempts).toBe(1);
  });

  it('should retry on low quality', async () => {
    let attempts = 0;
    const conductor = await TestConductor.create({
      mocks: {
        ai: {
          handler: async () => {
            attempts++;
            return {
              text: 'Content',
              score: attempts === 1 ? 0.6 : 0.85 // low first, then high
            };
          }
        }
      }
    });

    const result = await conductor.executeEnsemble('generate-quality-content', {
      topic: 'AI'
    });

    expect(result).toBeSuccessful();
    expect(result.output.quality.attempts).toBe(2);
    expect(result.output.quality.passed).toBe(true);
  });

  it('should require progressive improvement', async () => {
    let attempts = 0;
    const conductor = await TestConductor.create({
      mocks: {
        ai: {
          handler: async () => {
            attempts++;
            // Scores: 0.70, 0.75, 0.73 (rejected), 0.80, 0.87
            const scores = [0.7, 0.75, 0.73, 0.80, 0.87];
            return {
              text: 'Content',
              score: scores[attempts - 1]
            };
          }
        }
      }
    });

    const result = await conductor.executeEnsemble('progressive-improvement', {
      topic: 'AI'
    });

    expect(result).toBeSuccessful();
    expect(result.output.finalScore).toBeGreaterThanOrEqual(0.85);
    expect(result.output.attempts).toBeLessThanOrEqual(5);
  });
});
```
## Monitoring Scoring Metrics

```typescript
const result = await executor.executeEnsemble('generate-quality-content', input);

// Track scoring effectiveness
console.log('Quality Metrics:', {
  score: result.output.quality.score,
  passed: result.output.quality.passed,
  attempts: result.output.quality.attempts,
  avgScore: result.metrics?.scoringMetrics?.avgScore,
  successRate: result.metrics?.scoringMetrics?.successRate,
  costPerSuccess: result.metrics?.scoringMetrics?.avgCost
});

// Alert if quality is consistently low
if (result.output.quality.attempts > 3) {
  console.warn('High retry count - consider adjusting thresholds or prompts');
}
```
## Best Practices
- **Start lenient, tighten gradually** - Begin with 0.6-0.7 and raise the threshold based on data
- **Use multiple evaluators** - Run fast rules first, then the AI judge
- **Weight criteria appropriately** - Give critical features higher weight
- **Set reasonable retry limits** - 3-5 retries balances quality and cost
- **Monitor metrics** - Track success rates and costs
- **Cache evaluations** - Reduce redundant AI judge calls
- **Require improvement** - Each retry should score higher
- **Test thoroughly** - Verify scoring works as expected

