Overview
This guide provides practical patterns for implementing quality scoring in your workflows. Learn when to use each evaluator type, how to set appropriate thresholds, and how to optimize for cost and performance. See Scoring Concept for foundational concepts.
When to Use Scoring
✅ Good Use Cases
- AI-generated content - Validate quality, tone, accuracy
- Data extraction - Ensure all required fields present
- Content moderation - Check for inappropriate content
- Translation quality - Validate translation accuracy
- Summarization - Verify completeness and clarity
❌ Not Needed
- Simple calculations - Deterministic operations don’t need scoring
- Data retrieval - Database queries are either successful or not
- API calls - HTTP status codes indicate success/failure
Choosing an Evaluator
Rule-Based Evaluator
When to use:
- Fast validation needed (< 1ms)
- Clear, objective criteria
- Structured data with known schema
- Cost is a concern
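As a rough sketch (the names and field shapes below are illustrative, not a specific platform API), a rule-based evaluator boils down to a plain function that checks objective criteria on structured output:

```ts
// Hypothetical rule-based evaluator: fast, deterministic checks on structured output.
interface RuleScore {
  score: number;       // 0..1, fraction of rules that passed
  failures: string[];  // human-readable reasons for any failed rules
}

function scoreBlogPost(output: { title?: string; body?: string; tags?: string[] }): RuleScore {
  const failures: string[] = [];
  if (!output.title) failures.push("missing title");
  const wordCount = output.body ? output.body.split(/\s+/).filter(Boolean).length : 0;
  if (wordCount < 300) failures.push("body shorter than 300 words");
  if (!output.tags || output.tags.length === 0) failures.push("no tags provided");
  const totalRules = 3;
  return { score: (totalRules - failures.length) / totalRules, failures };
}
```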
LLM Judge Evaluator
When to use:
- Subjective quality assessment
- Natural language evaluation
- Complex multi-dimensional criteria
- Human-like judgment needed
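A minimal sketch of what an LLM-judge rubric might look like; the model name, field names, weights, and threshold are assumptions for illustration, not a required shape:

```ts
// Hypothetical LLM-judge configuration: a weighted rubric the judge model scores against.
interface JudgeCriterion { name: string; description: string; weight: number }

const criteria: JudgeCriterion[] = [
  { name: "accuracy", description: "Claims are supported by the source material.", weight: 0.5 },
  { name: "tone", description: "Voice is friendly and professional.", weight: 0.3 },
  { name: "clarity", description: "Writing is easy to follow.", weight: 0.2 },
];

const contentJudge = {
  model: "gpt-4o-mini", // example model name, not a recommendation
  temperature: 0,       // deterministic judging
  criteria,
  threshold: 0.75,      // minimum weighted score to pass
};
```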
NLP Evaluator
When to use:
- Sentiment analysis
- Readability scoring
- Keyword presence checking
- Language-specific metrics
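NLP-style checks are cheap and deterministic; a hypothetical sketch combining a crude readability proxy with keyword presence:

```ts
// Hypothetical NLP-style checks: simple text metrics, no LLM call required.
function nlpScore(text: string, requiredKeywords: string[]): number {
  const words = text.split(/\s+/).filter(Boolean);
  const sentences = text.split(/[.!?]+/).filter(s => s.trim().length > 0);
  const avgSentenceLength = words.length / Math.max(sentences.length, 1);
  const readabilityOk = avgSentenceLength <= 25; // crude readability proxy
  const hits = requiredKeywords.filter(k => text.toLowerCase().includes(k.toLowerCase()));
  const keywordScore = hits.length / Math.max(requiredKeywords.length, 1);
  return 0.5 * (readabilityOk ? 1 : 0) + 0.5 * keywordScore;
}
```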
Embedding Similarity
When to use:
- Style matching
- Consistency checking
- Example-based validation
- Semantic similarity
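Embedding similarity reduces to comparing vectors; a small cosine-similarity sketch (how you produce the vectors, e.g. the `embed()` call in the comment, is assumed and not shown here):

```ts
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Example-based validation: compare new output against a known-good reference.
// const score = cosineSimilarity(embed(newDraft), embed(approvedExample)); // embed() is assumed
```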
Setting Thresholds
Start Lenient, Tighten Gradually
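One way to express this, with purely illustrative numbers:

```ts
// Start lenient, observe real score distributions, then raise the bar.
const scoringConfig = {
  threshold: 0.6,   // week 1: lenient, collect data on actual scores
  // threshold: 0.7 // later: tighten once most runs clear 0.7 comfortably
  // threshold: 0.8 // steady state: only after the failure rate is acceptable
};
```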
Different Thresholds by Importance
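An illustrative per-criterion threshold map, with made-up values:

```ts
// Higher bars where mistakes are costly, lower bars where variance is tolerable.
const thresholds = {
  factualAccuracy: 0.9,  // critical: wrong facts are unacceptable
  tone: 0.7,             // important, but some variance is fine
  styleConsistency: 0.5, // nice to have
};
```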
Weighting Criteria
Equal Weights
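A minimal illustrative shape:

```ts
// Every criterion contributes equally to the overall score.
const equalWeights = { accuracy: 1 / 3, tone: 1 / 3, clarity: 1 / 3 };
```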
Custom Weights
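A sketch of a weighted average; the criterion names and weights are invented for illustration and assume the weights sum to 1:

```ts
// Weighted average favouring the criteria that matter most.
const customWeights = { accuracy: 0.6, tone: 0.25, clarity: 0.15 };

function weightedScore(scores: Record<string, number>, weights: Record<string, number>): number {
  // Assumes weights sum to 1; missing scores count as 0.
  return Object.entries(weights).reduce((sum, [name, w]) => sum + w * (scores[name] ?? 0), 0);
}
```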
Retry Strategies
Exponential Backoff (Recommended)
Use for:
- AI provider rate limits
- Non-deterministic failures
- Production workloads
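An illustrative config shape (not a specific platform's API):

```ts
// Delay doubles with each attempt: 1s, 2s, 4s, ...
const exponentialRetry = {
  maxAttempts: 4,
  delayMs: (attempt: number) => 1000 * 2 ** attempt,
};
```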
Linear Backoff
Use for:
- Predictable retry timing
- Simple scenarios
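Illustrative shape:

```ts
// Delay grows by a fixed step: 1s, 2s, 3s, ...
const linearRetry = {
  maxAttempts: 3,
  delayMs: (attempt: number) => 1000 * (attempt + 1),
};
```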
No Backoff (Fixed)
Use for:
- Testing
- Immediate retries desired
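Illustrative shape:

```ts
// Every retry waits the same short, fixed delay.
const fixedRetry = {
  maxAttempts: 2,
  delayMs: () => 500,
};
```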
Progressive Improvement
Require each retry to score higher than the previous attempt:
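A minimal sketch of such a loop; `generate()` and `score()` stand in for your own generation and evaluation steps:

```ts
// Retry loop that requires each attempt to beat the best score seen so far.
async function retryWithImprovement(
  generate: () => Promise<string>,
  score: (output: string) => Promise<number>,
  threshold = 0.8,
  maxAttempts = 3,
): Promise<{ output: string; score: number }> {
  let best = { output: "", score: -1 }; // scores are 0..1, so the first attempt always records
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const output = await generate();
    const s = await score(output);
    if (s <= best.score) break; // no improvement over the previous best: stop retrying
    best = { output, score: s };
    if (s >= threshold) break;  // good enough: stop retrying
  }
  return best;
}
```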
Failure Handling
Retry (Default)
Continue
Abort
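In most setups the three modes map to a single setting; an illustrative shape:

```ts
type OnFailure = "retry" | "continue" | "abort";

const failureHandling: { onFailure: OnFailure; maxAttempts: number } = {
  // "retry": try again (default); "continue": pass the low-scoring output downstream, flagged;
  // "abort": stop the workflow and surface an error.
  onFailure: "retry",
  maxAttempts: 3,
};
```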
Layered Validation
Run fast checks first and expensive checks later:
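A sketch of the idea; `ruleCheck()` and `llmJudgeScore()` are stand-ins for your own evaluators:

```ts
// Layered validation: only call the expensive LLM judge if the cheap rule check passes.
async function layeredScore(
  output: string,
  ruleCheck: (o: string) => number,
  llmJudgeScore: (o: string) => Promise<number>,
): Promise<number> {
  const ruleScore = ruleCheck(output);    // effectively free, runs in microseconds
  if (ruleScore < 0.5) return ruleScore;  // fail fast, skip the paid judge call
  const judgeScore = await llmJudgeScore(output);
  return Math.min(ruleScore, judgeScore); // overall score limited by the weakest layer
}
```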
Cost Optimization
Use Cheaper Models for Judging
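Judging is usually an easier task than generating, so a smaller model often suffices; the model names below are examples, not recommendations:

```ts
const generation = { model: "gpt-4o" };    // stronger model does the creative work
const judging = { model: "gpt-4o-mini" };  // cheaper model scores the result
```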
Cache Evaluations
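A minimal memoization sketch using Node's built-in crypto module, so identical outputs are never scored twice:

```ts
import { createHash } from "node:crypto";

const evalCache = new Map<string, number>();

async function cachedScore(output: string, score: (o: string) => Promise<number>): Promise<number> {
  const key = createHash("sha256").update(output).digest("hex");
  const hit = evalCache.get(key);
  if (hit !== undefined) return hit;
  const result = await score(output);
  evalCache.set(key, result);
  return result;
}
```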
Lower Temperature for Consistency
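Illustrative judge settings:

```ts
// Temperature 0 keeps repeated judgments of the same output stable,
// which also makes caching more effective.
const judgeSettings = { model: "gpt-4o-mini", temperature: 0 };
```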
Set Reasonable Retry Limits
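Illustrative budget:

```ts
// Each retry is another generation plus another evaluation, so cap attempts;
// if scores never clear the bar, revisit the prompt rather than adding retries.
const retryBudget = { maxAttempts: 3 };
```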
Real-World Patterns
Blog Post Generation
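One possible composition, with every name and number invented for illustration: structural rules first, a subjective judge second, retries on failure.

```ts
const blogPostScoring = {
  evaluators: [
    {
      type: "rules",
      checks: ["has title", "300-1500 words", "has at least one heading"],
      threshold: 1.0, // every structural rule must pass
    },
    {
      type: "llm-judge",
      criteria: { accuracy: 0.5, tone: 0.3, readability: 0.2 },
      threshold: 0.75,
    },
  ],
  onFailure: "retry",
  maxAttempts: 3,
};
```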
Data Extraction
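A sketch of a purely rule-based extraction check; the `Invoice` shape is hypothetical, but the pattern is simply "fraction of required fields present":

```ts
interface Invoice { invoiceNumber?: string; date?: string; total?: number; currency?: string }

function scoreExtraction(record: Invoice): number {
  const required: (keyof Invoice)[] = ["invoiceNumber", "date", "total", "currency"];
  const present = required.filter(field => record[field] !== undefined && record[field] !== "");
  return present.length / required.length; // 1.0 only when every field was extracted
}
```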
Customer Response
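A hypothetical composition; field names and values are illustrative. The key choice is aborting on failure so a low-quality reply is never sent automatically.

```ts
const customerResponseScoring = {
  evaluators: [
    { type: "nlp", checks: ["no blocked phrases", "reading level <= grade 9"] },
    { type: "llm-judge", criteria: { empathy: 0.4, accuracy: 0.4, brevity: 0.2 }, threshold: 0.8 },
  ],
  onFailure: "abort", // route to a human instead of sending a low-scoring reply
};
```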
Monitoring and Debugging
Track Scoring Metrics
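A simple illustrative aggregator, not a built-in feature: collect per-step counts so you can watch pass rate, average score, and retry volume drift over time.

```ts
interface StepMetrics { runs: number; passes: number; totalScore: number; retries: number }

const metrics = new Map<string, StepMetrics>();

function recordResult(step: string, score: number, passed: boolean, attempt: number): void {
  const m = metrics.get(step) ?? { runs: 0, passes: 0, totalScore: 0, retries: 0 };
  m.runs += 1;
  m.passes += passed ? 1 : 0;
  m.totalScore += score;
  m.retries += attempt > 1 ? 1 : 0;
  metrics.set(step, m);
}
```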
Log Scoring Details
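A minimal logging sketch: keep the full evaluation detail (per-criterion scores and failure reasons), not just pass/fail, so low scores are debuggable after the fact.

```ts
function logScoringDetails(
  step: string,
  attempt: number,
  scores: Record<string, number>,
  failures: string[],
): void {
  console.log(JSON.stringify({ step, attempt, scores, failures, at: new Date().toISOString() }));
}
```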
Testing Scoring
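A sketch using Node's built-in test runner; `hasTitle()` stands in for whatever evaluator you are testing against fixtures with known expected outcomes.

```ts
import { test } from "node:test";
import assert from "node:assert";

const hasTitle = (post: { title?: string }) => (post.title ? 1 : 0);

test("scores 0 when the title is missing", () => {
  assert.strictEqual(hasTitle({}), 0);
});

test("scores 1 when the title is present", () => {
  assert.strictEqual(hasTitle({ title: "Quality scoring 101" }), 1);
});
```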
Best Practices
- Start with rule-based - Fast and free for objective criteria
- Use AI judges sparingly - Only when subjective assessment needed
- Set lenient initial thresholds - Tighten based on data
- Require improvement - Each retry should score higher
- Layer validations - Fast checks first, expensive later
- Cache aggressively - Reduce redundant evaluations
- Monitor metrics - Track success rates and costs
- Test thoroughly - Verify scoring works as expected

