# A/B Testing Prompts
Compare prompts systematically. Find what works best. Deploy instantly.
## Overview
A/B testing is a core architectural principle of Ensemble Edge. By combining Edgit’s multiverse versioning with Conductor’s edge-native execution, you can run prompt variants side by side in production and promote winners without rebuilding or redeploying your application.

**Key advantage:** version prompts independently, test combinations drawn from any point in history, and deploy winners globally in under 50ms.
## Basic Prompt A/B Test
Test two prompt versions with a 50/50 traffic split:
```yaml
ensemble: test-extraction-prompts

agents:
  # Route based on user ID for consistent experience
  - name: route-variant
    operation: code
    config:
      script: |
        return {
          variant: input.userId % 2 === 0 ? 'a' : 'b'
        };

  # Variant A: Conservative prompt
  - name: extract-a
    condition: ${route-variant.output.variant === 'a'}
    operation: think
    component: extraction-prompt@v1.0.0  # Versioned prompt
    config:
      model: claude-3-5-sonnet-20241022
      provider: anthropic
    input:
      document: ${input.document}

  # Variant B: Aggressive prompt
  - name: extract-b
    condition: ${route-variant.output.variant === 'b'}
    operation: think
    component: extraction-prompt@v2.0.0  # New version
    config:
      model: claude-3-5-sonnet-20241022
      provider: anthropic
    input:
      document: ${input.document}

  # Log results for analysis
  - name: log-results
    operation: storage
    config:
      type: d1
      query: |
        INSERT INTO ab_test_results
          (user_id, variant, execution_time, quality_score, timestamp)
        VALUES (?, ?, ?, ?, ?)
      params:
        - ${input.userId}
        - ${route-variant.output.variant}
        - ${extract-a.executionTime || extract-b.executionTime}
        - ${extract-a.score || extract-b.score}
        - ${Date.now()}
```
Deploy both versions:
```bash
# Deploy variant A (conservative)
edgit deploy set extraction-prompt v1.0.0 --to prod

# Deploy variant B (aggressive)
edgit deploy set extraction-prompt v2.0.0 --to prod

# Both versions live simultaneously at the edge
```
## Model Comparison
Test different AI models for optimal cost/quality balance:
```yaml
ensemble: compare-models

agents:
  - name: route
    operation: code
    config:
      script: |
        const hash = input.userId.split('').reduce((a, b) => {
          return ((a << 5) - a) + b.charCodeAt(0);
        }, 0);
        return { variant: Math.abs(hash) % 3 };

  # Variant A: GPT-4o (high quality, high cost)
  - name: analyze-gpt4
    condition: ${route.output.variant === 0}
    operation: think
    component: analysis-prompt@v1.0.0
    config:
      provider: openai
      model: gpt-4o
      temperature: 0.3
    input:
      data: ${input.data}

  # Variant B: Claude Sonnet (balanced)
  - name: analyze-claude
    condition: ${route.output.variant === 1}
    operation: think
    component: analysis-prompt@v1.0.0  # Same prompt, different model
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
      temperature: 0.3
    input:
      data: ${input.data}

  # Variant C: Workers AI (low cost, edge-native)
  - name: analyze-workersai
    condition: ${route.output.variant === 2}
    operation: ml
    component: analysis-prompt@v1.0.0
    config:
      model: '@cf/meta/llama-3.1-8b-instruct'
      provider: cloudflare
    input:
      data: ${input.data}

  # Track metrics
  - name: track-metrics
    operation: code
    config:
      script: |
        const result = input.result;
        const variant = ['gpt4', 'claude', 'workersai'][input.variantIndex];
        env.ANALYTICS?.writeDataPoint({
          blobs: ['model-comparison', variant, input.userId],
          doubles: [result.executionTime, result.cost, result.qualityScore]
        });
        return { logged: true };
    input:
      variantIndex: ${route.output.variant}
      result: ${analyze-gpt4.output || analyze-claude.output || analyze-workersai.output}
      userId: ${input.userId}  # Passed through so the script can tag the data point
```
Measure each variant on the following; a simple winner-selection sketch follows the list.
- Quality scores (Conductor’s built-in scoring)
- Execution time and cost
- User satisfaction metrics
- Downstream conversion rates
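Once those metrics accumulate, picking a winner is a policy decision, for example the cheapest model whose quality clears a floor. A minimal sketch over per-variant aggregates (the `VariantStats` shape and `pickModel` helper are illustrative, not a Conductor API):

```typescript
// Illustrative shape for per-variant aggregates; not a Conductor API.
interface VariantStats {
  name: string;        // e.g. 'gpt4', 'claude', 'workersai'
  avgQuality: number;  // 0.0 - 1.0
  avgCostUsd: number;  // average cost per execution
  avgLatencyMs: number;
}

// Policy: cheapest variant whose average quality clears the floor.
function pickModel(stats: VariantStats[], qualityFloor = 0.85): VariantStats | undefined {
  return stats
    .filter((s) => s.avgQuality >= qualityFloor)
    .sort((a, b) => a.avgCostUsd - b.avgCostUsd)[0];
}
```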
## Temperature Optimization
Test different temperature settings to balance creativity against consistency:
```yaml
ensemble: optimize-temperature

agents:
  - name: assign-temperature
    operation: code
    config:
      script: |
        const temps = [0.3, 0.5, 0.7, 0.9];
        const index = input.userId % temps.length;
        return { temperature: temps[index] };

  - name: generate-content
    operation: think
    component: creative-prompt@v1.0.0
    config:
      model: claude-3-5-sonnet-20241022
      provider: anthropic
      temperature: ${assign-temperature.output.temperature}
    input:
      topic: ${input.topic}

  # Validate output quality
  - name: validate-quality
    operation: think
    config:
      model: gpt-4o-mini
      provider: openai
      systemPrompt: |
        Evaluate content quality on:
        - Creativity
        - Coherence
        - Factual accuracy
        - Engagement

        Return score 0.0-1.0
    input:
      content: ${generate-content.output.text}
      temperature: ${assign-temperature.output.temperature}

  - name: log-temperature-test
    operation: storage
    config:
      type: d1
      query: |
        INSERT INTO temperature_tests
          (temperature, quality_score, creativity_score, coherence_score)
        VALUES (?, ?, ?, ?)
      params:
        - ${assign-temperature.output.temperature}
        - ${validate-quality.output.overallScore}
        - ${validate-quality.output.creativityScore}
        - ${validate-quality.output.coherenceScore}
```
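To close the loop, aggregate the logged rows and promote the temperature with the best average quality. A sketch of that read-back query, assuming the D1 binding is named `DB` (the helper and binding name are illustrative):

```typescript
// Read back the rows written by log-temperature-test and rank
// temperatures by average quality. `DB` is an assumed binding name.
interface TempRow {
  temperature: number;
  avg_quality: number;
  samples: number;
}

async function bestTemperature(env: { DB: D1Database }): Promise<TempRow | null> {
  const { results } = await env.DB.prepare(
    `SELECT temperature,
            AVG(quality_score) AS avg_quality,
            COUNT(*) AS samples
     FROM temperature_tests
     GROUP BY temperature
     ORDER BY avg_quality DESC`
  ).all<TempRow>();
  return results[0] ?? null;
}
```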
## Prompt Structure Testing
Test different prompt structures:
```yaml
ensemble: test-prompt-structures

agents:
  - name: select-structure
    operation: code
    config:
      script: |
        const structures = ['chain-of-thought', 'direct', 'few-shot'];
        return { structure: structures[input.userId % structures.length] };

  # Chain of thought
  - name: cot-prompt
    condition: ${select-structure.output.structure === 'chain-of-thought'}
    operation: think
    component: cot-extraction@v1.0.0
    config:
      model: gpt-4o
      provider: openai
    input:
      document: ${input.document}

  # Direct instruction
  - name: direct-prompt
    condition: ${select-structure.output.structure === 'direct'}
    operation: think
    component: direct-extraction@v1.0.0
    config:
      model: gpt-4o
      provider: openai
    input:
      document: ${input.document}

  # Few-shot examples
  - name: fewshot-prompt
    condition: ${select-structure.output.structure === 'few-shot'}
    operation: think
    component: fewshot-extraction@v1.0.0
    config:
      model: gpt-4o
      provider: openai
    input:
      document: ${input.document}
```
Prompt examples:
```markdown
<!-- components/prompts/cot-extraction-v1.0.0.md -->
Extract key information from the document.

Think through this step by step:
1. First, identify the document type
2. Then, locate key data points
3. Finally, structure the extracted information

Document: {{document}}
```

```markdown
<!-- components/prompts/direct-extraction-v1.0.0.md -->
Extract the following from the document:
- Name
- Date
- Amount
- Description

Document: {{document}}
```

```markdown
<!-- components/prompts/fewshot-extraction-v1.0.0.md -->
Extract key information as shown in these examples:

Example 1:
Input: "Invoice #123 dated 2024-01-15 for $500"
Output: {"id": "123", "date": "2024-01-15", "amount": 500}

Example 2:
Input: "Receipt from 01/20/2024 total: $75.50"
Output: {"date": "2024-01-20", "amount": 75.50}

Now extract from: {{document}}
```
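Conductor fills the `{{document}}` placeholders at execution time. Purely as an illustration of what that substitution amounts to (not Conductor's actual implementation):

```typescript
// Illustrative only: substitute {{name}} placeholders in a prompt template.
function renderPrompt(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match, key: string) => vars[key] ?? '');
}

const rendered = renderPrompt('Now extract from: {{document}}', {
  document: 'Invoice #123 dated 2024-01-15 for $500'
});
// => "Now extract from: Invoice #123 dated 2024-01-15 for $500"
```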
## Multivariate Testing
Test multiple dimensions simultaneously:
```yaml
ensemble: multivariate-prompt-test

agents:
  - name: assign-variant
    operation: code
    config:
      script: |
        const hash = input.userId;
        const prompts = ['conservative', 'aggressive'];
        // Use full model IDs: this value feeds the model field below
        const models = ['gpt-4o', 'claude-3-5-sonnet-20241022'];
        const temps = [0.3, 0.7];
        return {
          prompt: prompts[hash % 2],
          model: models[Math.floor(hash / 2) % 2],
          temperature: temps[Math.floor(hash / 4) % 2],
          variant: `${prompts[hash % 2]}-${models[Math.floor(hash / 2) % 2]}-${temps[Math.floor(hash / 4) % 2]}`
        };

  - name: extract
    operation: think
    component: extraction-${assign-variant.output.prompt}@v1.0.0
    config:
      model: ${assign-variant.output.model}
      provider: ${assign-variant.output.model.includes('gpt') ? 'openai' : 'anthropic'}
      temperature: ${assign-variant.output.temperature}
    input:
      document: ${input.document}

  - name: log-multivariate
    operation: storage
    config:
      type: analytics
      dataPoint:
        blobs:
          - multivariate-test
          - ${assign-variant.output.variant}
        doubles:
          - ${extract.executionTime}
          - ${extract.cost}
          - ${extract.qualityScore}
```
This tests 2 × 2 × 2 = 8 variants (the assignment logic is sketched in TypeScript after the list):
- conservative + gpt-4o + 0.3
- conservative + gpt-4o + 0.7
- conservative + claude + 0.3
- conservative + claude + 0.7
- aggressive + gpt-4o + 0.3
- aggressive + gpt-4o + 0.7
- aggressive + claude + 0.3
- aggressive + claude + 0.7
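The assignment script treats the user hash as three binary digits, one per dimension, so each user lands in exactly one of the eight cells. Note that every added dimension doubles the cell count, and the per-cell sample requirement grows just as fast. The same assignment in TypeScript, for reference:

```typescript
// Each dimension consumes one bit of the hash, giving 2 × 2 × 2 = 8 cells.
function assignCell(hash: number) {
  const prompts = ['conservative', 'aggressive'] as const;
  const models = ['gpt-4o', 'claude-3-5-sonnet-20241022'] as const;
  const temps = [0.3, 0.7] as const;

  return {
    prompt: prompts[hash % 2],                    // bit 0
    model: models[Math.floor(hash / 2) % 2],      // bit 1
    temperature: temps[Math.floor(hash / 4) % 2]  // bit 2
  };
}
```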
## Progressive Rollout
Gradually increase traffic to the winning variant:
```yaml
ensemble: progressive-rollout

agents:
  - name: get-rollout-percentage
    operation: storage
    config:
      type: kv
      key: prompt-v2-rollout-percentage
      defaultValue: 5  # Start at 5%

  - name: select-version
    operation: code
    config:
      script: |
        const rolloutPct = input.rolloutPercentage;
        const random = Math.random() * 100;
        return {
          version: random < rolloutPct ? 'v2.0.0' : 'v1.0.0',
          isNewVersion: random < rolloutPct
        };
    input:
      rolloutPercentage: ${get-rollout-percentage.output.value}

  - name: extract
    operation: think
    component: extraction-prompt@${select-version.output.version}
    config:
      model: claude-3-5-sonnet-20241022
      provider: anthropic
    input:
      document: ${input.document}

  # Auto-increment the rollout while quality stays high
  - name: update-rollout
    condition: ${select-version.output.isNewVersion && extract.qualityScore > 0.9}
    operation: code
    config:
      script: |
        const currentPct = input.currentPercentage;
        const newPct = Math.min(currentPct + 5, 100);
        await env.KV.put('prompt-v2-rollout-percentage', newPct.toString());
        return {
          oldPercentage: currentPct,
          newPercentage: newPct
        };
    input:
      currentPercentage: ${get-rollout-percentage.output.value}
```
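The same KV key also works as an instant kill switch. A minimal sketch of the inverse guard, assuming a quality score is available wherever this runs (the 0.8 floor is an illustrative threshold):

```typescript
// If quality regresses, send all traffic back to v1.0.0 immediately.
// Uses the same KV key as the ensemble above; the threshold is an example.
async function guardRollout(env: { KV: KVNamespace }, qualityScore: number): Promise<void> {
  if (qualityScore < 0.8) {
    await env.KV.put('prompt-v2-rollout-percentage', '0');
  }
}
```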
## Measuring Results

### Built-in Quality Scoring
Every execution automatically emits quality metrics:
```yaml
ensemble: scored-comparison

agents:
  - name: extract
    operation: think
    component: extraction-prompt@${input.version}
    config:
      model: claude-3-5-sonnet-20241022
      provider: anthropic
      scoring:
        enabled: true
        criteria:
          accuracy: "Extracted data matches source"
          completeness: "All required fields present"
          format: "Output follows JSON schema"
        thresholds:
          minimum: 0.8
```
Conductor automatically tracks:
- Quality score (0.0-1.0)
- Execution time (ms)
- Token usage and cost
- Cache hit/miss
- Retry count
### Analytics Engine
Query test results:
```sql
-- Compare variants
SELECT
  blob2 AS variant,
  COUNT(*) AS executions,
  AVG(double1) AS avg_execution_time_ms,
  AVG(double2) AS avg_cost,
  AVG(double3) AS avg_quality_score
FROM analytics
WHERE blob1 = 'prompt-ab-test'
  AND timestamp > NOW() - INTERVAL '7 days'
GROUP BY variant
ORDER BY avg_quality_score DESC
```
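Analytics Engine datasets are also exposed through Cloudflare's SQL-over-HTTP API, so queries like the one above can run from code. A sketch, assuming `ACCOUNT_ID` and `API_TOKEN` are provided as bindings:

```typescript
// Query the Analytics Engine SQL API. ACCOUNT_ID and API_TOKEN are
// assumed bindings; the dataset name matches the query above.
async function compareVariants(env: { ACCOUNT_ID: string; API_TOKEN: string }) {
  const sql = `
    SELECT blob2 AS variant, AVG(double3) AS avg_quality_score
    FROM analytics
    WHERE blob1 = 'prompt-ab-test'
    GROUP BY variant
  `;

  const res = await fetch(
    `https://api.cloudflare.com/client/v4/accounts/${env.ACCOUNT_ID}/analytics_engine/sql`,
    {
      method: 'POST',
      headers: { Authorization: `Bearer ${env.API_TOKEN}` },
      body: sql
    }
  );
  return res.json();
}
```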
### Statistical Significance

Check whether the difference between variants is statistically significant:
```typescript
import { tTest } from '@ensemble-edge/conductor/utils';

const MIN_SAMPLES = 1000;
const CONFIDENCE = 0.95;

async function evaluateTest() {
  // Get results for both variants (getResults and promoteVersion
  // are application-specific helpers)
  const variantA = await getResults('v1.0.0');
  const variantB = await getResults('v2.0.0');

  // Require minimum samples before drawing any conclusion
  if (variantA.samples < MIN_SAMPLES || variantB.samples < MIN_SAMPLES) {
    console.log('Need more samples');
    return;
  }

  // Calculate p-value
  const pValue = tTest(variantA.scores, variantB.scores);

  if (pValue < 1 - CONFIDENCE) {
    const winner = variantA.mean > variantB.mean ? 'A' : 'B';
    console.log(`Variant ${winner} wins with ${CONFIDENCE * 100}% confidence`);

    // Promote the winner
    await promoteVersion(winner === 'A' ? 'v1.0.0' : 'v2.0.0');
  }
}
```
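If a t-test utility isn't available in your setup, Welch's t-test is small enough to sketch directly. The version below approximates the p-value with the standard normal CDF, which is reasonable at the sample sizes the `MIN_SAMPLES` gate enforces:

```typescript
// Welch's two-sample t-test with a normal approximation to the p-value.
// Adequate for large samples (MIN_SAMPLES = 1000 above); not a library API.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function sampleVariance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1);
}

// Standard normal CDF via the Abramowitz-Stegun erf approximation.
function normalCdf(z: number): number {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t;
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Two-sided p-value for the difference in means.
function welchTTest(a: number[], b: number[]): number {
  const se = Math.sqrt(sampleVariance(a) / a.length + sampleVariance(b) / b.length);
  const t = (mean(a) - mean(b)) / se;
  return 2 * (1 - normalCdf(Math.abs(t)));
}
```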
## Edge-Native Advantages

### Instant Deployment
```bash
# Deploy new prompt version globally in < 50ms
edgit deploy set extraction-prompt v2.0.0 --to prod

# Instant rollback if quality drops
edgit deploy set extraction-prompt v1.0.0 --to prod
```
No build step. No container deployment. No waiting.
### Geographic Testing
Test variants by region:
```typescript
export default {
  async fetch(request: Request, env: Env) {
    const input = (await request.json()) as Record<string, unknown>;
    const colo = request.cf?.colo ?? ''; // Airport code (e.g., "SJC")

    // Route US West to variant B, rest to variant A
    const variant = ['SJC', 'LAX', 'SEA'].includes(colo) ? 'v2.0.0' : 'v1.0.0';

    // conductorClient is assumed to be initialized elsewhere
    return conductorClient.execute({
      ensemble: 'extraction-pipeline',
      input: {
        ...input,
        promptVersion: variant
      }
    });
  }
};
```
### Deterministic User Routing

Ensure each user gets a consistent experience:
```typescript
function hashUserId(userId: string): number {
  return userId.split('').reduce((acc, char) => {
    return ((acc << 5) - acc) + char.charCodeAt(0);
  }, 0);
}

function getVariant(userId: string): 'a' | 'b' {
  const hash = hashUserId(userId);
  return Math.abs(hash) % 100 < 50 ? 'a' : 'b';
}

// Same user always gets the same variant; fall back for missing headers
const variant = getVariant(request.headers.get('x-user-id') ?? 'anonymous');
```
## Best Practices

### 1. Version Prompts in Edgit
```bash
# Register prompt as component
edgit components add prompt extraction-prompt components/prompts/extraction.md

# Create versions
edgit tag create extraction-prompt v1.0.0
edgit tag create extraction-prompt v2.0.0

# Deploy independently
edgit deploy set extraction-prompt v1.0.0 --to prod
```
### 2. Test One Dimension at a Time
Start simple:
```yaml
# ✅ Good: Test prompt only
variant-a: extraction-prompt@v1.0.0 + gpt-4o + temp 0.3
variant-b: extraction-prompt@v2.0.0 + gpt-4o + temp 0.3

# ❌ Bad: Change everything
variant-a: extraction-prompt@v1.0.0 + gpt-4o + temp 0.3
variant-b: different-prompt@v1.0.0 + claude + temp 0.7
```
### 3. Require Minimum Samples
Don’t declare winners prematurely:
```typescript
const MIN_SAMPLES = 1000;
const MIN_DAYS = 7;

function readyToEvaluate(samples: number, daysSinceStart: number) {
  if (samples < MIN_SAMPLES || daysSinceStart < MIN_DAYS) {
    return { status: 'collecting_data' };
  }
  return { status: 'ready' };
}
```
### 4. Monitor Business Metrics
Track actual outcomes:
```typescript
// ✅ Good: tie variants to real outcomes
await logMetric({
  variant,
  qualityScore: result.score,
  userConverted: await checkConversion(userId),
  revenue: await getRevenue(userId)
});

// ❌ Bad: optimizes a proxy without checking the actual goal
await logMetric({
  variant,
  clicks: clickCount
});
```
### 5. Use Gradual Rollout

Start small and step up on a schedule (an automation sketch follows):
```text
Week 1:   5% traffic to variant B
Week 2:  25% if successful
Week 3:  50% if still successful
Week 4: 100% (promote to default)
```
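One way to automate that schedule is a cron-triggered Worker that advances the same KV percentage the Progressive Rollout ensemble reads. A sketch, assuming the quality gate is checked elsewhere before each step:

```typescript
// Hypothetical weekly cron that steps the rollout through 5 → 25 → 50 → 100.
// Assumes the KV key from the Progressive Rollout section.
export default {
  async scheduled(_event: ScheduledEvent, env: { KV: KVNamespace }): Promise<void> {
    const steps = [5, 25, 50, 100];
    const current = Number((await env.KV.get('prompt-v2-rollout-percentage')) ?? '5');
    const next = steps.find((s) => s > current) ?? 100;
    await env.KV.put('prompt-v2-rollout-percentage', String(next));
  }
};
```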
## Testing
```typescript
import { describe, it, expect } from 'vitest';
import { TestConductor } from '@ensemble-edge/conductor/testing';

describe('prompt-ab-test', () => {
  it('should route variants correctly', async () => {
    const conductor = await TestConductor.create();

    // Test variant A
    const resultA = await conductor.executeEnsemble('test-extraction-prompts', {
      userId: 2, // Even = variant A
      document: 'Test document'
    });
    expect(resultA.output.variant).toBe('a');

    // Test variant B
    const resultB = await conductor.executeEnsemble('test-extraction-prompts', {
      userId: 3, // Odd = variant B
      document: 'Test document'
    });
    expect(resultB.output.variant).toBe('b');
  });

  it('should log results correctly', async () => {
    const conductor = await TestConductor.create();

    await conductor.executeEnsemble('test-extraction-prompts', {
      userId: 123,
      document: 'Test'
    });

    const logs = await conductor.getStorageWrites('d1');
    expect(logs).toHaveLength(1);
    expect(logs[0].table).toBe('ab_test_results');
  });
});
```
## Next Steps

A/B testing is not a bolt-on feature; it is a fundamental architectural capability of Ensemble Edge. Version prompts independently, deploy instantly at the edge, and let Conductor measure quality automatically.