Multivariate Agent Testing Playbook
Test combinations: models × prompts × temperatures × agents. Find the optimal configuration faster.

Multivariate testing goes beyond simple A/B tests by varying multiple variables simultaneously. Instead of testing 2 variants, test 4, 8, or even 16 to find the best combination of model, prompt, temperature, and agent configuration.

Why Multivariate Testing?
A/B testing limitation:
- Test prompt A vs prompt B with model X
- Then test model X vs model Y with prompt A
- Sequential testing takes weeks

Multivariate testing:
- Test all combinations simultaneously:
  - Prompt A + Model X
  - Prompt A + Model Y
  - Prompt B + Model X
  - Prompt B + Model Y
- Find the optimal combination in days
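Every split in this playbook relies on deterministic modulo bucketing: a stable user ID maps to the same variant on every request, so each user gets a consistent experience. A minimal sketch of that assignment logic (the `assignVariant` helper is illustrative, not part of the Conductor API):

```typescript
// Deterministically assign a user to one of N variants.
// The same user ID always lands in the same bucket.
function assignVariant(userId: number, labels: string[]): string {
  return labels[userId % labels.length];
}

const labels = ['aa', 'ab', 'ba', 'bb'];
// user 0 → 'aa' (0 % 4 === 0), user 6 → 'ba' (6 % 4 === 2)
```

Because the bucket is a pure function of the ID, no per-user assignment state needs to be stored.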
2×2 Test (4 Variants)
Test 2 models × 2 prompts = 4 combinations:

```yaml
ensemble: multivariate-2x2
description: Test 2 models × 2 prompts

agents:
  # Model A (GPT-4o) + Prompt A (25%)
  - name: variant-aa
    condition: ${input.user_id % 4 === 0}
    operation: think
    component: prompt-a@v1.0.0
    config:
      provider: openai
      model: gpt-4o
      temperature: 0.7

  # Model A (GPT-4o) + Prompt B (25%)
  - name: variant-ab
    condition: ${input.user_id % 4 === 1}
    operation: think
    component: prompt-b@v1.0.0
    config:
      provider: openai
      model: gpt-4o
      temperature: 0.7

  # Model B (Claude) + Prompt A (25%)
  - name: variant-ba
    condition: ${input.user_id % 4 === 2}
    operation: think
    component: prompt-a@v1.0.0
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
      temperature: 0.7

  # Model B (Claude) + Prompt B (25%)
  - name: variant-bb
    condition: ${input.user_id % 4 === 3}
    operation: think
    component: prompt-b@v1.0.0
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
      temperature: 0.7

  # Log results
  - name: log-variant
    operation: storage
    config:
      type: d1
      query: |
        INSERT INTO multivariate_results (
          user_id, variant, model, prompt, latency, timestamp
        ) VALUES (?, ?, ?, ?, ?, ?)
      params:
        - ${input.user_id}
        - ${variant-aa.executed ? 'aa' : variant-ab.executed ? 'ab' : variant-ba.executed ? 'ba' : 'bb'}
        - ${variant-aa.executed || variant-ab.executed ? 'gpt-4o' : 'claude'}
        - ${variant-aa.executed || variant-ba.executed ? 'prompt-a' : 'prompt-b'}
        - ${variant-aa.latency || variant-ab.latency || variant-ba.latency || variant-bb.latency}
        - ${Date.now()}

output:
  variant: ${variant-aa.executed ? 'aa' : variant-ab.executed ? 'ab' : variant-ba.executed ? 'ba' : 'bb'}
  result: ${variant-aa.output || variant-ab.output || variant-ba.output || variant-bb.output}
```
Analyze results from the past week:

```sql
SELECT
  variant,
  model,
  prompt,
  COUNT(*) as requests,
  AVG(latency) as avg_latency,
  SUM(CASE WHEN feedback_score > 0.8 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) as quality_rate
FROM multivariate_results
WHERE timestamp > strftime('%s', 'now', '-7 days') * 1000
GROUP BY variant, model, prompt
ORDER BY quality_rate DESC, avg_latency ASC;
```
2×2×2 Test (8 Variants)
Test 2 models × 2 prompts × 2 temperatures = 8 combinations:

```yaml
ensemble: multivariate-2x2x2
description: Test 2 models × 2 prompts × 2 temperatures

agents:
  # GPT-4o + Prompt A + Temp 0.3 (12.5%)
  - name: variant-aa-low
    condition: ${input.user_id % 8 === 0}
    operation: think
    component: prompt-a@v1.0.0
    config:
      provider: openai
      model: gpt-4o
      temperature: 0.3

  # GPT-4o + Prompt A + Temp 0.9 (12.5%)
  - name: variant-aa-high
    condition: ${input.user_id % 8 === 1}
    operation: think
    component: prompt-a@v1.0.0
    config:
      provider: openai
      model: gpt-4o
      temperature: 0.9

  # GPT-4o + Prompt B + Temp 0.3 (12.5%)
  - name: variant-ab-low
    condition: ${input.user_id % 8 === 2}
    operation: think
    component: prompt-b@v1.0.0
    config:
      provider: openai
      model: gpt-4o
      temperature: 0.3

  # GPT-4o + Prompt B + Temp 0.9 (12.5%)
  - name: variant-ab-high
    condition: ${input.user_id % 8 === 3}
    operation: think
    component: prompt-b@v1.0.0
    config:
      provider: openai
      model: gpt-4o
      temperature: 0.9

  # Claude + Prompt A + Temp 0.3 (12.5%)
  - name: variant-ba-low
    condition: ${input.user_id % 8 === 4}
    operation: think
    component: prompt-a@v1.0.0
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
      temperature: 0.3

  # Claude + Prompt A + Temp 0.9 (12.5%)
  - name: variant-ba-high
    condition: ${input.user_id % 8 === 5}
    operation: think
    component: prompt-a@v1.0.0
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
      temperature: 0.9

  # Claude + Prompt B + Temp 0.3 (12.5%)
  - name: variant-bb-low
    condition: ${input.user_id % 8 === 6}
    operation: think
    component: prompt-b@v1.0.0
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
      temperature: 0.3

  # Claude + Prompt B + Temp 0.9 (12.5%)
  - name: variant-bb-high
    condition: ${input.user_id % 8 === 7}
    operation: think
    component: prompt-b@v1.0.0
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
      temperature: 0.9
```
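Past a handful of variants, enumerating combinations by hand gets error-prone. A sketch of generating the factorial matrix programmatically (the `cartesian` helper is illustrative, not a Conductor feature):

```typescript
// Build the cartesian product of experiment factors.
function cartesian<T>(...factors: T[][]): T[][] {
  return factors.reduce<T[][]>(
    (acc, factor) => acc.flatMap(combo => factor.map(v => [...combo, v])),
    [[]]
  );
}

const variants = cartesian(
  ['gpt-4o', 'claude-3-5-sonnet-20241022'], // models
  ['prompt-a', 'prompt-b'],                 // prompts
  ['0.3', '0.9']                            // temperatures
);
// variants.length === 8, matching the 2×2×2 design above
```

The same helper could emit the YAML variant stanzas from a template, keeping the ensemble file in sync with the design.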
Agent Version Testing
Test multiple versions of the same agent:

```yaml
ensemble: test-agent-versions
description: Compare agent v1.0.0 vs v2.0.0 vs v3.0.0

agents:
  # Route 33% to each version
  - name: v1
    condition: ${input.user_id % 3 === 0}
    agent: my-agent@v1.0.0
    input:
      data: ${input.data}

  - name: v2
    condition: ${input.user_id % 3 === 1}
    agent: my-agent@v2.0.0
    input:
      data: ${input.data}

  - name: v3
    condition: ${input.user_id % 3 === 2}
    agent: my-agent@v3.0.0
    input:
      data: ${input.data}

  # Track which version was used
  - name: log-version
    operation: storage
    config:
      type: d1
      query: |
        INSERT INTO agent_versions (
          user_id, version, latency, success, timestamp
        ) VALUES (?, ?, ?, ?, ?)
      params:
        - ${input.user_id}
        - ${v1.executed ? 'v1.0.0' : v2.executed ? 'v2.0.0' : 'v3.0.0'}
        - ${v1.latency || v2.latency || v3.latency}
        - ${v1.success || v2.success || v3.success}
        - ${Date.now()}

output:
  version: ${v1.executed ? 'v1.0.0' : v2.executed ? 'v2.0.0' : 'v3.0.0'}
  result: ${v1.output || v2.output || v3.output}
```
Multi-Agent Coordination Testing
Test different multi-agent coordination patterns:

Parallel Research Agents

```yaml
ensemble: multi-agent-research
description: Coordinate research, analysis, and writing agents

state:
  schema:
    researchFindings: array
    analysisResults: object
    synthesis: string

agents:
  # Phase 1: Parallel Research by Specialized Agents
  - parallel:
      - name: research-technical
        operation: think
        config:
          provider: openai
          model: gpt-4o
          prompt: |
            You are a technical research specialist.
            Research technical aspects of: ${input.topic}
            Provide detailed technical information and specifications.
        state:
          set: [researchFindings]

      - name: research-market
        operation: think
        config:
          provider: anthropic
          model: claude-3-5-sonnet-20241022
          prompt: |
            You are a market research specialist.
            Research market aspects of: ${input.topic}
            Analyze trends, adoption rates, and business implications.
        state:
          set: [researchFindings]

      - name: research-competitive
        operation: think
        config:
          provider: openai
          model: gpt-4o
          prompt: |
            You are a competitive analysis specialist.
            Research competitive landscape for: ${input.topic}
            Identify competitors and comparative advantages.
        state:
          set: [researchFindings]

  # Phase 2: Synthesis Agent Combines Research
  - name: synthesize-research
    operation: think
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
      prompt: |
        You are a research synthesis specialist.
        Combine multiple research perspectives into cohesive insights.

        Technical Research: ${research-technical.output}
        Market Research: ${research-market.output}
        Competitive Research: ${research-competitive.output}

        Provide 5-7 key insights that integrate all perspectives.
    state:
      use: [researchFindings]
      set: [synthesis]

  # Phase 3: Parallel Analysis by Domain Experts
  - parallel:
      - name: analyze-opportunities
        operation: think
        config:
          provider: openai
          model: gpt-4o
          prompt: |
            You are a strategic opportunity analyst.
            Based on these insights, identify opportunities:
            ${synthesize-research.output}
        state:
          set: [analysisResults]

      - name: analyze-risks
        operation: think
        config:
          provider: anthropic
          model: claude-3-5-sonnet-20241022
          prompt: |
            You are a risk assessment specialist.
            Based on these insights, analyze risks:
            ${synthesize-research.output}
        state:
          set: [analysisResults]

      - name: analyze-recommendations
        operation: think
        config:
          provider: openai
          model: gpt-4o
          prompt: |
            You are a strategic advisor.
            Based on these insights, provide actionable recommendations:
            ${synthesize-research.output}
        state:
          set: [analysisResults]

  # Phase 4: Generate Final Report
  - name: generate-report
    operation: think
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
      prompt: |
        Generate a comprehensive report based on:

        Synthesis: ${synthesize-research.output}
        Opportunities: ${analyze-opportunities.output}
        Risks: ${analyze-risks.output}
        Recommendations: ${analyze-recommendations.output}

output:
  report: ${generate-report.output}
```
Testing Different Coordination Patterns
Test sequential vs parallel vs hybrid coordination:

```yaml
ensemble: test-coordination-patterns
description: Compare different multi-agent coordination approaches

agents:
  # Pattern A: Sequential (33%)
  - name: pattern-sequential
    condition: ${input.user_id % 3 === 0}
    operation: code
    config:
      code: |
        // Execute agents sequentially
        const agent1 = await executeAgent('agent-1', input);
        const agent2 = await executeAgent('agent-2', agent1.output);
        const agent3 = await executeAgent('agent-3', agent2.output);
        return agent3.output;

  # Pattern B: Parallel (33%)
  - parallel:
      - name: pattern-parallel-1
        condition: ${input.user_id % 3 === 1}
        operation: think
      - name: pattern-parallel-2
        condition: ${input.user_id % 3 === 1}
        operation: think
      - name: pattern-parallel-3
        condition: ${input.user_id % 3 === 1}
        operation: think

  # Pattern C: Hybrid (33%)
  - name: pattern-hybrid-fetch
    condition: ${input.user_id % 3 === 2}
    operation: http
  - parallel:
      - name: pattern-hybrid-analyze-1
        condition: ${input.user_id % 3 === 2}
        operation: think
        input: ${pattern-hybrid-fetch.output}
      - name: pattern-hybrid-analyze-2
        condition: ${input.user_id % 3 === 2}
        operation: think
        input: ${pattern-hybrid-fetch.output}
  - name: pattern-hybrid-synthesize
    condition: ${input.user_id % 3 === 2}
    operation: think
    input:
      analyze1: ${pattern-hybrid-analyze-1.output}
      analyze2: ${pattern-hybrid-analyze-2.output}
```
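The main trade-off between these patterns is latency: a sequential pipeline costs the sum of its step latencies, a parallel fan-out costs the maximum, and a hybrid sits in between. A rough model (the millisecond figures are made up for illustration):

```typescript
// Model end-to-end latency for each coordination pattern.
const stepLatenciesMs = [400, 650, 300]; // hypothetical per-agent latencies

// Sequential: each agent waits for the previous one.
const sequentialMs = stepLatenciesMs.reduce((a, b) => a + b, 0);

// Parallel: all agents run at once; the slowest one dominates.
const parallelMs = Math.max(...stepLatenciesMs);

// Hybrid: a 200ms fetch, then two parallel analyses (400ms, 650ms),
// then a 300ms synthesis step.
const hybridMs = 200 + Math.max(400, 650) + 300;
```

Quality may move in the opposite direction (sequential steps can build on each other), which is exactly what the multivariate test above measures.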
Component Version Matrix Testing
Test all combinations of component versions:

```yaml
ensemble: component-version-matrix
description: Test prompt@v1 vs v2 with config@v1 vs v2

agents:
  # Prompt v1 + Config v1 (25%)
  - name: variant-p1c1
    condition: ${input.user_id % 4 === 0}
    operation: think
    component: my-prompt@v1.0.0
    config:
      provider: openai
      model: ${component.model-config@v1.0.0.model}
      temperature: ${component.model-config@v1.0.0.temperature}

  # Prompt v1 + Config v2 (25%)
  - name: variant-p1c2
    condition: ${input.user_id % 4 === 1}
    operation: think
    component: my-prompt@v1.0.0
    config:
      provider: openai
      model: ${component.model-config@v2.0.0.model}
      temperature: ${component.model-config@v2.0.0.temperature}

  # Prompt v2 + Config v1 (25%)
  - name: variant-p2c1
    condition: ${input.user_id % 4 === 2}
    operation: think
    component: my-prompt@v2.0.0
    config:
      provider: openai
      model: ${component.model-config@v1.0.0.model}
      temperature: ${component.model-config@v1.0.0.temperature}

  # Prompt v2 + Config v2 (25%)
  - name: variant-p2c2
    condition: ${input.user_id % 4 === 3}
    operation: think
    component: my-prompt@v2.0.0
    config:
      provider: openai
      model: ${component.model-config@v2.0.0.model}
      temperature: ${component.model-config@v2.0.0.temperature}
```
Real-World Example: Customer Support Analysis
Test different agent combinations for customer support ticket analysis:

```yaml
ensemble: support-ticket-analysis
description: Test classification + sentiment + priority agents

agents:
  # Variant A: Single unified agent (33%)
  - name: unified-agent
    condition: ${input.ticket_id % 3 === 0}
    operation: think
    component: unified-support-prompt@v1.0.0
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
    input:
      ticket: ${input.ticket_text}

  # Variant B: Specialized parallel agents (33%)
  - parallel:
      - name: classify-ticket
        condition: ${input.ticket_id % 3 === 1}
        operation: think
        component: classification-prompt@v1.0.0
        config:
          provider: openai
          model: gpt-4o-mini
        input:
          ticket: ${input.ticket_text}

      - name: analyze-sentiment
        condition: ${input.ticket_id % 3 === 1}
        operation: think
        component: sentiment-prompt@v1.0.0
        config:
          provider: cloudflare
          model: '@cf/meta/llama-3.1-8b-instruct'
        input:
          ticket: ${input.ticket_text}

      - name: determine-priority
        condition: ${input.ticket_id % 3 === 1}
        operation: think
        component: priority-prompt@v1.0.0
        config:
          provider: openai
          model: gpt-4o-mini
        input:
          ticket: ${input.ticket_text}

  # Variant C: Sequential specialized agents (33%)
  - name: classify-first
    condition: ${input.ticket_id % 3 === 2}
    operation: think
    component: classification-prompt@v1.0.0
    input:
      ticket: ${input.ticket_text}

  - name: analyze-with-classification
    condition: ${input.ticket_id % 3 === 2}
    operation: think
    component: contextual-analysis-prompt@v1.0.0
    input:
      ticket: ${input.ticket_text}
      category: ${classify-first.output.category}

  # Store results for analysis
  - name: log-results
    operation: storage
    config:
      type: d1
      query: |
        INSERT INTO support_analysis_results (
          ticket_id, variant, latency, cost_estimate, timestamp
        ) VALUES (?, ?, ?, ?, ?)
      params:
        - ${input.ticket_id}
        - ${unified-agent.executed ? 'unified' : classify-ticket.executed ? 'parallel' : 'sequential'}
        - ${unified-agent.latency || (classify-ticket.latency + analyze-sentiment.latency) || (classify-first.latency + analyze-with-classification.latency)}
        - ${calculate-cost(variant)}
        - ${Date.now()}

output:
  variant: ${unified-agent.executed ? 'unified' : classify-ticket.executed ? 'parallel' : 'sequential'}
  result: ${unified-agent.executed ? unified-agent.output : classify-ticket.executed ? { category: classify-ticket.output, sentiment: analyze-sentiment.output, priority: determine-priority.output } : analyze-with-classification.output}
```
Dynamic Variant Allocation
Adjust traffic allocation based on performance:

```yaml
ensemble: dynamic-multivariate
description: Auto-adjust variant traffic based on performance

agents:
  # Get current allocation percentages from KV
  - name: get-allocations
    operation: storage
    config:
      type: kv
      action: get
      key: variant-allocations
      default: { aa: 25, ab: 25, ba: 25, bb: 25 }

  # Determine which variant to use
  - name: select-variant
    operation: code
    config:
      code: |
        const allocations = ${get-allocations.output};
        const rand = ${input.user_id % 100};
        let cumulative = 0;
        for (const [variant, pct] of Object.entries(allocations)) {
          cumulative += pct;
          if (rand < cumulative) return variant;
        }
        return 'aa';

  # Execute selected variant
  - name: variant-aa
    condition: ${select-variant.output === 'aa'}
    operation: think
    component: prompt-a@v1.0.0
    config:
      provider: openai
      model: gpt-4o

  - name: variant-ab
    condition: ${select-variant.output === 'ab'}
    operation: think
    component: prompt-b@v1.0.0
    config:
      provider: openai
      model: gpt-4o

  - name: variant-ba
    condition: ${select-variant.output === 'ba'}
    operation: think
    component: prompt-a@v1.0.0
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022

  - name: variant-bb
    condition: ${select-variant.output === 'bb'}
    operation: think
    component: prompt-b@v1.0.0
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
```
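The `select-variant` step walks a cumulative distribution over the allocation percentages. The same logic, extracted as a standalone function so it can be unit-tested (a sketch; Conductor evaluates the inline version itself):

```typescript
// Pick a variant from percentage allocations using a 0-99 bucket
// derived from the user ID, so assignment stays deterministic.
function selectVariant(
  allocations: Record<string, number>,
  bucket: number // input.user_id % 100
): string {
  let cumulative = 0;
  for (const [variant, pct] of Object.entries(allocations)) {
    cumulative += pct;
    if (bucket < cumulative) return variant;
  }
  // Fallback if the percentages sum to less than 100.
  return Object.keys(allocations)[0];
}

const allocations = { aa: 25, ab: 25, ba: 25, bb: 25 };
// buckets 0-24 → 'aa', 25-49 → 'ab', 50-74 → 'ba', 75-99 → 'bb'
```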
Recompute the allocations on a schedule:

```typescript
// Scheduled every hour
export default {
  async scheduled(event: ScheduledEvent, env: Env) {
    // Query performance of each variant from the last hour
    const results = await env.DB.prepare(`
      SELECT
        variant,
        AVG(quality_score) as avg_quality,
        AVG(latency) as avg_latency,
        COUNT(*) as sample_size
      FROM multivariate_results
      WHERE timestamp > ?
      GROUP BY variant
    `).bind(Date.now() - 60 * 60 * 1000).all();

    // Calculate new allocations using Thompson Sampling
    const allocations = calculateOptimalAllocations(results.results);

    // Update KV
    await env.CACHE.put('variant-allocations', JSON.stringify(allocations));
    console.log('Updated variant allocations:', allocations);
  }
};
```
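`calculateOptimalAllocations` is referenced above but not defined. One possible implementation, assumed here, in the spirit of Thompson Sampling: sample a quality score per variant from a normal approximation of its posterior, count how often each variant wins over many draws, and allocate traffic proportionally with a small exploration floor. The `0.1` prior scale and 5% floor are illustrative choices:

```typescript
interface VariantStats { variant: string; avg_quality: number; sample_size: number; }

// Sample from N(mean, stderr) via the Box-Muller transform.
function sampleNormal(mean: number, stderr: number): number {
  const u1 = Math.random() || 1e-9;
  const u2 = Math.random();
  const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
  return mean + z * stderr;
}

// Allocate traffic in proportion to how often each variant "wins"
// a sampled draw, keeping a per-variant exploration floor.
function calculateOptimalAllocations(
  results: VariantStats[],
  draws = 10_000,
  floorPct = 5
): Record<string, number> {
  const wins: Record<string, number> = Object.fromEntries(
    results.map(r => [r.variant, 0])
  );
  for (let i = 0; i < draws; i++) {
    let best = results[0].variant;
    let bestScore = -Infinity;
    for (const r of results) {
      // Uncertainty shrinks as sample size grows.
      const stderr = 0.1 / Math.sqrt(Math.max(r.sample_size, 1));
      const score = sampleNormal(r.avg_quality, stderr);
      if (score > bestScore) { bestScore = score; best = r.variant; }
    }
    wins[best]++;
  }
  const free = 100 - floorPct * results.length;
  return Object.fromEntries(
    results.map(r => [r.variant, floorPct + (wins[r.variant] / draws) * free])
  );
}
```

The floor ensures underperforming variants keep receiving a trickle of traffic, so a variant that improves (say, after a model update) can still climb back.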
Statistical Analysis
Ensure statistical significance before declaring a winner:

```typescript
import { describe, it, expect } from 'vitest';
import { TestConductor } from '@ensemble-edge/conductor/testing';

describe('multivariate-test', () => {
  it('should have statistically significant results', async () => {
    const conductor = await TestConductor.create();
    const results: Record<string, { quality: number; latency: number }[]> = {
      aa: [], ab: [], ba: [], bb: []
    };

    // Collect 1000 samples per variant
    for (let i = 0; i < 4000; i++) {
      const result = await conductor.execute('multivariate-2x2', {
        user_id: i,
        input: testInput
      });
      const variant = result.output.variant;
      results[variant].push({
        quality: result.quality_score,
        latency: result.latency
      });
    }

    // Calculate statistics for each variant
    const stats = Object.entries(results).map(([variant, data]) => ({
      variant,
      mean: calculateMean(data.map(d => d.quality)),
      stddev: calculateStdDev(data.map(d => d.quality)),
      sampleSize: data.length
    }));

    // Find best variant
    const best = stats.reduce((a, b) => (a.mean > b.mean ? a : b));

    // Run t-test against second best
    const secondBest = stats
      .filter(s => s !== best)
      .reduce((a, b) => (a.mean > b.mean ? a : b));
    const tTest = performTTest(best, secondBest);

    // p-value < 0.05 means statistically significant
    expect(tTest.pValue).toBeLessThan(0.05);
    console.log(`Winner: ${best.variant} with p-value ${tTest.pValue}`);
  });
});
```
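The test above assumes `calculateMean`, `calculateStdDev`, and `performTTest` helpers. A minimal sketch using Welch's t statistic, with a normal approximation for the two-sided p-value (reasonable at ~1000 samples per variant):

```typescript
function calculateMean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Sample standard deviation (n - 1 denominator).
function calculateStdDev(xs: number[]): number {
  const m = calculateMean(xs);
  return Math.sqrt(xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1));
}

interface Summary { mean: number; stddev: number; sampleSize: number; }

// Welch's t-test on summary statistics; the p-value uses the standard
// normal CDF via the Abramowitz-Stegun erf approximation.
function performTTest(a: Summary, b: Summary): { t: number; pValue: number } {
  const se = Math.sqrt(
    a.stddev ** 2 / a.sampleSize + b.stddev ** 2 / b.sampleSize
  );
  const t = (a.mean - b.mean) / se;
  const phi = (z: number) => {
    const sign = z < 0 ? -1 : 1;
    const x = Math.abs(z) / Math.SQRT2;
    const q = 1 / (1 + 0.3275911 * x);
    const erf = 1 - (((((1.061405429 * q - 1.453152027) * q) + 1.421413741)
      * q - 0.284496736) * q + 0.254829592) * q * Math.exp(-x * x);
    return 0.5 * (1 + sign * erf);
  };
  return { t, pValue: 2 * (1 - phi(Math.abs(t))) };
}
```

For small samples or many simultaneous comparisons, a proper t distribution and a multiple-comparison correction (e.g. Bonferroni across the variant pairs) would be safer.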
Best Practices
1. Start with Factorial Design
Test all combinations systematically:
- 2 models × 2 prompts = 4 variants (2²)
- 2 models × 2 prompts × 2 temps = 8 variants (2³)
- 3 models × 2 prompts × 2 temps = 12 variants (3×2×2)
2. Ensure Sufficient Sample Size
```typescript
// Minimum sample size per variant
const minSampleSize = 385; // For 95% confidence, ±5% margin

// Monitor sample sizes
const sampleSizes = await env.DB.prepare(`
  SELECT variant, COUNT(*) as n
  FROM multivariate_results
  WHERE timestamp > ?
  GROUP BY variant
`).bind(Date.now() - 7 * 24 * 60 * 60 * 1000).all();

const insufficient = sampleSizes.results.filter(r => r.n < minSampleSize);
if (insufficient.length > 0) {
  console.warn('Insufficient samples:', insufficient);
}
```
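The 385 figure follows from the standard sample-size formula for estimating a proportion, n = z²·p(1−p)/e², with z = 1.96 (95% confidence), worst-case p = 0.5, and margin of error e = 0.05:

```typescript
// Minimum sample size per variant for a proportion estimate.
// z: z-score for the confidence level (1.96 ≈ 95%)
// p: expected proportion (0.5 is the conservative worst case)
// marginOfError: acceptable half-width of the confidence interval
function minSampleSize(z = 1.96, p = 0.5, marginOfError = 0.05): number {
  return Math.ceil((z ** 2 * p * (1 - p)) / marginOfError ** 2);
}

minSampleSize();                 // 385 at 95% confidence, ±5%
minSampleSize(1.96, 0.5, 0.03);  // a tighter ±3% margin needs ~2.8x more
```

Note the requirement applies per variant, so an 8-variant test needs roughly 8 × 385 ≈ 3,100 total samples.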
3. Control for Confounding Variables
Ensure random, deterministic assignment:

```yaml
# Good - deterministic and uniform across users
condition: ${input.user_id % 4 === 0}

# Bad - time-based, so assignment correlates with traffic patterns
condition: ${Date.now() % 4 === 0}
```
4. Monitor Interaction Effects
Some combinations may interact unexpectedly:

```sql
-- Check for interaction effects
SELECT
  model,
  prompt,
  AVG(quality_score) as avg_quality
FROM multivariate_results
GROUP BY model, prompt;

-- Example result showing interaction:
-- gpt-4o + prompt-a = 0.85 (expected: 0.80)
-- gpt-4o + prompt-b = 0.65 (expected: 0.70)
-- claude + prompt-a = 0.75 (expected: 0.80)
-- claude + prompt-b = 0.85 (expected: 0.70)
-- Prompt-b works better with Claude!
```
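An interaction effect can be quantified as each cell's deviation from its additive expectation (row mean + column mean − grand mean); a large positive residual means the combination outperforms what the two factors would predict independently. A sketch over illustrative cell means like those above:

```typescript
// avg_quality per (model, prompt) cell, as returned by the query above.
const cells: Record<string, Record<string, number>> = {
  'gpt-4o': { 'prompt-a': 0.85, 'prompt-b': 0.65 },
  'claude': { 'prompt-a': 0.75, 'prompt-b': 0.85 },
};

const models = Object.keys(cells);
const prompts = Object.keys(cells[models[0]]);
const all = models.flatMap(m => prompts.map(p => cells[m][p]));
const grand = all.reduce((a, b) => a + b, 0) / all.length;

const rowMean = (m: string) =>
  prompts.reduce((s, p) => s + cells[m][p], 0) / prompts.length;
const colMean = (p: string) =>
  models.reduce((s, m) => s + cells[m][p], 0) / models.length;

// Positive: the combination beats its additive expectation.
function interaction(m: string, p: string): number {
  return cells[m][p] - (rowMean(m) + colMean(p) - grand);
}
// interaction('claude', 'prompt-b') ≈ +0.075 → prompt-b pairs well with Claude
```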

