
Multivariate Agent Testing Playbook

Test combinations of models × prompts × temperatures × agents to find the optimal configuration faster. Multivariate testing goes beyond simple A/B tests by varying multiple variables at once: instead of comparing 2 variants, compare 4, 8, or even 16 to find the best combination of model, prompt, temperature, and agent configuration.

Why Multivariate Testing?

A/B Testing Limitations:
  • Test prompt A vs prompt B with model X
  • Then test model X vs model Y with prompt A
  • Sequential testing takes weeks
Multivariate Testing Advantage:
  • Test all combinations simultaneously:
    • Prompt A + Model X
    • Prompt A + Model Y
    • Prompt B + Model X
    • Prompt B + Model Y
  • Find optimal combination in days

2×2 Test (4 Variants)

Test 2 models × 2 prompts = 4 combinations:
ensemble: multivariate-2x2
description: Test 2 models × 2 prompts

agents:
  # Model A (GPT-4o) + Prompt A (25%)
  - name: variant-aa
    condition: ${input.user_id % 4 === 0}
    operation: think
    component: prompt-a@v1.0.0
    config:
      provider: openai
      model: gpt-4o
      temperature: 0.7

  # Model A (GPT-4o) + Prompt B (25%)
  - name: variant-ab
    condition: ${input.user_id % 4 === 1}
    operation: think
    component: prompt-b@v1.0.0
    config:
      provider: openai
      model: gpt-4o
      temperature: 0.7

  # Model B (Claude) + Prompt A (25%)
  - name: variant-ba
    condition: ${input.user_id % 4 === 2}
    operation: think
    component: prompt-a@v1.0.0
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
      temperature: 0.7

  # Model B (Claude) + Prompt B (25%)
  - name: variant-bb
    condition: ${input.user_id % 4 === 3}
    operation: think
    component: prompt-b@v1.0.0
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
      temperature: 0.7

  # Log results
  - name: log-variant
    operation: storage
    config:
      type: d1
      query: |
        INSERT INTO multivariate_results (
          user_id, variant, model, prompt, latency, timestamp
        ) VALUES (?, ?, ?, ?, ?, ?)
      params:
        - ${input.user_id}
        - ${variant-aa.executed ? 'aa' : variant-ab.executed ? 'ab' : variant-ba.executed ? 'ba' : 'bb'}
        - ${variant-aa.executed || variant-ab.executed ? 'gpt-4o' : 'claude'}
        - ${variant-aa.executed || variant-ba.executed ? 'prompt-a' : 'prompt-b'}
        - ${variant-aa.latency || variant-ab.latency || variant-ba.latency || variant-bb.latency}
        - ${Date.now()}

output:
  variant: ${variant-aa.executed ? 'aa' : variant-ab.executed ? 'ab' : variant-ba.executed ? 'ba' : 'bb'}
  result: ${variant-aa.output || variant-ab.output || variant-ba.output || variant-bb.output}
Analyze results:
SELECT
  variant,
  model,
  prompt,
  COUNT(*) as requests,
  AVG(latency) as avg_latency,
  SUM(CASE WHEN feedback_score > 0.8 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) as quality_rate
FROM multivariate_results
WHERE timestamp > strftime('%s', 'now', '-7 days') * 1000
GROUP BY variant, model, prompt
ORDER BY quality_rate DESC, avg_latency ASC;
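If you want this check to run automatically, here is a minimal sketch (assuming the same Env type and D1 binding env.DB used in the Worker example later in this playbook) that runs the query and logs the current leader:
// Sketch: query the last 7 days of results and log the leading variant.
// Assumes the multivariate_results table above and a D1 binding `DB`.
export async function reportLeader(env: Env) {
  const { results } = await env.DB.prepare(`
    SELECT variant, model, prompt,
           COUNT(*) as requests,
           AVG(latency) as avg_latency,
           SUM(CASE WHEN feedback_score > 0.8 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) as quality_rate
    FROM multivariate_results
    WHERE timestamp > ?
    GROUP BY variant, model, prompt
    ORDER BY quality_rate DESC, avg_latency ASC
  `).bind(Date.now() - 7 * 24 * 60 * 60 * 1000).all();

  // The first row is the current leader; treat it as provisional until
  // sample sizes are sufficient (see Best Practices below).
  console.log('Current leader:', results[0]);
}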

2×2×2 Test (8 Variants)

Test 2 models × 2 prompts × 2 temperatures = 8 combinations:
ensemble: multivariate-2x2x2
description: Test 2 models × 2 prompts × 2 temperatures

agents:
  # GPT-4o + Prompt A + Temp 0.3 (12.5%)
  - name: variant-aa-low
    condition: ${input.user_id % 8 === 0}
    operation: think
    component: prompt-a@v1.0.0
    config:
      provider: openai
      model: gpt-4o
      temperature: 0.3

  # GPT-4o + Prompt A + Temp 0.9 (12.5%)
  - name: variant-aa-high
    condition: ${input.user_id % 8 === 1}
    operation: think
    component: prompt-a@v1.0.0
    config:
      provider: openai
      model: gpt-4o
      temperature: 0.9

  # GPT-4o + Prompt B + Temp 0.3 (12.5%)
  - name: variant-ab-low
    condition: ${input.user_id % 8 === 2}
    operation: think
    component: prompt-b@v1.0.0
    config:
      provider: openai
      model: gpt-4o
      temperature: 0.3

  # GPT-4o + Prompt B + Temp 0.9 (12.5%)
  - name: variant-ab-high
    condition: ${input.user_id % 8 === 3}
    operation: think
    component: prompt-b@v1.0.0
    config:
      provider: openai
      model: gpt-4o
      temperature: 0.9

  # Claude + Prompt A + Temp 0.3 (12.5%)
  - name: variant-ba-low
    condition: ${input.user_id % 8 === 4}
    operation: think
    component: prompt-a@v1.0.0
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
      temperature: 0.3

  # Claude + Prompt A + Temp 0.9 (12.5%)
  - name: variant-ba-high
    condition: ${input.user_id % 8 === 5}
    operation: think
    component: prompt-a@v1.0.0
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
      temperature: 0.9

  # Claude + Prompt B + Temp 0.3 (12.5%)
  - name: variant-bb-low
    condition: ${input.user_id % 8 === 6}
    operation: think
    component: prompt-b@v1.0.0
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
      temperature: 0.3

  # Claude + Prompt B + Temp 0.9 (12.5%)
  - name: variant-bb-high
    condition: ${input.user_id % 8 === 7}
    operation: think
    component: prompt-b@v1.0.0
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
      temperature: 0.9
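Hand-writing eight variant blocks invites copy-paste mistakes. As a sketch (illustrative only — the generated objects mirror the YAML above but are not a real ensemble API), you can enumerate the factorial matrix in code and use it to generate or sanity-check the variant definitions:
// Sketch: enumerate the 2x2x2 factorial matrix of variants.
const models = [
  { provider: 'openai', model: 'gpt-4o' },
  { provider: 'anthropic', model: 'claude-3-5-sonnet-20241022' }
];
const prompts = ['prompt-a@v1.0.0', 'prompt-b@v1.0.0'];
const temperatures = [0.3, 0.9];

const variants = models.flatMap((m, mi) =>
  prompts.flatMap((p, pi) =>
    temperatures.map((t, ti) => ({
      // Bucket index matches the `user_id % 8 === n` conditions above
      bucket: mi * 4 + pi * 2 + ti,
      component: p,
      config: { ...m, temperature: t }
    }))
  )
);

console.log(`${variants.length} variants`); // 8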

Agent Version Testing

Test multiple versions of the same agent:
ensemble: test-agent-versions
description: Compare agent v1.0.0 vs v2.0.0 vs v3.0.0

agents:
  # Route 33% to each version
  - name: v1
    condition: ${input.user_id % 3 === 0}
    agent: my-agent@v1.0.0
    input:
      data: ${input.data}

  - name: v2
    condition: ${input.user_id % 3 === 1}
    agent: my-agent@v2.0.0
    input:
      data: ${input.data}

  - name: v3
    condition: ${input.user_id % 3 === 2}
    agent: my-agent@v3.0.0
    input:
      data: ${input.data}

  # Track which version was used
  - name: log-version
    operation: storage
    config:
      type: d1
      query: |
        INSERT INTO agent_versions (
          user_id, version, latency, success, timestamp
        ) VALUES (?, ?, ?, ?, ?)
      params:
        - ${input.user_id}
        - ${v1.executed ? 'v1.0.0' : v2.executed ? 'v2.0.0' : 'v3.0.0'}
        - ${v1.latency || v2.latency || v3.latency}
        - ${v1.executed ? v1.success : v2.executed ? v2.success : v3.success}
        - ${Date.now()}

output:
  version: ${v1.executed ? 'v1.0.0' : v2.executed ? 'v2.0.0' : 'v3.0.0'}
  result: ${v1.output || v2.output || v3.output}
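To pick a winning version, query the agent_versions table the same way as the 2×2 analysis. A sketch, assuming the table schema above and a D1 binding env.DB:
// Sketch: rank agent versions by success rate, then latency.
export async function compareVersions(env: Env) {
  const { results } = await env.DB.prepare(`
    SELECT version,
           COUNT(*) as requests,
           AVG(latency) as avg_latency,
           AVG(CASE WHEN success THEN 1.0 ELSE 0.0 END) * 100 as success_rate
    FROM agent_versions
    WHERE timestamp > ?
    GROUP BY version
    ORDER BY success_rate DESC, avg_latency ASC
  `).bind(Date.now() - 7 * 24 * 60 * 60 * 1000).all();

  console.log(results);
}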

Multi-Agent Coordination Testing

Test different multi-agent coordination patterns:

Parallel Research Agents

ensemble: multi-agent-research
description: Coordinate research, analysis, and writing agents

state:
  schema:
    researchFindings: array
    analysisResults: object
    synthesis: string

agents:
  # Phase 1: Parallel Research by Specialized Agents
  - parallel:
      - name: research-technical
        operation: think
        config:
          provider: openai
          model: gpt-4o
          prompt: |
            You are a technical research specialist.
            Research technical aspects of: ${input.topic}
            Provide detailed technical information and specifications.
        state:
          set: [researchFindings]

      - name: research-market
        operation: think
        config:
          provider: anthropic
          model: claude-3-5-sonnet-20241022
          prompt: |
            You are a market research specialist.
            Research market aspects of: ${input.topic}
            Analyze trends, adoption rates, and business implications.
        state:
          set: [researchFindings]

      - name: research-competitive
        operation: think
        config:
          provider: openai
          model: gpt-4o
          prompt: |
            You are a competitive analysis specialist.
            Research competitive landscape for: ${input.topic}
            Identify competitors and comparative advantages.
        state:
          set: [researchFindings]

  # Phase 2: Synthesis Agent Combines Research
  - name: synthesize-research
    operation: think
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
      prompt: |
        You are a research synthesis specialist.
        Combine multiple research perspectives into cohesive insights.

        Technical Research: ${research-technical.output}
        Market Research: ${research-market.output}
        Competitive Research: ${research-competitive.output}

        Provide 5-7 key insights that integrate all perspectives.
    state:
      use: [researchFindings]
      set: [synthesis]

  # Phase 3: Parallel Analysis by Domain Experts
  - parallel:
      - name: analyze-opportunities
        operation: think
        config:
          provider: openai
          model: gpt-4o
          prompt: |
            You are a strategic opportunity analyst.
            Based on these insights, identify opportunities:
            ${synthesize-research.output}
        state:
          set: [analysisResults]

      - name: analyze-risks
        operation: think
        config:
          provider: anthropic
          model: claude-3-5-sonnet-20241022
          prompt: |
            You are a risk assessment specialist.
            Based on these insights, analyze risks:
            ${synthesize-research.output}
        state:
          set: [analysisResults]

      - name: analyze-recommendations
        operation: think
        config:
          provider: openai
          model: gpt-4o
          prompt: |
            You are a strategic advisor.
            Based on these insights, provide actionable recommendations:
            ${synthesize-research.output}
        state:
          set: [analysisResults]

  # Phase 4: Generate Final Report
  - name: generate-report
    operation: think
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
      prompt: |
        Generate a comprehensive report based on:

        Synthesis: ${synthesize-research.output}
        Opportunities: ${analyze-opportunities.output}
        Risks: ${analyze-risks.output}
        Recommendations: ${analyze-recommendations.output}

output:
  report: ${generate-report.output}

Testing Different Coordination Patterns

Test sequential vs parallel vs hybrid coordination:
ensemble: test-coordination-patterns
description: Compare different multi-agent coordination approaches

agents:
  # Pattern A: Sequential (33%)
  - name: pattern-sequential
    condition: ${input.user_id % 3 === 0}
    operation: code
    config:
      code: |
        // Execute agents sequentially
        const agent1 = await executeAgent('agent-1', input);
        const agent2 = await executeAgent('agent-2', agent1.output);
        const agent3 = await executeAgent('agent-3', agent2.output);
        return agent3.output;

  # Pattern B: Parallel (33%)
  - parallel:
      - name: pattern-parallel-1
        condition: ${input.user_id % 3 === 1}
        operation: think
      - name: pattern-parallel-2
        condition: ${input.user_id % 3 === 1}
        operation: think
      - name: pattern-parallel-3
        condition: ${input.user_id % 3 === 1}
        operation: think

  # Pattern C: Hybrid (33%)
  - name: pattern-hybrid-fetch
    condition: ${input.user_id % 3 === 2}
    operation: http

  - parallel:
      - name: pattern-hybrid-analyze-1
        condition: ${input.user_id % 3 === 2}
        operation: think
        input: ${pattern-hybrid-fetch.output}
      - name: pattern-hybrid-analyze-2
        condition: ${input.user_id % 3 === 2}
        operation: think
        input: ${pattern-hybrid-fetch.output}

  - name: pattern-hybrid-synthesize
    condition: ${input.user_id % 3 === 2}
    operation: think
    input:
      analyze1: ${pattern-hybrid-analyze-1.output}
      analyze2: ${pattern-hybrid-analyze-2.output}

Component Version Matrix Testing

Test all combinations of component versions:
ensemble: component-version-matrix
description: Test prompt@v1 vs v2 with config@v1 vs v2

agents:
  # Prompt v1 + Config v1 (25%)
  - name: variant-p1c1
    condition: ${input.user_id % 4 === 0}
    operation: think
    component: my-prompt@v1.0.0
    config:
      provider: openai
      model: ${component.model-config@v1.0.0.model}
      temperature: ${component.model-config@v1.0.0.temperature}

  # Prompt v1 + Config v2 (25%)
  - name: variant-p1c2
    condition: ${input.user_id % 4 === 1}
    operation: think
    component: my-prompt@v1.0.0
    config:
      provider: openai
      model: ${component.model-config@v2.0.0.model}
      temperature: ${component.model-config@v2.0.0.temperature}

  # Prompt v2 + Config v1 (25%)
  - name: variant-p2c1
    condition: ${input.user_id % 4 === 2}
    operation: think
    component: my-prompt@v2.0.0
    config:
      provider: openai
      model: ${component.model-config@v1.0.0.model}
      temperature: ${component.model-config@v1.0.0.temperature}

  # Prompt v2 + Config v2 (25%)
  - name: variant-p2c2
    condition: ${input.user_id % 4 === 3}
    operation: think
    component: my-prompt@v2.0.0
    config:
      provider: openai
      model: ${component.model-config@v2.0.0.model}
      temperature: ${component.model-config@v2.0.0.temperature}

Real-World Example: Customer Support Analysis

Test different agent combinations for customer support ticket analysis:
ensemble: support-ticket-analysis
description: Test classification + sentiment + priority agents

agents:
  # Variant A: Single unified agent (33%)
  - name: unified-agent
    condition: ${input.ticket_id % 3 === 0}
    operation: think
    component: unified-support-prompt@v1.0.0
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
    input:
      ticket: ${input.ticket_text}

  # Variant B: Specialized parallel agents (33%)
  - parallel:
      - name: classify-ticket
        condition: ${input.ticket_id % 3 === 1}
        operation: think
        component: classification-prompt@v1.0.0
        config:
          provider: openai
          model: gpt-4o-mini
        input:
          ticket: ${input.ticket_text}

      - name: analyze-sentiment
        condition: ${input.ticket_id % 3 === 1}
        operation: think
        component: sentiment-prompt@v1.0.0
        config:
          provider: cloudflare
          model: '@cf/meta/llama-3.1-8b-instruct'
        input:
          ticket: ${input.ticket_text}

      - name: determine-priority
        condition: ${input.ticket_id % 3 === 1}
        operation: think
        component: priority-prompt@v1.0.0
        config:
          provider: openai
          model: gpt-4o-mini
        input:
          ticket: ${input.ticket_text}

  # Variant C: Sequential specialized agents (33%)
  - name: classify-first
    condition: ${input.ticket_id % 3 === 2}
    operation: think
    component: classification-prompt@v1.0.0
    input:
      ticket: ${input.ticket_text}

  - name: analyze-with-classification
    condition: ${input.ticket_id % 3 === 2}
    operation: think
    component: contextual-analysis-prompt@v1.0.0
    input:
      ticket: ${input.ticket_text}
      category: ${classify-first.output.category}

  # Store results for analysis
  - name: log-results
    operation: storage
    config:
      type: d1
      query: |
        INSERT INTO support_analysis_results (
          ticket_id, variant, latency, cost_estimate, timestamp
        ) VALUES (?, ?, ?, ?, ?)
      params:
        - ${input.ticket_id}
        - ${unified-agent.executed ? 'unified' : classify-ticket.executed ? 'parallel' : 'sequential'}
        - ${unified-agent.latency || (classify-ticket.latency + analyze-sentiment.latency + determine-priority.latency) || (classify-first.latency + analyze-with-classification.latency)}
        - ${calculate-cost(variant)}
        - ${Date.now()}

output:
  variant: ${unified-agent.executed ? 'unified' : classify-ticket.executed ? 'parallel' : 'sequential'}
  result: ${unified-agent.executed ? unified-agent.output : classify-ticket.executed ? {
    category: classify-ticket.output,
    sentiment: analyze-sentiment.output,
    priority: determine-priority.output
  } : analyze-with-classification.output}
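The calculate-cost helper in the logging step is not defined by this playbook. One possible sketch — the per-variant rates below are placeholders, not real provider pricing:
// Sketch of a per-variant cost estimator. The rates are illustrative
// placeholders only — substitute your own per-token accounting.
const COST_PER_CALL: Record<string, number> = {
  unified: 0.010,    // one large-model call
  parallel: 0.006,   // three small-model calls
  sequential: 0.008  // two calls, the second depending on the first
};

function estimateCost(variant: string): number {
  return COST_PER_CALL[variant] ?? 0;
}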

Dynamic Variant Allocation

Adjust traffic allocation based on performance:
ensemble: dynamic-multivariate
description: Auto-adjust variant traffic based on performance

agents:
  # Get current allocation percentages from KV
  - name: get-allocations
    operation: storage
    config:
      type: kv
      action: get
      key: variant-allocations
      default: { aa: 25, ab: 25, ba: 25, bb: 25 }

  # Determine which variant to use
  - name: select-variant
    operation: code
    config:
      code: |
        const allocations = ${get-allocations.output};
        const rand = ${input.user_id % 100};

        let cumulative = 0;
        for (const [variant, pct] of Object.entries(allocations)) {
          cumulative += pct;
          if (rand < cumulative) return variant;
        }
        return 'aa';

  # Execute selected variant
  - name: variant-aa
    condition: ${select-variant.output === 'aa'}
    operation: think
    component: prompt-a@v1.0.0
    config:
      provider: openai
      model: gpt-4o

  - name: variant-ab
    condition: ${select-variant.output === 'ab'}
    operation: think
    component: prompt-b@v1.0.0
    config:
      provider: openai
      model: gpt-4o

  - name: variant-ba
    condition: ${select-variant.output === 'ba'}
    operation: think
    component: prompt-a@v1.0.0
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022

  - name: variant-bb
    condition: ${select-variant.output === 'bb'}
    operation: think
    component: prompt-b@v1.0.0
    config:
      provider: anthropic
      model: claude-3-5-sonnet-20241022
Auto-adjust allocations based on metrics:
// Scheduled every hour
export default {
  async scheduled(event: ScheduledEvent, env: Env) {
    // Query performance of each variant from last hour
    const results = await env.DB.prepare(`
      SELECT
        variant,
        AVG(quality_score) as avg_quality,
        AVG(latency) as avg_latency,
        COUNT(*) as sample_size
      FROM multivariate_results
      WHERE timestamp > ?
      GROUP BY variant
    `).bind(Date.now() - 60 * 60 * 1000).all();

    // Calculate new allocations using Thompson Sampling
    const allocations = calculateOptimalAllocations(results.results);

    // Update KV
    await env.CACHE.put('variant-allocations', JSON.stringify(allocations));

    console.log('Updated variant allocations:', allocations);
  }
};
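calculateOptimalAllocations is the interesting part. Here is a sketch using Thompson Sampling with a normal approximation to each variant's quality posterior (the 0.1 noise scale and the 5% exploration floor are illustrative assumptions, not tuned values):
// Sketch: Thompson Sampling over variant quality. A production version
// might use Beta posteriors on success counts instead.
interface VariantRow {
  variant: string;
  avg_quality: number;
  avg_latency: number;
  sample_size: number;
}

function calculateOptimalAllocations(rows: VariantRow[]): Record<string, number> {
  const DRAWS = 10000;
  const wins: Record<string, number> =
    Object.fromEntries(rows.map(r => [r.variant, 0]));

  for (let i = 0; i < DRAWS; i++) {
    let best = '';
    let bestScore = -Infinity;
    for (const r of rows) {
      // Posterior ~ N(mean, sigma / sqrt(n)); sigma of 0.1 is an assumption
      const se = 0.1 / Math.sqrt(Math.max(r.sample_size, 1));
      const score = r.avg_quality + se * gaussian();
      if (score > bestScore) { bestScore = score; best = r.variant; }
    }
    wins[best]++;
  }

  // Win rates become traffic percentages, with a 5% exploration floor.
  // Note: rounding may leave the total slightly off 100; normalize if needed.
  const alloc: Record<string, number> = {};
  for (const r of rows) {
    alloc[r.variant] = Math.max(5, Math.round((wins[r.variant] / DRAWS) * 100));
  }
  return alloc;
}

// Standard normal sample via Box-Muller
function gaussian(): number {
  const u = Math.random() || 1e-9;
  const v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}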

Statistical Analysis

Ensure statistical significance before declaring a winner:
import { describe, it, expect } from 'vitest';
import { TestConductor } from '@ensemble-edge/conductor/testing';

describe('multivariate-test', () => {
  it('should have statistically significant results', async () => {
    const conductor = await TestConductor.create();
    const results: Record<string, { quality: number; latency: number }[]> =
      { aa: [], ab: [], ba: [], bb: [] };

    // Collect 1000 samples per variant
    for (let i = 0; i < 4000; i++) {
      const result = await conductor.execute('multivariate-2x2', {
        user_id: i,
        input: testInput
      });

      const variant = result.output.variant;
      results[variant].push({
        quality: result.quality_score,
        latency: result.latency
      });
    }

    // Calculate statistics for each variant
    const stats = Object.entries(results).map(([variant, data]) => ({
      variant,
      mean: calculateMean(data.map(d => d.quality)),
      stddev: calculateStdDev(data.map(d => d.quality)),
      sampleSize: data.length
    }));

    // Find best variant
    const best = stats.reduce((a, b) => a.mean > b.mean ? a : b);

    // Run t-test against second best
    const secondBest = stats.filter(s => s !== best)
      .reduce((a, b) => a.mean > b.mean ? a : b);

    const tTest = performTTest(best, secondBest);

    // p-value < 0.05 means statistically significant
    expect(tTest.pValue).toBeLessThan(0.05);

    console.log(`Winner: ${best.variant} with p-value ${tTest.pValue}`);
  });
});
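The helpers calculateMean, calculateStdDev, and performTTest are left undefined in the example. A minimal sketch using Welch's t-test, where the large per-variant samples justify a normal approximation for the p-value:
// Sketch of the statistical helpers used in the test above.
function calculateMean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function calculateStdDev(xs: number[]): number {
  const m = calculateMean(xs);
  return Math.sqrt(xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1));
}

interface VariantStats { variant: string; mean: number; stddev: number; sampleSize: number; }

// Welch's t-test; two-sided p-value via the normal CDF (large-sample approximation)
function performTTest(a: VariantStats, b: VariantStats): { t: number; pValue: number } {
  const se = Math.sqrt(a.stddev ** 2 / a.sampleSize + b.stddev ** 2 / b.sampleSize);
  const t = (a.mean - b.mean) / se;
  const pValue = 2 * (1 - normalCdf(Math.abs(t)));
  return { t, pValue };
}

function normalCdf(z: number): number {
  // Phi(z) = (1 + erf(z / sqrt(2))) / 2, erf via Abramowitz & Stegun 7.1.26
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
    - 0.284496736) * t + 0.254829592) * t;
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}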

Best Practices

1. Start with Factorial Design

Test all combinations systematically:
2 models × 2 prompts = 4 variants (2²)
2 models × 2 prompts × 2 temps = 8 variants (2³)
3 models × 2 prompts × 2 temps = 12 variants (3×2×2)

2. Ensure Sufficient Sample Size

// Minimum sample size per variant
const minSampleSize = 385; // For 95% confidence, ±5% margin

// Monitor sample sizes
const sampleSizes = await env.DB.prepare(`
  SELECT variant, COUNT(*) as n
  FROM multivariate_results
  WHERE timestamp > ?
  GROUP BY variant
`).bind(Date.now() - 7 * 24 * 60 * 60 * 1000).all();

const insufficient = sampleSizes.results.filter(r => r.n < minSampleSize);
if (insufficient.length > 0) {
  console.warn('Insufficient samples:', insufficient);
}
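The 385 figure comes from the standard proportion sample-size formula n = z² · p(1 − p) / e², with z = 1.96 (95% confidence), worst-case p = 0.5, and margin e = 0.05:
// Sketch: required sample size per variant for a proportion estimate.
// n = z^2 * p * (1 - p) / e^2, using worst-case p = 0.5
function requiredSampleSize(zScore = 1.96, margin = 0.05, p = 0.5): number {
  return Math.ceil((zScore ** 2 * p * (1 - p)) / margin ** 2);
}

console.log(requiredSampleSize()); // 385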

3. Control for Confounding Variables

Ensure random assignment:
# Good - random based on user_id
condition: ${input.user_id % 4 === 0}

# Bad - not random (time-based bias)
condition: ${Date.now() % 4 === 0}
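Note that user_id % 4 is only as random as the IDs themselves; sequential IDs correlated with signup date can still bias buckets. Hashing the ID first spreads any ID distribution evenly. A sketch using the FNV-1a hash (an illustrative choice, not a requirement):
// Sketch: deterministic bucket assignment via FNV-1a hashing, so that
// sequential or skewed user IDs still spread evenly across variants.
function bucketFor(userId: string | number, buckets: number): number {
  const s = String(userId);
  let h = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // FNV prime, kept to 32 bits
  }
  return h % buckets;
}

// bucketFor(input.user_id, 4) === 0  ->  variant-aa, and so on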

4. Monitor Interaction Effects

Some combinations may interact unexpectedly:
-- Check for interaction effects
SELECT
  model,
  prompt,
  AVG(quality_score) as avg_quality
FROM multivariate_results
GROUP BY model, prompt;

-- Example result showing interaction:
-- gpt-4o + prompt-a = 0.85  (expected: 0.80)
-- gpt-4o + prompt-b = 0.65  (expected: 0.70)
-- claude + prompt-a = 0.75  (expected: 0.80)
-- claude + prompt-b = 0.85  (expected: 0.70)
-- Prompt-b works better with Claude!
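The "expected" values above are additive predictions from the model and prompt main effects; the interaction effect is the gap between observed and expected. A sketch of that arithmetic (the additive baseline computed this way will not exactly reproduce the illustrative numbers above, but the sign of the interaction is the same):
// Sketch: compute additive expectations and interaction effects from
// the observed quality means in the example above.
type Cell = { model: string; prompt: string; quality: number };

const cells: Cell[] = [
  { model: 'gpt-4o', prompt: 'prompt-a', quality: 0.85 },
  { model: 'gpt-4o', prompt: 'prompt-b', quality: 0.65 },
  { model: 'claude', prompt: 'prompt-a', quality: 0.75 },
  { model: 'claude', prompt: 'prompt-b', quality: 0.85 }
];

const mean = (xs: Cell[]) => xs.reduce((s, c) => s + c.quality, 0) / xs.length;
const grand = mean(cells); // 0.775

for (const c of cells) {
  // Additive model: model mean + prompt mean - grand mean
  const expected =
    mean(cells.filter(x => x.model === c.model)) +
    mean(cells.filter(x => x.prompt === c.prompt)) -
    grand;
  console.log(c.model, c.prompt, 'interaction:', (c.quality - expected).toFixed(3));
}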

Next Steps