
Overview

A/B testing is a core architectural principle of Ensemble Edge, not an afterthought. By combining Edgit’s multiverse versioning with Conductor’s edge-native execution, you can run experiments at a scale and speed impossible with traditional platforms.
The Vision: A/B testing is just the beginning. As Conductor matures, we’re building toward autonomous optimization that leverages edge distribution and AI to find optimal configurations faster than any human-driven experimentation process.

Current Capabilities

Multiverse Versioning with Edgit

Edgit’s independent component versioning enables true multiverse experimentation:
# Production runs three different timelines simultaneously
extraction-prompt@v0.1.0  # Ancient but perfect
company-analyzer@v3.0.0   # Latest stable
validation-sql@v2.5.0     # Optimal performance

# Deploy variant A
edgit deploy set extraction-prompt v1.0.0 --to prod-variant-a
edgit deploy set company-analyzer v2.0.0 --to prod-variant-a

# Deploy variant B
edgit deploy set extraction-prompt v1.5.0 --to prod-variant-b
edgit deploy set company-analyzer v2.0.0 --to prod-variant-b

# Both live simultaneously, no conflicts
Key Advantage: There's no need to version the entire codebase. You can combine component versions from different points in history and test which combination performs best.

A/B Testing Patterns

1. Prompt Optimization

Test different prompt versions to maximize quality:
# ensembles/company-analysis.yaml
name: company-analysis

flow:
  - member: analyze
    type: Think
    config:
      # Load from deployed version
      component: analysis-prompt@${env.PROMPT_VERSION}
      provider: anthropic
      model: claude-3-5-sonnet-20241022
    scoring:
      evaluator: validate
      thresholds:
        minimum: 0.8
Deployment:
# Variant A: Original prompt
edgit deploy set analysis-prompt v1.0.0 --to prod-a
wrangler secret put PROMPT_VERSION --env prod-a
# Set: "v1.0.0"

# Variant B: Refined prompt
edgit deploy set analysis-prompt v1.2.0 --to prod-b
wrangler secret put PROMPT_VERSION --env prod-b
# Set: "v1.2.0"

# Route 50% traffic to each variant
# Cloudflare Workers can route based on userId hash
Measure:
  • Quality scores from Conductor’s scoring system
  • Execution time and cost
  • User satisfaction metrics
  • Downstream conversion rates

2. Model Selection

Test different AI models for optimal cost/quality balance:
# Variant A: GPT-4 (high quality, high cost)
config:
  provider: openai
  model: gpt-4o
  routing: cloudflare-gateway

# Variant B: Claude Sonnet (balanced)
config:
  provider: anthropic
  model: claude-3-5-sonnet-20241022
  routing: cloudflare-gateway

# Variant C: Workers AI (low cost, edge-native)
config:
  provider: cloudflare
  model: '@cf/meta/llama-3.1-8b-instruct'
Edge Advantage: All models execute at the edge with AI Gateway caching, ensuring a fair latency comparison.
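
At runtime, a Worker can select the variant's model configuration before invoking the ensemble. A minimal sketch, assuming the assigned variant arrives in a MODEL_VARIANT environment variable:
// Map each variant to the provider/model pair from the YAML above
const modelVariants = {
  a: { provider: 'openai', model: 'gpt-4o' },
  b: { provider: 'anthropic', model: 'claude-3-5-sonnet-20241022' },
  c: { provider: 'cloudflare', model: '@cf/meta/llama-3.1-8b-instruct' },
} as const;

function selectModelConfig(variant: string) {
  // Fall back to the balanced option if the variant is missing or unknown
  return modelVariants[variant as keyof typeof modelVariants] ?? modelVariants.b;
}

// Usage: const config = selectModelConfig(env.MODEL_VARIANT);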

3. Workflow Structure

Test different ensemble flows:
# ensembles/company-intel-v1.yaml (Sequential)
flow:
  - member: fetch-data
  - member: analyze
  - member: generate-report

# ensembles/company-intel-v2.yaml (Parallel)
flow:
  - parallel:
      - member: fetch-company-data
      - member: fetch-financials
      - member: fetch-news
  - member: analyze-all
  - member: generate-report
Measure:
  • Total execution time
  • Cache hit rates
  • Error rates
  • Quality scores
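To measure the first two items, a quick harness can run both flows against the same input and compare wall-clock time alongside the built-in quality score (the metadata field matches the scoring examples later on this page):
// Run both ensemble variants against identical input
for (const ensemble of ['company-intel-v1', 'company-intel-v2']) {
  const start = Date.now();
  const result = await conductorClient.execute({
    ensemble,
    input: { domain: 'acme.com' }
  });
  console.log(ensemble, {
    wallClockMs: Date.now() - start,               // total execution time
    qualityScore: result.metadata?.scoring?.score  // 0.0 - 1.0
  });
}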

4. State Management Strategies

Test prop drilling vs shared state:
# Variant A: Prop drilling
flow:
  - member: fetch
  - member: transform
    input:
      data: ${fetch.output.data}
  - member: analyze
    input:
      data: ${transform.output.data}

# Variant B: Shared state
state:
  schema:
    data: object

flow:
  - member: fetch
    state:
      set: [data]
  - member: transform
    state:
      use: [data]
  - member: analyze
    state:
      use: [data]
Measure:
  • Bundle size (state management adds overhead)
  • Execution speed
  • Debuggability and maintainability

5. Caching Strategies

Test aggressive vs conservative caching:
# Variant A: Aggressive caching
- member: expensive-api-call
  cache:
    ttl: 86400  # 24 hours

# Variant B: Conservative caching
- member: expensive-api-call
  cache:
    ttl: 3600   # 1 hour

# Variant C: No caching
- member: expensive-api-call
Measure:
  • Cache hit rate
  • Data freshness
  • Cost savings
  • User satisfaction
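For the cost-savings item in the list above, the arithmetic is simple: with cache hit rate h and per-call cost c, the effective cost per request is (1 - h) × c. A quick sketch with illustrative (not measured) numbers:
function effectiveCostPerRequest(hitRate: number, costPerCall: number): number {
  return (1 - hitRate) * costPerCall;
}

// A 24h TTL that reaches a 90% hit rate cuts a $0.01 call to $0.001/request;
// a 1h TTL at a 50% hit rate only halves it.
effectiveCostPerRequest(0.9, 0.01); // 0.001
effectiveCostPerRequest(0.5, 0.01); // 0.005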

Edge-Native Experimentation

Instant Rollout

Traditional A/B testing requires deployment pipelines. With Conductor + Edgit:
# Deploy new version globally in < 50ms
edgit deploy set analysis-prompt v2.0.0 --to prod

# Instant rollback if quality drops
edgit deploy set analysis-prompt v1.0.0 --to prod
No build step. No container deployment. No waiting.

Geographic Distribution

Test variants by region automatically:
// Cloudflare Workers automatically provides request.cf.colo
export default {
  async fetch(request: Request, env: Env) {
    const colo = (request.cf?.colo as string) ?? ''; // Airport code (e.g., "SJC")

    // Route US West to variant A, everything else to variant B
    const variant = ['SJC', 'LAX', 'SEA'].includes(colo) ? 'a' : 'b';

    // Env var names can't contain hyphens: PROMPT_VERSION_A / PROMPT_VERSION_B
    const promptVersion = env[`PROMPT_VERSION_${variant.toUpperCase()}`];
    const input = await request.json<Record<string, unknown>>();

    return conductorClient.execute({
      ensemble: 'company-analysis',
      input: { ...input, promptVersion }
    });
  }
};
Edge Advantage: 300+ locations worldwide, no latency penalty for experimentation.

User-Based Routing

Consistent user experience with deterministic routing:
// Simple FNV-1a hash; any stable hash works, it just has to be deterministic
function hashUserId(userId: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < userId.length; i++) {
    hash ^= userId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

function getVariant(userId: string): 'a' | 'b' {
  // Deterministic hash ensures the same user always gets the same variant
  return hashUserId(userId) % 100 < 50 ? 'a' : 'b';
}

export default {
  async fetch(request: Request, env: Env) {
    const userId = request.headers.get('x-user-id') ?? 'anonymous';
    const variant = getVariant(userId);
    const input = await request.json<Record<string, unknown>>();

    return conductorClient.execute({
      ensemble: 'company-analysis',
      input: { ...input, variant }
    });
  }
};

Measuring Results

Built-in Quality Scoring

Conductor’s scoring system provides automatic quality metrics:
scoring:
  enabled: true
  defaultThresholds:
    minimum: 0.7
    target: 0.85

flow:
  - member: analyze
    scoring:
      evaluator: validate
      thresholds:
        minimum: 0.8
      criteria:
        accuracy: "Analysis must be factually accurate"
        completeness: "All required sections present"
Every execution emits:
  • Quality score (0.0 - 1.0)
  • Execution time
  • Token usage / cost
  • Cache hit/miss
  • Retry count

Analytics Engine Integration

Log experiment data to Cloudflare Analytics Engine:
// Log A/B test results
env.ANALYTICS?.writeDataPoint({
  blobs: [
    ensembleName,
    variant,
    userId
  ],
  doubles: [
    executionTime,
    qualityScore,
    cost
  ],
  indexes: [
    variant  // Fast filtering by variant
  ]
});
Query results:
SELECT
  blob2 AS variant,  -- blob2 holds the variant, per the writeDataPoint call above
  AVG(double1) AS avg_execution_time,
  AVG(double2) AS avg_quality_score,
  AVG(double3) AS avg_cost,
  COUNT(*) AS sample_size
FROM analytics
WHERE blob1 = 'company-analysis'
  AND timestamp > NOW() - INTERVAL '7' DAY
GROUP BY blob2

Custom Metrics

Track business outcomes:
const result = await conductorClient.execute({
  ensemble: 'company-analysis',
  input: { domain: 'acme.com' }
});

// Log business metrics
await logMetric({
  variant: env.VARIANT,
  qualityScore: result.metadata.scoring?.score,
  executionTime: result.executionTime,
  userConverted: await checkConversion(userId),
  revenue: await getRevenue(userId),
  timestamp: Date.now()
});
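
One way to implement the logMetric helper above is to reuse Analytics Engine, mirroring the writeDataPoint layout from the previous section. A minimal sketch, shown taking the env binding explicitly; the field order is an assumption you would keep consistent with your queries:
interface BusinessMetric {
  variant: string;
  qualityScore?: number;
  executionTime: number;
  userConverted: boolean;
  revenue: number;
  timestamp: number;
}

function logMetric(env: Env, m: BusinessMetric): void {
  env.ANALYTICS?.writeDataPoint({
    blobs: [m.variant],
    doubles: [
      m.qualityScore ?? 0,
      m.executionTime,
      m.userConverted ? 1 : 0, // booleans stored as 0/1 doubles
      m.revenue
    ],
    indexes: [m.variant] // Analytics Engine allows a single index
  });
}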

The Future: Autonomous Optimization

A/B testing is just the beginning. Here’s where Ensemble Edge is heading:

Multi-Armed Bandit

Automatically adjust traffic based on real-time results:
// Conductor learns optimal allocation
const bandit = new MultiArmedBandit({
  variants: ['v1.0.0', 'v1.2.0', 'v2.0.0'],
  metric: 'quality_score',
  explorationRate: 0.1
});

// Starts 33/33/33, converges to optimal allocation
const variant = await bandit.selectVariant(context);
Coming in v1.1:
  • Bayesian optimization
  • Thompson sampling
  • Contextual bandits (vary by user attributes)
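
Until the built-in bandit ships, a hand-rolled epsilon-greedy allocator captures the core idea. A minimal in-memory sketch; in production the counts would live in a Durable Object so all edge locations share them:
class EpsilonGreedy {
  private stats = new Map<string, { pulls: number; rewardSum: number }>();

  constructor(private variants: string[], private epsilon = 0.1) {
    for (const v of variants) this.stats.set(v, { pulls: 0, rewardSum: 0 });
  }

  selectVariant(): string {
    if (Math.random() < this.epsilon) {
      // Explore: uniformly random variant
      return this.variants[Math.floor(Math.random() * this.variants.length)];
    }
    // Exploit: variant with the best observed mean reward
    let best = this.variants[0];
    let bestMean = -Infinity;
    for (const [v, s] of this.stats) {
      const mean = s.pulls === 0 ? Infinity : s.rewardSum / s.pulls; // try unseen first
      if (mean > bestMean) {
        bestMean = mean;
        best = v;
      }
    }
    return best;
  }

  recordReward(variant: string, reward: number): void {
    const s = this.stats.get(variant)!;
    s.pulls += 1;
    s.rewardSum += reward; // e.g. the execution's quality score
  }
}

// Usage: const v = bandit.selectVariant(); ...run...; bandit.recordReward(v, score);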

Hyperparameter Tuning

Optimize LLM parameters automatically:
// Define search space
const searchSpace = {
  temperature: [0.3, 0.5, 0.7, 0.9],
  maxTokens: [1000, 2000, 4000],
  topP: [0.8, 0.9, 0.95, 1.0]
};

// Conductor explores combinations
const optimizer = new GridSearch({
  searchSpace,
  metric: 'quality_score',
  budget: 1000  // Max evaluations
});

await optimizer.optimize('analysis-prompt');
Coming in v1.2:
  • Bayesian optimization for continuous parameters
  • Early stopping for failed configurations
  • Multi-objective optimization (quality + cost + speed)
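
The GridSearch API above is planned, but the mechanics are easy to sketch today: enumerate the cartesian product of the search space and keep the best-scoring configuration. Here evaluate() is a hypothetical callback that runs the ensemble with the candidate parameters and returns its quality score:
type Config = Record<string, number>;

// Lazily enumerate every combination in the search space
function* cartesian(space: Record<string, number[]>): Generator<Config> {
  const keys = Object.keys(space);
  function* walk(i: number, acc: Config): Generator<Config> {
    if (i === keys.length) { yield { ...acc }; return; }
    for (const value of space[keys[i]]) {
      acc[keys[i]] = value;
      yield* walk(i + 1, acc);
    }
  }
  yield* walk(0, {});
}

async function gridSearch(
  space: Record<string, number[]>,
  evaluate: (c: Config) => Promise<number>,
  budget: number
): Promise<Config | undefined> {
  let best: Config | undefined;
  let bestScore = -Infinity;
  let evaluations = 0;
  for (const candidate of cartesian(space)) {
    if (evaluations++ >= budget) break; // respect the evaluation budget
    const score = await evaluate(candidate);
    if (score > bestScore) { bestScore = score; best = candidate; }
  }
  return best;
}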

Prompt Evolution

AI-generated prompt variants tested automatically:
// Conductor generates variants using meta-prompting
const promptEvolution = new PromptEvolution({
  basePrompt: extractionPrompt,
  objective: 'maximize accuracy on financial data',
  generations: 5,
  populationSize: 10,
  mutationRate: 0.2
});

// Evolves prompts overnight, selects best performer
const optimizedPrompt = await promptEvolution.evolve();
Coming in v1.3:
  • Genetic algorithms for prompt optimization
  • Cross-breeding high-performing prompts
  • Automatic evaluation against test suites
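
The PromptEvolution API is also still to come, but one generation of the loop looks roughly like this. mutatePrompt and scorePrompt are hypothetical callbacks: the first would ask an LLM to rewrite a prompt, the second would score it against a test suite:
async function evolveOneGeneration(
  population: string[],
  scorePrompt: (p: string) => Promise<number>,
  mutatePrompt: (p: string) => Promise<string>,
  mutationRate = 0.2
): Promise<string[]> {
  // Score every prompt in the current population
  const scored = await Promise.all(
    population.map(async (p) => ({ prompt: p, score: await scorePrompt(p) }))
  );

  // Keep the top half as survivors
  scored.sort((a, b) => b.score - a.score);
  const survivors = scored.slice(0, Math.ceil(scored.length / 2));

  // Refill the population with (sometimes mutated) copies of the survivors
  const next: string[] = survivors.map((s) => s.prompt);
  while (next.length < population.length) {
    const parent = survivors[Math.floor(Math.random() * survivors.length)];
    next.push(
      Math.random() < mutationRate
        ? await mutatePrompt(parent.prompt)
        : parent.prompt
    );
  }
  return next;
}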

Edge-Native Gradient Descent

Optimize at the edge with zero central coordination:
┌──────────────────────────────────────────────┐
│             Global Optimization              │
│                                              │
│  ┌──────┐   ┌──────┐   ┌──────┐   ┌──────┐   │
│  │ SJC  │   │ IAD  │   │ LHR  │   │ SYD  │   │
│  │ Edge │   │ Edge │   │ Edge │   │ Edge │   │
│  │      │   │      │   │      │   │      │   │
│  │ Test │   │ Test │   │ Test │   │ Test │   │
│  │ Local│   │ Local│   │ Local│   │ Local│   │
│  └──┬───┘   └──┬───┘   └──┬───┘   └──┬───┘   │
│     │          │          │          │       │
│     └──────────┴──────────┴──────────┘       │
│               Durable Objects                │
│             Aggregate & Optimize             │
└──────────────────────────────────────────────┘
Each edge location:
  • Tests configurations locally
  • Reports results to Durable Object
  • Receives updated optimal config
Zero latency penalty. Millions of experiments per day.
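The aggregation side of this diagram can be sketched as a Durable Object that collects per-colo results and serves back the best-known configuration. The request shapes here are assumptions, not a shipped Conductor API:
export class ExperimentAggregator {
  // In-memory for brevity; a real implementation would persist via
  // this.state.storage so results survive eviction.
  private results = new Map<string, { samples: number; scoreSum: number }>();

  constructor(private state: DurableObjectState) {}

  async fetch(request: Request): Promise<Response> {
    const url = new URL(request.url);

    if (url.pathname === '/report' && request.method === 'POST') {
      // Edge locations POST their local test results here
      const { config, score } = await request.json<{ config: string; score: number }>();
      const entry = this.results.get(config) ?? { samples: 0, scoreSum: 0 };
      entry.samples += 1;
      entry.scoreSum += score;
      this.results.set(config, entry);
      return new Response('ok');
    }

    // Any other request returns the current best-known configuration
    let best = '';
    let bestMean = -Infinity;
    for (const [config, { samples, scoreSum }] of this.results) {
      const mean = scoreSum / samples;
      if (mean > bestMean) {
        bestMean = mean;
        best = config;
      }
    }
    return Response.json({ best, mean: bestMean });
  }
}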

Composite Optimization

Optimize the entire stack simultaneously:
# Conductor optimizes across:
optimization:
  dimensions:
    - prompt_version      # Which prompt
    - model              # Which LLM
    - temperature        # LLM parameter
    - caching_strategy   # Cache TTL
    - parallel_degree    # How much parallelization
    - state_management   # Shared state vs prop drilling
    - retry_strategy     # Exponential vs linear backoff

  objectives:
    - maximize: quality_score
    - minimize: cost
    - minimize: latency

  constraints:
    - quality_score >= 0.85
    - latency <= 2000  # ms
    - cost <= 0.10     # per request
Coming in v2.0:
  • Pareto frontier optimization
  • Multi-objective evolutionary algorithms
  • Automatic constraint satisfaction

Why Edge Makes This Possible

Traditional A/B testing platforms face fundamental limitations:

❌ Traditional Platforms

  • Centralized coordination - Single point of failure
  • Deployment latency - Minutes to hours to roll out variants
  • Geographic bias - US-centric testing affects global users differently
  • Scale limits - Expensive to test millions of combinations
  • Manual analysis - Humans interpret results and make decisions

✅ Ensemble Edge

  • Distributed execution - 300+ locations worldwide
  • Instant deployment - < 50ms global rollout via KV
  • Geographic fairness - Each region tests independently
  • Unlimited scale - Edge workers handle millions of experiments
  • Autonomous optimization - AI finds optimal configurations automatically

Best Practices

1. Start Simple

Begin with single-dimension tests:
# Test one thing at a time
edgit deploy set extraction-prompt v1.0.0 --to prod-a
edgit deploy set extraction-prompt v1.1.0 --to prod-b

# Once confident, expand

2. Use Quality Scoring

Let Conductor measure quality automatically:
scoring:
  enabled: true
  defaultThresholds:
    minimum: 0.8  # Auto-fail below this

3. Monitor Business Metrics

Don’t optimize for proxy metrics:
// ✅ Good - Track revenue
await logMetric({ variant, revenue });

// ❌ Bad - Optimize engagement without checking conversion
await logMetric({ variant, clicks });

4. Set Sample Size Requirements

Ensure statistical significance:
// Require minimum sample size before declaring winner
const MIN_SAMPLES = 1000;
const CONFIDENCE = 0.95;

if (variantA.samples >= MIN_SAMPLES &&
    variantB.samples >= MIN_SAMPLES) {
  const pValue = tTest(variantA, variantB);
  if (pValue < (1 - CONFIDENCE)) {
    // Statistically significant
    promoteWinner(variantA.score > variantB.score ? 'a' : 'b');
  }
}
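The tTest helper above is left abstract. With 1,000+ samples per variant, a normal approximation to the t distribution is accurate enough, so a minimal Welch-style implementation only needs each variant's mean, variance, and count:
interface Sample { mean: number; variance: number; n: number }

// Two-sided p-value for the difference in means (Welch's t statistic)
function tTest(a: Sample, b: Sample): number {
  const standardError = Math.sqrt(a.variance / a.n + b.variance / b.n);
  const t = Math.abs(a.mean - b.mean) / standardError;
  return 2 * (1 - normalCdf(t)); // valid because t >= 0 here
}

// Standard normal CDF via the Abramowitz-Stegun erf approximation (x >= 0)
function normalCdf(x: number): number {
  const z = x / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * z);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t -
      0.284496736) * t + 0.254829592) * t;
  return 0.5 * (1 + (1 - poly * Math.exp(-z * z)));
}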

5. Use Gradual Rollout

Start with small traffic percentage:
// Week 1: 5% to variant B
if (Math.random() < 0.05) variant = 'b';

// Week 2: 25% to variant B (if successful)
if (Math.random() < 0.25) variant = 'b';

// Week 3: 100% to variant B (promote to default)
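
One caveat: Math.random() re-rolls the dice on every request, so a user can bounce between variants mid-rollout. Reusing the deterministic hash from the User-Based Routing section keeps assignments sticky while the percentage grows. ROLLOUT_PERCENT is an assumed env var you raise from 5 to 25 to 100:
// Same user stays in the same bucket as the rollout percentage increases
function getRampedVariant(userId: string, rolloutPercent: number): 'a' | 'b' {
  return hashUserId(userId) % 100 < rolloutPercent ? 'b' : 'a';
}

// e.g. const variant = getRampedVariant(userId, Number(env.ROLLOUT_PERCENT));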

Roadmap

v1.1 (Q2 2025)

  • ✅ Multi-armed bandit support
  • ✅ Bayesian optimization
  • ✅ Automatic traffic allocation
  • ✅ Statistical significance testing

v1.2 (Q3 2025)

  • ✅ Hyperparameter grid search
  • ✅ Multi-objective optimization
  • ✅ Contextual bandits
  • ✅ Automatic rollback on quality degradation

v1.3 (Q4 2025)

  • ✅ Prompt evolution with genetic algorithms
  • ✅ Meta-prompting for variant generation
  • ✅ Composite optimization across full stack

v2.0 (2026)

  • ✅ Autonomous edge-native gradient descent
  • ✅ Zero-latency optimization at 300+ locations
  • ✅ Pareto frontier multi-objective optimization
  • ✅ Learned optimization strategies per ensemble

Why This Matters

Traditional A/B testing: "Let's test prompt A vs prompt B for two weeks, analyze the results, pick a winner, deploy, repeat."

Ensemble Edge vision: "Deploy 50 variants globally, let edge locations test locally, let AI find the optimal configuration in 48 hours, automatically promote the winner, and keep optimizing forever."

The difference: 10x faster iteration, 100x more experiments, zero human coordination overhead.

A/B testing is not a feature; it's a fundamental architectural capability of Ensemble Edge. By combining Git-native versioning with edge execution, we're building toward optimization at a scale impossible with traditional platforms. This is just the beginning.