Overview
A/B testing is a core architectural principle of Ensemble Edge, not an afterthought. By combining Edgit’s multiverse versioning with Conductor’s edge-native execution, you can run experiments at a scale and speed impossible with traditional platforms.
The Vision: A/B testing is just the beginning. As Conductor matures, we’re building toward autonomous optimization that leverages edge distribution and AI to find optimal configurations faster than any human-driven experimentation process.
Current Capabilities
Multiverse Versioning with Edgit
Edgit’s independent component versioning enables true multiverse experimentation:
# Production runs three different timelines simultaneously
extraction-prompt@v0.1.0 # Ancient but perfect
company-analyzer@v3.0.0 # Latest stable
validation-sql@v2.5.0 # Optimal performance
# Deploy variant A
edgit deploy set extraction-prompt v1.0.0 --to prod-variant-a
edgit deploy set company-analyzer v2.0.0 --to prod-variant-a
# Deploy variant B
edgit deploy set extraction-prompt v1.5.0 --to prod-variant-b
edgit deploy set company-analyzer v2.0.0 --to prod-variant-b
# Both live simultaneously, no conflicts
Key Advantage: No need to version the entire codebase. Each component is versioned independently, so you can test combinations drawn from different points in their individual histories.
A/B Testing Patterns
1. Prompt Optimization
Test different prompt versions to maximize quality:
# ensembles/company-analysis.yaml
name: company-analysis
flow:
  - member: analyze
    type: Think
    config:
      # Load from deployed version
      component: analysis-prompt@${env.PROMPT_VERSION}
      provider: anthropic
      model: claude-3-5-sonnet-20241022
    scoring:
      evaluator: validate
      thresholds:
        minimum: 0.8
Deployment:
# Variant A: Original prompt
edgit deploy set analysis-prompt v1.0.0 --to prod-a
wrangler secret put PROMPT_VERSION --env prod-a
# Set: "v1.0.0"
# Variant B: Refined prompt
edgit deploy set analysis-prompt v1.2.0 --to prod-b
wrangler secret put PROMPT_VERSION --env prod-b
# Set: "v1.2.0"
# Route 50% of traffic to each variant, e.g. by hashing userId
# in a Cloudflare Worker (see User-Based Routing below)
Measure:
- Quality scores from Conductor’s scoring system
- Execution time and cost
- User satisfaction metrics
- Downstream conversion rates
2. Model Selection
Test different AI models for optimal cost/quality balance:
# Variant A: GPT-4o (high quality, high cost)
config:
  provider: openai
  model: gpt-4o
  routing: cloudflare-gateway

# Variant B: Claude Sonnet (balanced)
config:
  provider: anthropic
  model: claude-3-5-sonnet-20241022
  routing: cloudflare-gateway

# Variant C: Workers AI (low cost, edge-native)
config:
  provider: cloudflare
  model: '@cf/meta/llama-3.1-8b-instruct'
Edge Advantage: All models execute at the edge with AI Gateway caching, ensuring fair latency comparison.
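In the Worker that triggers the ensemble, the active variant can be mapped to one of these configs. A minimal sketch, where the helper and the inline config table are illustrative assumptions rather than Conductor APIs:

// Hypothetical helper: map an experiment variant to one of the model configs above.
type ModelConfig = { provider: string; model: string; routing?: string };

function modelConfigForVariant(variant: 'a' | 'b' | 'c'): ModelConfig {
  const configs: Record<'a' | 'b' | 'c', ModelConfig> = {
    a: { provider: 'openai', model: 'gpt-4o', routing: 'cloudflare-gateway' },
    b: { provider: 'anthropic', model: 'claude-3-5-sonnet-20241022', routing: 'cloudflare-gateway' },
    c: { provider: 'cloudflare', model: '@cf/meta/llama-3.1-8b-instruct' }
  };
  return configs[variant];
}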
3. Workflow Structure
Test different ensemble flows:
# ensembles/company-intel-v1.yaml (Sequential)
flow:
  - member: fetch-data
  - member: analyze
  - member: generate-report

# ensembles/company-intel-v2.yaml (Parallel)
flow:
  - parallel:
      - member: fetch-company-data
      - member: fetch-financials
      - member: fetch-news
  - member: analyze-all
  - member: generate-report
Measure:
- Total execution time
- Cache hit rates
- Error rates
- Quality scores
4. State Management Strategies
Test prop drilling vs shared state:
# Variant A: Prop drilling
flow:
  - member: fetch
  - member: transform
    input:
      data: ${fetch.output.data}
  - member: analyze
    input:
      data: ${transform.output.data}

# Variant B: Shared state
state:
  schema:
    data: object
flow:
  - member: fetch
    state:
      set: [data]
  - member: transform
    state:
      use: [data]
  - member: analyze
    state:
      use: [data]
Measure:
- Bundle size (state management adds overhead)
- Execution speed
- Debuggability and maintainability
5. Caching Strategies
Test aggressive vs conservative caching:
# Variant A: Aggressive caching
- member: expensive-api-call
  cache:
    ttl: 86400  # 24 hours

# Variant B: Conservative caching
- member: expensive-api-call
  cache:
    ttl: 3600  # 1 hour

# Variant C: No caching
- member: expensive-api-call
Measure:
- Cache hit rate
- Data freshness
- Cost savings
- User satisfaction
Edge-Native Experimentation
Instant Rollout
Traditional A/B testing requires deployment pipelines. With Conductor + Edgit:
# Deploy new version globally in < 50ms
edgit deploy set analysis-prompt v2.0.0 --to prod
# Instant rollback if quality drops
edgit deploy set analysis-prompt v1.0.0 --to prod
No build step. No container deployment. No waiting.
Geographic Distribution
Test variants by region automatically:
// Cloudflare Workers automatically provides request.cf.colo
export default {
  async fetch(request: Request, env: Env) {
    const input = (await request.json()) as Record<string, unknown>;
    const colo = (request.cf?.colo as string) ?? ''; // Airport code (e.g., "SJC")

    // Route US West to variant A, everyone else to variant B
    const variant = ['SJC', 'LAX', 'SEA'].includes(colo) ? 'a' : 'b';

    const promptVersion = env[`PROMPT_VERSION_${variant.toUpperCase()}`]; // PROMPT_VERSION_A or PROMPT_VERSION_B

    return conductorClient.execute({
      ensemble: 'company-analysis',
      input: { ...input, promptVersion }
    });
  }
};
Edge Advantage: 300+ locations worldwide, no latency penalty for experimentation.
User-Based Routing
Consistent user experience with deterministic routing:
function getVariant(userId: string): 'a' | 'b' {
  // Deterministic hash ensures the same user always gets the same variant
  const hash = hashUserId(userId);
  return hash % 100 < 50 ? 'a' : 'b';
}

export default {
  async fetch(request: Request, env: Env) {
    const input = (await request.json()) as Record<string, unknown>;
    const userId = request.headers.get('x-user-id') ?? 'anonymous';
    const variant = getVariant(userId);

    return conductorClient.execute({
      ensemble: 'company-analysis',
      input: { ...input, variant }
    });
  }
};
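hashUserId is left abstract above. One minimal possibility is a 32-bit FNV-1a hash (purely illustrative; any stable string hash works):

// Illustrative 32-bit FNV-1a hash; not part of Conductor, any stable string hash works.
function hashUserId(userId: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < userId.length; i++) {
    hash ^= userId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime
  }
  return hash >>> 0; // force unsigned so hash % 100 stays non-negative
}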
Measuring Results
Built-in Quality Scoring
Conductor’s scoring system provides automatic quality metrics:
scoring:
  enabled: true
  defaultThresholds:
    minimum: 0.7
    target: 0.85

flow:
  - member: analyze
    scoring:
      evaluator: validate
      thresholds:
        minimum: 0.8
      criteria:
        accuracy: "Analysis must be factually accurate"
        completeness: "All required sections present"
Every execution emits:
- Quality score (0.0 - 1.0)
- Execution time
- Token usage / cost
- Cache hit/miss
- Retry count
Analytics Engine Integration
Log experiment data to Cloudflare Analytics Engine:
// Log A/B test results
env.ANALYTICS?.writeDataPoint({
  blobs: [
    ensembleName,
    variant,
    userId
  ],
  doubles: [
    executionTime,
    qualityScore,
    cost
  ],
  indexes: [
    variant // Fast filtering by variant
  ]
});
Query results:
SELECT
  blob2 AS variant,
  AVG(double1) AS avg_execution_time,
  AVG(double2) AS avg_quality_score,
  AVG(double3) AS avg_cost,
  COUNT(*) AS sample_size
FROM analytics
WHERE blob1 = 'company-analysis'
  AND timestamp > NOW() - INTERVAL '7' DAY
GROUP BY variant
Custom Metrics
Track business outcomes:
const result = await conductorClient.execute({
  ensemble: 'company-analysis',
  input: { domain: 'acme.com' }
});

// Log business metrics alongside Conductor's built-in ones
await logMetric({
  variant: env.VARIANT,
  qualityScore: result.metadata.scoring?.score,
  executionTime: result.executionTime,
  userConverted: await checkConversion(userId),
  revenue: await getRevenue(userId),
  timestamp: Date.now()
});
The Future: Autonomous Optimization
A/B testing is just the beginning. Here’s where Ensemble Edge is heading:
Multi-Armed Bandit
Automatically adjust traffic based on real-time results:
// Conductor learns optimal allocation
const bandit = new MultiArmedBandit({
  variants: ['v1.0.0', 'v1.2.0', 'v2.0.0'],
  metric: 'quality_score',
  explorationRate: 0.1
});

// Starts at 33/33/33, converges to the optimal allocation
const variant = await bandit.selectVariant(context);
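MultiArmedBandit is a planned API. For intuition, a self-contained epsilon-greedy sketch of the same idea (all names here are illustrative, not a Conductor API):

// Illustrative epsilon-greedy bandit; not a Conductor API.
class EpsilonGreedyBandit {
  private stats = new Map<string, { pulls: number; total: number }>();

  constructor(private variants: string[], private explorationRate = 0.1) {
    for (const v of variants) this.stats.set(v, { pulls: 0, total: 0 });
  }

  selectVariant(): string {
    // Explore with probability epsilon, otherwise exploit the best mean score so far
    if (Math.random() < this.explorationRate) {
      return this.variants[Math.floor(Math.random() * this.variants.length)];
    }
    let best = this.variants[0];
    let bestMean = -Infinity;
    for (const [variant, s] of this.stats) {
      const mean = s.pulls === 0 ? Infinity : s.total / s.pulls; // try unpulled arms first
      if (mean > bestMean) {
        bestMean = mean;
        best = variant;
      }
    }
    return best;
  }

  recordResult(variant: string, qualityScore: number): void {
    const s = this.stats.get(variant);
    if (!s) return;
    s.pulls += 1;
    s.total += qualityScore;
  }
}

Each request calls selectVariant() before executing the ensemble and recordResult() with the quality score afterwards, so traffic drifts toward the better-performing version over time.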
Coming in v1.1:
- Bayesian optimization
- Thompson sampling
- Contextual bandits (vary by user attributes)
Hyperparameter Tuning
Optimize LLM parameters automatically:
// Define search space
const searchSpace = {
  temperature: [0.3, 0.5, 0.7, 0.9],
  maxTokens: [1000, 2000, 4000],
  topP: [0.8, 0.9, 0.95, 1.0]
};

// Conductor explores combinations
const optimizer = new GridSearch({
  searchSpace,
  metric: 'quality_score',
  budget: 1000 // Max evaluations
});

await optimizer.optimize('analysis-prompt');
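GridSearch is likewise a planned API. Under the hood the idea is an exhaustive sweep; a rough self-contained sketch, with the evaluation function (for example, average quality score over a test set) left to the caller:

// Illustrative exhaustive grid search; not the planned GridSearch API.
type Params = { temperature: number; maxTokens: number; topP: number };
type SearchSpace = { temperature: number[]; maxTokens: number[]; topP: number[] };

async function gridSearch(
  space: SearchSpace,
  evaluate: (p: Params) => Promise<number>, // higher is better, e.g. mean quality score
  budget: number
): Promise<{ best: Params; score: number }> {
  let best: Params | undefined;
  let bestScore = -Infinity;
  let evaluations = 0;

  for (const temperature of space.temperature) {
    for (const maxTokens of space.maxTokens) {
      for (const topP of space.topP) {
        if (evaluations >= budget) break;
        evaluations++;
        const score = await evaluate({ temperature, maxTokens, topP });
        if (score > bestScore) {
          bestScore = score;
          best = { temperature, maxTokens, topP };
        }
      }
    }
  }

  if (!best) throw new Error('budget must allow at least one evaluation');
  return { best, score: bestScore };
}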
Coming in v1.2:
- Bayesian optimization for continuous parameters
- Early stopping for failed configurations
- Multi-objective optimization (quality + cost + speed)
Prompt Evolution
AI-generated prompt variants tested automatically:
// Conductor generates variants using meta-prompting
const promptEvolution = new PromptEvolution({
  basePrompt: extractionPrompt,
  objective: 'maximize accuracy on financial data',
  generations: 5,
  populationSize: 10,
  mutationRate: 0.2
});

// Evolves prompts overnight, selects the best performer
const optimizedPrompt = await promptEvolution.evolve();
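PromptEvolution is a roadmap item; the loop underneath is standard evolutionary search. A rough sketch of a single generation, with the LLM-backed mutate and score functions left abstract:

// Illustrative evolutionary step; not the planned PromptEvolution API.
async function evolveOneGeneration(
  population: string[],
  mutate: (prompt: string) => Promise<string>, // e.g. meta-prompt an LLM to rewrite the prompt
  score: (prompt: string) => Promise<number>,  // e.g. average quality score over a test suite
  survivors = 5
): Promise<string[]> {
  const scored = await Promise.all(
    population.map(async (prompt) => ({ prompt, score: await score(prompt) }))
  );

  // Keep the top performers, then refill the population with mutated copies of them
  const elite = scored
    .sort((a, b) => b.score - a.score)
    .slice(0, survivors)
    .map((s) => s.prompt);
  const children = await Promise.all(elite.map((p) => mutate(p)));

  return [...elite, ...children];
}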
Coming in v1.3:
- Genetic algorithms for prompt optimization
- Cross-breeding high-performing prompts
- Automatic evaluation against test suites
Edge-Native Gradient Descent
Optimize at the edge with zero central coordination:
┌─────────────────────────────────────────────────┐
│ Global Optimization │
│ │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │ SJC │ │ IAD │ │ LHR │ │ SYD │ │
│ │Edge │ │Edge │ │Edge │ │Edge │ │
│ │ │ │ │ │ │ │ │ │
│ │Test │ │Test │ │Test │ │Test │ │
│ │Local │ │Local │ │Local │ │Local │ │
│ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ │
│ │ │ │ │ │
│ └─────────┴─────────┴─────────┘ │
│ Durable Objects │
│ Aggregate & Optimize │
└─────────────────────────────────────────────────┘
Each edge location:
- Tests configurations locally
- Reports results to Durable Object
- Receives updated optimal config
Zero latency penalty. Millions of experiments per day.
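A rough sketch of the aggregation side, as a Durable Object that collects per-variant results from edge locations and serves the current best configuration (illustrative only; the class, routes, and field names are assumptions, not shipping Conductor code):

// Illustrative aggregator; edge Workers POST results here and GET the current best variant.
type VariantStats = Record<string, { n: number; total: number }>;

export class ExperimentAggregator {
  constructor(private state: DurableObjectState) {}

  async fetch(request: Request): Promise<Response> {
    const url = new URL(request.url);
    const stats = (await this.state.storage.get<VariantStats>('stats')) ?? {};

    if (request.method === 'POST' && url.pathname === '/report') {
      // Each edge location reports { variant, qualityScore } after a local test
      const { variant, qualityScore } = (await request.json()) as { variant: string; qualityScore: number };
      const s = stats[variant] ?? { n: 0, total: 0 };
      stats[variant] = { n: s.n + 1, total: s.total + qualityScore };
      await this.state.storage.put('stats', stats);
      return new Response('ok');
    }

    // GET returns the variant with the highest mean quality score so far
    const best = Object.entries(stats)
      .map(([variant, s]) => ({ variant, mean: s.total / s.n }))
      .sort((a, b) => b.mean - a.mean)[0];
    return Response.json(best ?? { variant: null });
  }
}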
Composite Optimization
Optimize the entire stack simultaneously:
# Conductor optimizes across:
optimization:
  dimensions:
    - prompt_version     # Which prompt
    - model              # Which LLM
    - temperature        # LLM parameter
    - caching_strategy   # Cache TTL
    - parallel_degree    # Degree of parallelization
    - state_management   # Shared state vs prop drilling
    - retry_strategy     # Exponential vs linear backoff
  objectives:
    - maximize: quality_score
    - minimize: cost
    - minimize: latency
  constraints:
    - quality_score >= 0.85
    - latency <= 2000  # ms
    - cost <= 0.10     # per request
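The simplest reading of this config is: discard configurations that violate the constraints, then rank the survivors by the objectives. A minimal illustrative sketch (not the planned v2.0 optimizer):

// Illustrative constraint filter and ranking over candidate configurations.
type Candidate = { qualityScore: number; cost: number; latency: number };

function pickFeasibleBest(candidates: Candidate[]): Candidate | undefined {
  return candidates
    .filter((c) => c.qualityScore >= 0.85 && c.latency <= 2000 && c.cost <= 0.10)
    .sort((a, b) =>
      b.qualityScore - a.qualityScore || // maximize quality first
      a.cost - b.cost ||                 // then minimize cost
      a.latency - b.latency              // then minimize latency
    )[0];
}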
Coming in v2.0:
- Pareto frontier optimization
- Multi-objective evolutionary algorithms
- Automatic constraint satisfaction
Why Edge Makes This Possible
Traditional A/B testing platforms face fundamental limitations:
- Centralized coordination - Single point of failure
- Deployment latency - Minutes to hours to roll out variants
- Geographic bias - US-centric testing affects global users differently
- Scale limits - Expensive to test millions of combinations
- Manual analysis - Humans interpret results and make decisions
✅ Ensemble Edge
- Distributed execution - 300+ locations worldwide
- Instant deployment - < 50ms global rollout via KV
- Geographic fairness - Each region tests independently
- Unlimited scale - Edge workers handle millions of experiments
- Autonomous optimization - AI finds optimal configurations automatically
Best Practices
1. Start Simple
Begin with single-dimension tests:
# Test one thing at a time
edgit deploy set extraction-prompt v1.0.0 --to prod-a
edgit deploy set extraction-prompt v1.1.0 --to prod-b
# Once confident, expand
2. Use Quality Scoring
Let Conductor measure quality automatically:
scoring:
  enabled: true
  defaultThresholds:
    minimum: 0.8 # Auto-fail below this
3. Monitor Business Metrics
Don’t optimize for proxy metrics:
// ✅ Good - Track revenue
await logMetric({ variant, revenue });
// ❌ Bad - Optimize engagement without checking conversion
await logMetric({ variant, clicks });
4. Set Sample Size Requirements
Ensure statistical significance:
// Require a minimum sample size before declaring a winner
const MIN_SAMPLES = 1000;
const CONFIDENCE = 0.95;

if (variantA.samples >= MIN_SAMPLES &&
    variantB.samples >= MIN_SAMPLES) {
  const pValue = tTest(variantA, variantB);
  if (pValue < (1 - CONFIDENCE)) {
    // Statistically significant difference
    promoteWinner(variantA.score > variantB.score ? 'a' : 'b');
  }
}
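tTest and promoteWinner above are left abstract. With 1000+ samples per variant a normal approximation is reasonable; a minimal two-sample z-test sketch (illustrative, and it assumes each variant tracks sample count, mean score, and variance):

// Illustrative two-sample z-test on mean quality scores (normal approximation, fine for n >= 1000).
type VariantSummary = { samples: number; score: number; variance: number }; // score = mean quality

function tTest(a: VariantSummary, b: VariantSummary): number {
  const se = Math.sqrt(a.variance / a.samples + b.variance / b.samples);
  const z = Math.abs(a.score - b.score) / se;

  // Two-sided p-value via a standard-normal tail approximation (Abramowitz & Stegun 26.2.17)
  const t = 1 / (1 + 0.2316419 * z);
  const density = Math.exp(-z * z / 2) / Math.sqrt(2 * Math.PI);
  const upperTail =
    density * t * (0.319381530 + t * (-0.356563782 + t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
  return 2 * upperTail;
}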
5. Use Gradual Rollout
Start with small traffic percentage:
// Week 1: 5% to variant B
if (Math.random() < 0.05) variant = 'b';
// Week 2: 25% to variant B (if successful)
if (Math.random() < 0.25) variant = 'b';
// Week 3: 100% to variant B (promote to default)
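Math.random() works for a quick percentage split, but it re-buckets users on every request. To keep the per-user consistency from User-Based Routing, the rollout percentage can reuse the same deterministic hash (here, the illustrative hashUserId sketched earlier):

// Deterministic gradual rollout: a user stays in the same bucket as the percentage grows.
function inRollout(userId: string, percent: number): boolean {
  return hashUserId(userId) % 100 < percent;
}

const variant = inRollout(userId, 5) ? 'b' : 'a'; // Week 1: raise 5 -> 25 -> 100 over time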
Roadmap
v1.1 (Q2 2025)
- ✅ Multi-armed bandit support
- ✅ Bayesian optimization
- ✅ Automatic traffic allocation
- ✅ Statistical significance testing
v1.2 (Q3 2025)
- ✅ Hyperparameter grid search
- ✅ Multi-objective optimization
- ✅ Contextual bandits
- ✅ Automatic rollback on quality degradation
v1.3 (Q4 2025)
- ✅ Prompt evolution with genetic algorithms
- ✅ Meta-prompting for variant generation
- ✅ Composite optimization across full stack
v2.0 (2026)
- ✅ Autonomous edge-native gradient descent
- ✅ Zero-latency optimization at 300+ locations
- ✅ Pareto frontier multi-objective optimization
- ✅ Learned optimization strategies per ensemble
Why This Matters
Traditional A/B testing: “Let’s test prompt A vs prompt B for 2 weeks, analyze results, pick winner, deploy, repeat.”
Ensemble Edge vision: “Deploy 50 variants globally, let edge locations test locally, AI finds optimal configuration in 48 hours, automatically promotes winner, continues optimizing forever.”
The difference: 10x faster iteration, 100x more experiments, 0x human coordination overhead.
A/B testing is not a feature—it's a fundamental architectural capability of Ensemble Edge. By combining Git-native versioning with edge execution, we're building toward optimization at a scale impossible with traditional platforms. This is just the beginning.