What is Routing?
Routing determines how Conductor connects to AI providers (OpenAI, Anthropic, Workers AI, Groq). The right routing mode can dramatically improve performance, reduce costs, and increase reliability.

Conductor supports three routing modes:

- `cloudflare` - Platform-native Workers AI with ultra-low latency
- `cloudflare-gateway` - AI Gateway with caching, analytics, and cost controls
- `direct` - Direct API calls to OpenAI, Anthropic, Groq, etc.
Three Routing Modes
1. Cloudflare (Platform-Native)
For: Workers AI models running on Cloudflare's network

- ⚡ Ultra-fast - Sub-10ms latency to model
- 💰 Cost-effective - Cloudflare’s pricing (often free tier)
- 🔐 No API keys - Uses Workers AI binding
- 🌍 Edge execution - Runs closest to your users
- 📦 Smaller models - 7B-70B parameter range
Use when:
- Latency is critical (< 50ms cold start)
- Cost optimization for high-volume workloads
- Simple tasks (summarization, classification, extraction)
- No external API key management desired
Limitations:
- Only Workers AI models available
- Smaller context windows than GPT-4/Claude
- Less sophisticated reasoning for complex tasks
2. Cloudflare Gateway (Recommended)
For: OpenAI, Anthropic, Groq through AI Gateway

- 🗄️ Persistent cache - Cache spans deployments and users
- 📊 Real-time analytics - Dashboard for costs, latency, errors
- 💵 Cost controls - Set spending limits and rate limits
- 🔄 Retry logic - Automatic retries with exponential backoff
- 🌐 Multi-provider - OpenAI, Anthropic, Groq, etc.
Use when:
- Always for production AI calls (unless using Workers AI)
- Need caching across requests and deployments
- Want visibility into AI spending and usage
- Multiple environments (dev, staging, prod)
- A/B testing different models or prompts
Persistent Caching
Cache survives deployments. First user pays, everyone else benefits. Can reduce costs by 90%+ for repeated queries.
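How long responses stay cached can be tuned per request. A minimal sketch, assuming `<ACCOUNT_ID>` and `<GATEWAY_ID>` placeholders and an `OPENAI_API_KEY` Worker secret, using AI Gateway's `cf-aig-cache-ttl` header:

```ts
// Placeholders: replace <ACCOUNT_ID> and <GATEWAY_ID> with your own values.
const GATEWAY = "https://gateway.ai.cloudflare.com/v1/<ACCOUNT_ID>/<GATEWAY_ID>";

export async function cachedCompletion(env: { OPENAI_API_KEY: string }, prompt: string) {
  const res = await fetch(`${GATEWAY}/openai/chat/completions`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
      // AI Gateway caches the response for this many seconds; identical
      // requests within the window are served from the edge cache.
      "cf-aig-cache-ttl": "3600",
    },
    body: JSON.stringify({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: prompt }],
    }),
  });
  return res.json();
}
```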
Analytics Dashboard
Real-time metrics on:
- Cache hit rates
- Request volume
- Cost per model
- Latency percentiles
- Error rates by provider
Cost Controls
Set hard limits:
- Max spend per day/month
- Rate limits per user
- Alert thresholds
- Budget allocation by environment
A/B Testing
Split traffic between:
- Different models (GPT-4 vs Claude)
- Different prompts
- Different temperatures

Then track which variant performs better.
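As a sketch of what a split might look like in Worker code (the `callGateway` helper is hypothetical, standing in for the gateway calls shown in the configuration examples below):

```ts
// Hypothetical helper standing in for a gateway chat-completion call.
declare function callGateway(model: string, prompt: string, temperature: number): Promise<string>;

const variants = [
  { model: "gpt-4o", temperature: 0.2 },
  { model: "claude-3-5-sonnet-20241022", temperature: 0.2 },
];

export async function abTest(prompt: string): Promise<string> {
  // 50/50 split; log the chosen variant so results can be compared offline.
  const variant = variants[Math.random() < 0.5 ? 0 : 1];
  console.log(JSON.stringify({ abVariant: variant.model }));
  return callGateway(variant.model, prompt, variant.temperature);
}
```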
3. Direct
For: Direct API calls bypassing AI Gateway

- 🎯 Direct connection - No intermediary
- 🆕 Latest features - Provider-specific capabilities
- 🔧 Full control - All provider parameters available
- ❌ No gateway benefits - No cache, analytics, or limits
Use when:
- Testing new provider features not yet in gateway
- Provider-specific parameters needed
- Debugging provider-specific issues
- Very low request volume (caching not beneficial)
Trade-offs:
- Miss out on persistent caching
- No analytics or cost controls
- Manual retry logic needed
- Direct API keys required
Configuration Examples
Workers AI (Platform-Native)
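A minimal sketch of the underlying call the `cloudflare` mode runs on: the Workers AI binding. The binding name `AI` is whatever you declare in wrangler.toml; the model ID is one of Workers AI's published models.

```ts
// wrangler.toml (assumed):
//   [ai]
//   binding = "AI"
export interface Env {
  AI: Ai; // Workers AI binding type from @cloudflare/workers-types
}

export default {
  async fetch(_req: Request, env: Env): Promise<Response> {
    // Runs on Cloudflare's network; no external API key needed.
    const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      prompt: "Classify the sentiment of: 'Great product, slow shipping.'",
    });
    return Response.json(result);
  },
};
```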
OpenAI via Gateway
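Gateway requests use Cloudflare's documented URL pattern with your account and gateway IDs. A sketch (`<ACCOUNT_ID>`/`<GATEWAY_ID>` are placeholders, `OPENAI_API_KEY` is an assumed Worker secret, and `askOpenAI` is a name chosen here for illustration):

```ts
const GATEWAY = "https://gateway.ai.cloudflare.com/v1/<ACCOUNT_ID>/<GATEWAY_ID>";

export async function askOpenAI(env: { OPENAI_API_KEY: string }, prompt: string) {
  const res = await fetch(`${GATEWAY}/openai/chat/completions`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: prompt }],
    }),
  });
  return res.json();
}
```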
Anthropic via Gateway
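The same gateway fronts Anthropic; the path segment switches to `anthropic` and the headers follow Anthropic's API (sketch, same placeholder caveats as above):

```ts
const GATEWAY = "https://gateway.ai.cloudflare.com/v1/<ACCOUNT_ID>/<GATEWAY_ID>";

export async function askClaude(env: { ANTHROPIC_API_KEY: string }, prompt: string) {
  const res = await fetch(`${GATEWAY}/anthropic/v1/messages`, {
    method: "POST",
    headers: {
      "x-api-key": env.ANTHROPIC_API_KEY,
      "anthropic-version": "2023-06-01", // required by Anthropic's API
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "claude-3-5-sonnet-20241022",
      max_tokens: 1024, // required parameter for the Messages API
      messages: [{ role: "user", content: prompt }],
    }),
  });
  return res.json();
}
```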
Groq via Gateway
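Groq follows the same pattern with the `groq` path segment (sketch; `GROQ_API_KEY` is an assumed Worker secret and the model ID is one of Groq's hosted Llama models):

```ts
const GATEWAY = "https://gateway.ai.cloudflare.com/v1/<ACCOUNT_ID>/<GATEWAY_ID>";

export async function askGroq(env: { GROQ_API_KEY: string }, prompt: string) {
  const res = await fetch(`${GATEWAY}/groq/chat/completions`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${env.GROQ_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "llama-3.1-70b-versatile",
      messages: [{ role: "user", content: prompt }],
    }),
  });
  return res.json();
}
```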
Direct OpenAI (No Gateway)
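Direct mode hits the provider API with no gateway in between, so nothing is cached and you manage the key yourself (sketch, same secret assumption):

```ts
export async function askOpenAIDirect(env: { OPENAI_API_KEY: string }, prompt: string) {
  // No gateway: the request goes straight to OpenAI.
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4o",
      messages: [{ role: "user", content: prompt }],
    }),
  });
  return res.json();
}
```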
Multi-Model Strategies
Cascade Pattern
Start with a fast, cheap model and escalate to a powerful model if needed:
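A sketch of the idea (the helper functions and the length heuristic are illustrative, not Conductor APIs):

```ts
// Hypothetical helpers standing in for the provider calls sketched above.
declare function runWorkersAI(prompt: string): Promise<string>;
declare function runGpt4o(prompt: string): Promise<string>;

// Cascade: answer with the cheap edge model first; escalate only when the
// cheap answer looks weak (this length check is purely illustrative).
export async function cascade(prompt: string): Promise<string> {
  const draft = await runWorkersAI(prompt);
  if (draft.trim().length >= 20) return draft;
  return runGpt4o(prompt); // escalate to the more capable model
}
```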
Load Balancing
Distribute requests across providers:
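For example, a simple round-robin rotation (helper names are hypothetical):

```ts
declare function runOpenAI(prompt: string): Promise<string>;
declare function runAnthropic(prompt: string): Promise<string>;
declare function runGroq(prompt: string): Promise<string>;

const providers = [runOpenAI, runAnthropic, runGroq];
let next = 0; // per-isolate counter; resets on cold start

export async function balanced(prompt: string): Promise<string> {
  const provider = providers[next];
  next = (next + 1) % providers.length; // rotate for the next request
  return provider(prompt);
}
```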
Fallback Pattern
Try the primary provider, falling back to a secondary if it is unavailable:
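A sketch (helper names hypothetical):

```ts
declare function runPrimary(prompt: string): Promise<string>;
declare function runSecondary(prompt: string): Promise<string>;

// Fallback: if the primary provider errors (outage, 429, timeout),
// retry the same prompt against the secondary provider.
export async function withFallback(prompt: string): Promise<string> {
  try {
    return await runPrimary(prompt);
  } catch (err) {
    console.warn("primary provider failed, falling back", err);
    return runSecondary(prompt);
  }
}
```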
Cost Optimization
Use cheaper models for simple tasks and expensive models for complex ones:
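A sketch of routing by task type (the task categories echo the "simple tasks" list earlier on this page; helper names are hypothetical):

```ts
declare function runCheap(prompt: string): Promise<string>;   // e.g. gpt-4o-mini or Workers AI
declare function runPremium(prompt: string): Promise<string>; // e.g. gpt-4o or Claude

type Task = { kind: "classify" | "summarize" | "extract" | "reason"; prompt: string };

// Simple tasks go to the cheap model; open-ended reasoning goes premium.
export async function routeByCost(task: Task): Promise<string> {
  const simple = task.kind !== "reason";
  return simple ? runCheap(task.prompt) : runPremium(task.prompt);
}
```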
Performance Comparison
Latency
| Routing Mode | Cold Start | Warm | Cache Hit |
|---|---|---|---|
| cloudflare | < 50ms | < 10ms | < 5ms |
| cloudflare-gateway | < 100ms | 500-2000ms | < 5ms |
| direct | < 100ms | 500-2000ms | N/A |
Cost (per 1K tokens)
| Provider | Model | Approx Cost |
|---|---|---|
| Workers AI | llama-3.1-8b | $0.001 |
| OpenAI | gpt-4o | $0.005 |
| OpenAI | gpt-4o-mini | $0.0002 |
| Anthropic | claude-3.5-sonnet | $0.003 |
| Groq | llama-3.1-70b | $0.0008 |
With gateway caching (e.g., a 90% cache hit rate):
- Effective cost: ~10% of the base cost
- Example: gpt-4o drops from $0.005 to an effective $0.0005 per 1K tokens
Reliability
| Routing Mode | Caching | Retry | Analytics | Rate Limiting |
|---|---|---|---|---|
| cloudflare | KV only | Manual | Basic | Platform |
| cloudflare-gateway | Persistent | Automatic | Full | Configurable |
| direct | None | Manual | None | Provider |
Best Practices
1. Default to AI Gateway: route production calls through `cloudflare-gateway` unless you are on Workers AI.
2. Use Workers AI for simple tasks: summarization, classification, and extraction run well on smaller edge models.
3. Configure an appropriate cache TTL: longer TTLs raise hit rates for stable prompts; shorter TTLs keep answers fresh.
4. Monitor gateway analytics: watch cache hit rates, cost per model, and error rates to catch regressions early.
5. Use environment-specific routing: separate modes or gateways for dev, staging, and prod, as in the sketch below.
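For example, a sketch of environment-based mode selection (the `ENVIRONMENT` var and `Env` shape are assumptions; set the var per environment in wrangler.toml):

```ts
type RoutingMode = "cloudflare" | "cloudflare-gateway" | "direct";

interface Env {
  ENVIRONMENT: "dev" | "staging" | "prod"; // assumed wrangler.toml [vars] entry
}

export function routingFor(env: Env): RoutingMode {
  // Dev stays on cheap, keyless Workers AI; staging and prod go through
  // the gateway for caching, analytics, and spend limits.
  return env.ENVIRONMENT === "dev" ? "cloudflare" : "cloudflare-gateway";
}
```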
Troubleshooting
Gateway Not Caching
Symptom: Every request hits the provider API

Causes:
- Gateway not configured in wrangler.toml
- Cache TTL set to 0
- Requests have unique parameters (temperature varies)
- Using streaming (not cacheable)
High Latency
Symptom: Requests taking > 5s

Causes:
- Using direct routing (no edge optimization)
- Cold start + large model
- No caching enabled
- Provider issues
Rate Limit Errors
Symptom: 429 errors from the provider

Causes:
- Exceeding provider limits
- No rate limiting in gateway
- Burst traffic
Cost Overruns
Symptom: Higher than expected AI costs

Causes:
- Low cache hit rate
- Using expensive models unnecessarily
- No spending limits

