What is Routing?
Routing determines how Conductor connects to AI providers (OpenAI, Anthropic, Workers AI, Groq). The right routing mode can dramatically improve performance, reduce costs, and increase reliability.

Conductor supports three routing modes:

- `cloudflare` - Platform-native Workers AI with ultra-low latency
- `cloudflare-gateway` - AI Gateway with caching, analytics, and cost controls
- `direct` - Direct API calls to OpenAI, Anthropic, Groq, etc.
Three Routing Modes
1. Cloudflare (Platform-Native)
For: Workers AI models running on Cloudflare's network

- ⚡ Ultra-fast - Sub-10ms latency to model
- 💰 Cost-effective - Cloudflare’s pricing (often free tier)
- 🔐 No API keys - Uses Workers AI binding
- 🌍 Edge execution - Runs closest to your users
- 📦 Smaller models - 7B-70B parameter range
Use when:
- Latency is critical (< 50ms cold start)
- Cost optimization for high-volume workloads
- Simple tasks (summarization, classification, extraction)
- No external API key management desired
Limitations:
- Only Workers AI models available
- Smaller context windows than GPT-4/Claude
- Less sophisticated reasoning for complex tasks
2. Cloudflare Gateway (Recommended)
For: OpenAI, Anthropic, Groq through AI Gateway

- 🗄️ Persistent cache - Cache spans deployments and users
- 📊 Real-time analytics - Dashboard for costs, latency, errors
- 💵 Cost controls - Set spending limits and rate limits
- 🔄 Retry logic - Automatic retries with exponential backoff
- 🌐 Multi-provider - OpenAI, Anthropic, Groq, etc.
Use when:
- Always for production AI calls (unless using Workers AI)
- Need caching across requests and deployments
- Want visibility into AI spending and usage
- Multiple environments (dev, staging, prod)
- A/B testing different models or prompts
Persistent Caching
Cache survives deployments. First user pays, everyone else benefits. Can reduce costs by 90%+ for repeated queries.
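How long responses stay cached can be tuned per request. A minimal sketch, assuming `<ACCOUNT_ID>` and `<GATEWAY_ID>` placeholders and an `OPENAI_API_KEY` Worker secret, using AI Gateway's `cf-aig-cache-ttl` header:

```ts
// Placeholders: replace <ACCOUNT_ID> and <GATEWAY_ID> with your own values.
const GATEWAY = "https://gateway.ai.cloudflare.com/v1/<ACCOUNT_ID>/<GATEWAY_ID>";

export async function cachedCompletion(env: { OPENAI_API_KEY: string }, prompt: string) {
  const res = await fetch(`${GATEWAY}/openai/chat/completions`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
      // AI Gateway caches the response for this many seconds; identical
      // requests within the window are served from the edge cache.
      "cf-aig-cache-ttl": "3600",
    },
    body: JSON.stringify({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: prompt }],
    }),
  });
  return res.json();
}
```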
Analytics Dashboard
Real-time metrics on:
- Cache hit rates
- Request volume
- Cost per model
- Latency percentiles
- Error rates by provider
Cost Controls
Set hard limits:
- Max spend per day/month
- Rate limits per user
- Alert thresholds
- Budget allocation by environment
A/B Testing
Split traffic between:
- Different models (GPT-4 vs Claude)
- Different prompts
- Different temperatures

Then track which variant performs better.
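As a sketch of what a split might look like in Worker code (the `callGateway` helper is hypothetical, standing in for the gateway calls shown in the configuration examples below):

```ts
// Hypothetical helper standing in for a gateway chat-completion call.
declare function callGateway(model: string, prompt: string, temperature: number): Promise<string>;

const variants = [
  { model: "gpt-4o", temperature: 0.2 },
  { model: "claude-3-5-sonnet-20241022", temperature: 0.2 },
];

export async function abTest(prompt: string): Promise<string> {
  // 50/50 split; log the chosen variant so results can be compared offline.
  const variant = variants[Math.random() < 0.5 ? 0 : 1];
  console.log(JSON.stringify({ abVariant: variant.model }));
  return callGateway(variant.model, prompt, variant.temperature);
}
```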
3. Direct
For: Direct API calls bypassing AI Gateway

- 🎯 Direct connection - No intermediary
- 🆕 Latest features - Provider-specific capabilities
- 🔧 Full control - All provider parameters available
- ❌ No gateway benefits - No cache, analytics, or limits
Use when:
- Testing new provider features not yet in gateway
- Provider-specific parameters needed
- Debugging provider-specific issues
- Very low request volume (caching not beneficial)
Trade-offs:
- Miss out on persistent caching
- No analytics or cost controls
- Manual retry logic needed
- Direct API keys required
Configuration Examples
Workers AI (Platform-Native)
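A minimal sketch of the underlying call the `cloudflare` mode runs on: the Workers AI binding. The binding name `AI` is whatever you declare in wrangler.toml; the model ID is one of Workers AI's published models.

```ts
// wrangler.toml (assumed):
//   [ai]
//   binding = "AI"
export interface Env {
  AI: Ai; // Workers AI binding type from @cloudflare/workers-types
}

export default {
  async fetch(_req: Request, env: Env): Promise<Response> {
    // Runs on Cloudflare's network; no external API key needed.
    const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      prompt: "Classify the sentiment of: 'Great product, slow shipping.'",
    });
    return Response.json(result);
  },
};
```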
OpenAI via Gateway
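Gateway requests use Cloudflare's documented URL pattern with your account and gateway IDs. A sketch (`<ACCOUNT_ID>`/`<GATEWAY_ID>` are placeholders, `OPENAI_API_KEY` is an assumed Worker secret, and `askOpenAI` is a name chosen here for illustration):

```ts
const GATEWAY = "https://gateway.ai.cloudflare.com/v1/<ACCOUNT_ID>/<GATEWAY_ID>";

export async function askOpenAI(env: { OPENAI_API_KEY: string }, prompt: string) {
  const res = await fetch(`${GATEWAY}/openai/chat/completions`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: prompt }],
    }),
  });
  return res.json();
}
```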
Anthropic via Gateway
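The same gateway fronts Anthropic; the path segment switches to `anthropic` and the headers follow Anthropic's API (sketch, same placeholder caveats as above):

```ts
const GATEWAY = "https://gateway.ai.cloudflare.com/v1/<ACCOUNT_ID>/<GATEWAY_ID>";

export async function askClaude(env: { ANTHROPIC_API_KEY: string }, prompt: string) {
  const res = await fetch(`${GATEWAY}/anthropic/v1/messages`, {
    method: "POST",
    headers: {
      "x-api-key": env.ANTHROPIC_API_KEY,
      "anthropic-version": "2023-06-01", // required by Anthropic's API
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "claude-3-5-sonnet-20241022",
      max_tokens: 1024, // required parameter for the Messages API
      messages: [{ role: "user", content: prompt }],
    }),
  });
  return res.json();
}
```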
Groq via Gateway
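Groq follows the same pattern with the `groq` path segment (sketch; `GROQ_API_KEY` is an assumed Worker secret and the model ID is one of Groq's hosted Llama models):

```ts
const GATEWAY = "https://gateway.ai.cloudflare.com/v1/<ACCOUNT_ID>/<GATEWAY_ID>";

export async function askGroq(env: { GROQ_API_KEY: string }, prompt: string) {
  const res = await fetch(`${GATEWAY}/groq/chat/completions`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${env.GROQ_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "llama-3.1-70b-versatile",
      messages: [{ role: "user", content: prompt }],
    }),
  });
  return res.json();
}
```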
Direct OpenAI (No Gateway)
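Direct mode hits the provider API with no gateway in between, so nothing is cached and you manage the key yourself (sketch, same secret assumption):

```ts
export async function askOpenAIDirect(env: { OPENAI_API_KEY: string }, prompt: string) {
  // No gateway: the request goes straight to OpenAI.
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4o",
      messages: [{ role: "user", content: prompt }],
    }),
  });
  return res.json();
}
```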
Multi-Model Strategies
Cascade Pattern
Start with a fast, cheap model and escalate to a powerful model if needed:
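A sketch of the idea (the helper functions and the length heuristic are illustrative, not Conductor APIs):

```ts
// Hypothetical helpers standing in for the provider calls sketched above.
declare function runWorkersAI(prompt: string): Promise<string>;
declare function runGpt4o(prompt: string): Promise<string>;

// Cascade: answer with the cheap edge model first; escalate only when the
// cheap answer looks weak (this length check is purely illustrative).
export async function cascade(prompt: string): Promise<string> {
  const draft = await runWorkersAI(prompt);
  if (draft.trim().length >= 20) return draft;
  return runGpt4o(prompt); // escalate to the more capable model
}
```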
Load Balancing
Distribute requests across providers:
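For example, a simple round-robin rotation (helper names are hypothetical):

```ts
declare function runOpenAI(prompt: string): Promise<string>;
declare function runAnthropic(prompt: string): Promise<string>;
declare function runGroq(prompt: string): Promise<string>;

const providers = [runOpenAI, runAnthropic, runGroq];
let next = 0; // per-isolate counter; resets on cold start

export async function balanced(prompt: string): Promise<string> {
  const provider = providers[next];
  next = (next + 1) % providers.length; // rotate for the next request
  return provider(prompt);
}
```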
Fallback Pattern
Try the primary provider, falling back to a secondary if it is unavailable:
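A sketch (helper names hypothetical):

```ts
declare function runPrimary(prompt: string): Promise<string>;
declare function runSecondary(prompt: string): Promise<string>;

// Fallback: if the primary provider errors (outage, 429, timeout),
// retry the same prompt against the secondary provider.
export async function withFallback(prompt: string): Promise<string> {
  try {
    return await runPrimary(prompt);
  } catch (err) {
    console.warn("primary provider failed, falling back", err);
    return runSecondary(prompt);
  }
}
```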
Cost Optimization
Use cheaper models for simple tasks and expensive models for complex ones:
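A sketch of routing by task type (the task categories echo the "simple tasks" list earlier on this page; helper names are hypothetical):

```ts
declare function runCheap(prompt: string): Promise<string>;   // e.g. gpt-4o-mini or Workers AI
declare function runPremium(prompt: string): Promise<string>; // e.g. gpt-4o or Claude

type Task = { kind: "classify" | "summarize" | "extract" | "reason"; prompt: string };

// Simple tasks go to the cheap model; open-ended reasoning goes premium.
export async function routeByCost(task: Task): Promise<string> {
  const simple = task.kind !== "reason";
  return simple ? runCheap(task.prompt) : runPremium(task.prompt);
}
```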
Performance Comparison
Latency
| Routing Mode | Cold Start | Warm | Cache Hit |
|---|---|---|---|
| cloudflare | < 50ms | < 10ms | < 5ms |
| cloudflare-gateway | < 100ms | 500-2000ms | < 5ms |
| direct | < 100ms | 500-2000ms | N/A |
Cost (per 1K tokens)
| Provider | Model | Approx Cost |
|---|---|---|
| Workers AI | llama-3.1-8b | $0.001 |
| OpenAI | gpt-4o | $0.005 |
| OpenAI | gpt-4o-mini | $0.0002 |
| Anthropic | claude-3.5-sonnet | $0.003 |
| Groq | llama-3.1-70b | $0.0008 |
With gateway caching (e.g., a 90% cache hit rate):
- Effective cost: ~10% of the base cost
- Example: gpt-4o drops from $0.005 to an effective $0.0005 per 1K tokens
Reliability
| Routing Mode | Caching | Retry | Analytics | Rate Limiting |
|---|---|---|---|---|
| cloudflare | KV only | Manual | Basic | Platform |
| cloudflare-gateway | Persistent | Automatic | Full | Configurable |
| direct | None | Manual | None | Provider |
Best Practices
1. Default to AI Gateway: route production calls through `cloudflare-gateway` unless you are on Workers AI.
2. Use Workers AI for simple tasks: summarization, classification, and extraction run well on smaller edge models.
3. Configure an appropriate cache TTL: longer TTLs raise hit rates for stable prompts; shorter TTLs keep answers fresh.
4. Monitor gateway analytics: watch cache hit rates, cost per model, and error rates to catch regressions early.
5. Use environment-specific routing: separate modes or gateways for dev, staging, and prod, as in the sketch below.
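For example, a sketch of environment-based mode selection (the `ENVIRONMENT` var and `Env` shape are assumptions; set the var per environment in wrangler.toml):

```ts
type RoutingMode = "cloudflare" | "cloudflare-gateway" | "direct";

interface Env {
  ENVIRONMENT: "dev" | "staging" | "prod"; // assumed wrangler.toml [vars] entry
}

export function routingFor(env: Env): RoutingMode {
  // Dev stays on cheap, keyless Workers AI; staging and prod go through
  // the gateway for caching, analytics, and spend limits.
  return env.ENVIRONMENT === "dev" ? "cloudflare" : "cloudflare-gateway";
}
```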
Troubleshooting
Gateway Not Caching
Symptom: Every request hits the provider API

Causes:
- Gateway not configured in wrangler.toml
- Cache TTL set to 0
- Requests have unique parameters (temperature varies)
- Using streaming (not cacheable)
High Latency
Symptom: Requests taking > 5s

Causes:
- Using direct routing (no edge optimization)
- Cold start + large model
- No caching enabled
- Provider issues
Rate Limit Errors
Symptom: 429 errors from the provider

Causes:
- Exceeding provider limits
- No rate limiting in gateway
- Burst traffic
Cost Overruns
Symptom: Higher than expected AI costs

Causes:
- Low cache hit rate
- Using expensive models unnecessarily
- No spending limits

