## The problem
Most teams integrate a single LLM provider, wrap it in an API endpoint, and call it done. This works until it doesn't — and "doesn't" usually means a Thursday afternoon when OpenAI's API starts returning 503s, your latency triples, and your product team discovers that your entire chat feature has a single point of failure.
At Ruby Labs, we needed something fundamentally different. Our AI chat platform serves millions of users across multiple products, each with different latency requirements, cost constraints, and model preferences. A single-provider architecture was never going to work.
## Architecture overview
The system is built around three core abstractions:
- Provider Registry — a typed catalog of every available model and its capabilities
- Router — decides which provider handles each request based on cost, latency, and availability
- Failover Manager — detects degradation and reroutes traffic in real time
```typescript
interface ProviderConfig {
  id: string;
  models: ModelCapability[];
  regions: Region[];
  costPer1kTokens: { input: number; output: number };
  avgLatencyMs: number;
  maxConcurrent: number;
}

class ProviderRegistry {
  private providers = new Map<string, ProviderConfig>();

  register(config: ProviderConfig): void {
    this.providers.set(config.id, config);
  }

  getAvailable(capability: string, region: Region): ProviderConfig[] {
    return [...this.providers.values()]
      .filter(p => p.models.some(m => m.capability === capability))
      .filter(p => p.regions.includes(region))
      .sort((a, b) => a.avgLatencyMs - b.avgLatencyMs);
  }
}
```

## Provider abstraction
Every provider implements a common interface. This sounds obvious, but the devil is in the details — streaming behavior, token counting, rate limiting, and error shapes all differ wildly between providers.
```typescript
interface AIProvider {
  readonly id: string;
  chat(params: ChatParams): Promise<ChatResponse>;
  stream(params: ChatParams): AsyncIterable<StreamChunk>;
  countTokens(messages: Message[]): number;
  healthCheck(): Promise<HealthStatus>;
}

type ChatParams = {
  messages: Message[];
  model: string;
  temperature?: number;
  maxTokens?: number;
  signal?: AbortSignal;
};
```

We normalize everything at the adapter level. Each provider adapter handles its own quirks — Anthropic's message format, OpenAI's function calling schema, Cohere's chat vs. generate distinction — and exposes a clean, unified API upstream.
## Token counting
Token counting is provider-specific. We maintain a local tokenizer cache per model family to avoid round-trips:
```typescript
const tokenizerCache = new Map<string, Tokenizer>();

function getTokenizer(modelFamily: string): Tokenizer {
  if (!tokenizerCache.has(modelFamily)) {
    tokenizerCache.set(modelFamily, loadTokenizer(modelFamily));
  }
  return tokenizerCache.get(modelFamily)!;
}
```

## Failover strategy
Failover is the hardest part to get right. You need to detect degradation before users notice, reroute traffic smoothly, and recover gracefully when the original provider comes back.
Our failover system uses three signals:
- Error rate — EWMA over 30-second windows
- Latency percentiles — p95 latency compared to baseline
- Health checks — active probing every 10 seconds
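The error-rate signal leans on an `EWMA` helper that the failover code below uses but doesn't define. A minimal sketch, assuming an exponential moving average seeded by the first sample, where `alpha` controls how quickly old outcomes are forgotten:

```typescript
// Exponentially weighted moving average over a stream of samples.
// With alpha = 0.2, each new sample contributes 20% of the new value.
class EWMA {
  private current: number | null = null;

  constructor(private readonly alpha: number) {}

  add(sample: number): void {
    this.current = this.current === null
      ? sample // first sample seeds the average
      : this.alpha * sample + (1 - this.alpha) * this.current;
  }

  get value(): number {
    return this.current ?? 0;
  }
}
```

Feeding it 0 for success and 1 for failure, as the failover manager does, makes `value` a smoothed error rate in [0, 1] that can be compared directly against the 15% threshold.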
```typescript
class FailoverManager {
  private errorRates = new Map<string, EWMA>();
  private readonly THRESHOLD = 0.15; // 15% error rate triggers failover

  recordOutcome(providerId: string, success: boolean, latencyMs: number): void {
    // latencyMs feeds the separate p95 signal (tracking omitted here)
    const ewma = this.errorRates.get(providerId) ?? new EWMA(0.2);
    ewma.add(success ? 0 : 1);
    this.errorRates.set(providerId, ewma);
    if (ewma.value > this.THRESHOLD) {
      this.triggerFailover(providerId);
    }
  }

  private triggerFailover(providerId: string): void {
    // Move traffic to next-best provider
    // Log alert to ops channel
    // Start recovery probe cycle
  }
}
```

## Latency benchmarks
After six months in production, here are our p50/p95 latency numbers across regions:
| Region | p50 (ms) | p95 (ms) | Primary Provider |
|---|---|---|---|
| EU-West | 142 | 310 | Anthropic |
| US-East | 98 | 245 | OpenAI |
| AP-Southeast | 187 | 420 | |
| EU-Central | 155 | 335 | Anthropic |
The multi-provider approach actually improved our overall latency because we can route to the geographically closest provider with the best current performance.
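That routing decision can be sketched as a small selection function. This is illustrative, not our production router: it assumes each candidate carries a rolling p95 from recent traffic and a health flag from the failover manager, and the `Candidate` and `pickProvider` names are mine.

```typescript
// Sketch of latency-aware, region-aware provider selection.
type Region = "eu-west" | "us-east" | "ap-southeast" | "eu-central";

interface Candidate {
  id: string;
  regions: Region[];
  recentP95Ms: number; // rolling p95 from recent traffic in this region
  healthy: boolean;    // set false by the failover manager on degradation
}

// Pick the healthy provider serving this region with the best recent p95.
function pickProvider(candidates: Candidate[], region: Region): Candidate | undefined {
  return candidates
    .filter(c => c.healthy && c.regions.includes(region))
    .sort((a, b) => a.recentP95Ms - b.recentP95Ms)[0];
}
```

Sorting on a *recent* p95 rather than a static average is what lets the geographically closest provider lose the race when it is temporarily degraded.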
## Lessons learned
After running this system in production for over a year, here are the key takeaways:
- Provider diversity is reliability. Having 25+ providers isn't about using them all — it's about always having 3-4 excellent options for any given request.
- Cost optimization is a routing problem. By routing cheaper models for simpler queries and premium models for complex ones, we reduced costs by 40% without impacting quality.
- Monitor token economics, not just latency. Cost per conversation matters more than cost per request.
- Test failover constantly. We run chaos engineering exercises weekly, randomly degrading providers to verify our failover paths work.
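The second lesson, cost as a routing problem, reduces to a tier classifier in front of the router. The heuristic below is a deliberately naive stand-in (the post doesn't describe the real classifier); the shape is what matters: classify first, then route within the chosen tier.

```typescript
// Illustrative cost-tier classifier: simple requests go to a cheap
// model tier, complex ones to a premium tier. The thresholds and
// signals here are placeholders, not the production heuristic.
type Tier = "cheap" | "premium";

function classifyRequest(prompt: string, needsToolCalls: boolean): Tier {
  const longOrStructured = prompt.length > 2000 || needsToolCalls;
  return longOrStructured ? "premium" : "cheap";
}
```

Even a crude split like this moves the bulk of short, tool-free traffic onto cheaper models, which is where most of the 40% savings would come from.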
The full architecture is more nuanced than what I've covered here — there's caching, A/B testing integration via GrowthBook, and a cost attribution pipeline — but the core pattern of registry, router, and failover manager has proven remarkably stable under production load.