## The problem
Most teams integrate a single LLM provider, wrap it in an API endpoint, and call it done. This works until it doesn't — and "doesn't" usually means a Thursday afternoon when OpenAI's API starts returning 503s, your latency triples, and your product team discovers that your entire chat feature has a single point of failure.
At Ruby Labs, we needed something fundamentally different. Our AI chat platform serves millions of users across multiple products, each with different latency requirements, cost constraints, and model preferences. A single-provider architecture was never going to work.
## Architecture overview
The system is built around three core abstractions:
- Provider Registry — a typed catalog of every available model and its capabilities
- Router — decides which provider handles each request based on cost, latency, and availability
- Failover Manager — detects degradation and reroutes traffic in real time
```typescript
interface ProviderConfig {
  id: string;
  models: ModelCapability[];
  regions: Region[];
  costPer1kTokens: { input: number; output: number };
  avgLatencyMs: number;
  maxConcurrent: number;
}

class ProviderRegistry {
  private providers = new Map<string, ProviderConfig>();

  register(config: ProviderConfig): void {
    this.providers.set(config.id, config);
  }

  getAvailable(capability: string, region: Region): ProviderConfig[] {
    return [...this.providers.values()]
      .filter(p => p.models.some(m => m.capability === capability))
      .filter(p => p.regions.includes(region))
      .sort((a, b) => a.avgLatencyMs - b.avgLatencyMs);
  }
}
```

## Provider abstraction
Every provider implements a common interface. This sounds obvious, but the devil is in the details — streaming behavior, token counting, rate limiting, and error shapes all differ wildly between providers.
```typescript
interface AIProvider {
  readonly id: string;
  chat(params: ChatParams): Promise<ChatResponse>;
  stream(params: ChatParams): AsyncIterable<StreamChunk>;
  countTokens(messages: Message[]): number;
  healthCheck(): Promise<HealthStatus>;
}

type ChatParams = {
  messages: Message[];
  model: string;
  temperature?: number;
  maxTokens?: number;
  signal?: AbortSignal;
};
```

We normalize everything at the adapter level. Each provider adapter handles its own quirks — Anthropic's message format, OpenAI's function calling schema, Cohere's chat vs. generate distinction — and exposes a clean, unified API upstream.
## Token counting
Token counting is provider-specific. We maintain a local tokenizer cache per model family to avoid round-trips:
```typescript
const tokenizerCache = new Map<string, Tokenizer>();

function getTokenizer(modelFamily: string): Tokenizer {
  if (!tokenizerCache.has(modelFamily)) {
    tokenizerCache.set(modelFamily, loadTokenizer(modelFamily));
  }
  return tokenizerCache.get(modelFamily)!;
}
```

## Failover strategy
Failover is the hardest part to get right. You need to detect degradation before users notice, reroute traffic smoothly, and recover gracefully when the original provider comes back.
Our failover system uses three signals:
- Error rate — EWMA over 30-second windows
- Latency percentiles — p95 latency compared to baseline
- Health checks — active probing every 10 seconds
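The error-rate signal leans on an `EWMA` helper that the failover code below uses but doesn't define. A minimal sketch, assuming an exponential moving average seeded by the first sample, where `alpha` controls how quickly old outcomes are forgotten:

```typescript
// Exponentially weighted moving average over a stream of samples.
// With alpha = 0.2, each new sample contributes 20% of the new value.
class EWMA {
  private current: number | null = null;

  constructor(private readonly alpha: number) {}

  add(sample: number): void {
    this.current = this.current === null
      ? sample // first sample seeds the average
      : this.alpha * sample + (1 - this.alpha) * this.current;
  }

  get value(): number {
    return this.current ?? 0;
  }
}
```

Feeding it 0 for success and 1 for failure, as the failover manager does, makes `value` a smoothed error rate in [0, 1] that can be compared directly against the 15% threshold.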
```typescript
class FailoverManager {
  private errorRates = new Map<string, EWMA>();
  private readonly THRESHOLD = 0.15; // 15% error rate triggers failover

  recordOutcome(providerId: string, success: boolean, latencyMs: number): void {
    // latencyMs feeds the separate p95 signal (tracking omitted here)
    const ewma = this.errorRates.get(providerId) ?? new EWMA(0.2);
    ewma.add(success ? 0 : 1);
    this.errorRates.set(providerId, ewma);
    if (ewma.value > this.THRESHOLD) {
      this.triggerFailover(providerId);
    }
  }

  private triggerFailover(providerId: string): void {
    // Move traffic to next-best provider
    // Log alert to ops channel
    // Start recovery probe cycle
  }
}
```

## Latency benchmarks
After six months in production, here are our p50/p95 latency numbers across regions:
| Region | p50 (ms) | p95 (ms) | Primary Provider |
|---|---|---|---|
| EU-West | 142 | 310 | Anthropic |
| US-East | 98 | 245 | OpenAI |
| AP-Southeast | 187 | 420 | |
| EU-Central | 155 | 335 | Anthropic |
The multi-provider approach actually improved our overall latency because we can route to the geographically closest provider with the best current performance.
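That routing decision can be sketched as a small selection function. This is illustrative, not our production router: it assumes each candidate carries a rolling p95 from recent traffic and a health flag from the failover manager, and the `Candidate` and `pickProvider` names are mine.

```typescript
// Sketch of latency-aware, region-aware provider selection.
type Region = "eu-west" | "us-east" | "ap-southeast" | "eu-central";

interface Candidate {
  id: string;
  regions: Region[];
  recentP95Ms: number; // rolling p95 from recent traffic in this region
  healthy: boolean;    // set false by the failover manager on degradation
}

// Pick the healthy provider serving this region with the best recent p95.
function pickProvider(candidates: Candidate[], region: Region): Candidate | undefined {
  return candidates
    .filter(c => c.healthy && c.regions.includes(region))
    .sort((a, b) => a.recentP95Ms - b.recentP95Ms)[0];
}
```

Sorting on a *recent* p95 rather than a static average is what lets the geographically closest provider lose the race when it is temporarily degraded.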
## Lessons learned
After running this system in production for over a year, here are the key takeaways:
- Provider diversity is reliability. Having 25+ providers isn't about using them all — it's about always having 3-4 excellent options for any given request.
- Cost optimization is a routing problem. By routing cheaper models for simpler queries and premium models for complex ones, we reduced costs by 40% without impacting quality.
- Monitor token economics, not just latency. Cost per conversation matters more than cost per request.
- Test failover constantly. We run chaos engineering exercises weekly, randomly degrading providers to verify our failover paths work.
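The second lesson, cost as a routing problem, reduces to a tier classifier in front of the router. The heuristic below is a deliberately naive stand-in (the post doesn't describe the real classifier); the shape is what matters: classify first, then route within the chosen tier.

```typescript
// Illustrative cost-tier classifier: simple requests go to a cheap
// model tier, complex ones to a premium tier. The thresholds and
// signals here are placeholders, not the production heuristic.
type Tier = "cheap" | "premium";

function classifyRequest(prompt: string, needsToolCalls: boolean): Tier {
  const longOrStructured = prompt.length > 2000 || needsToolCalls;
  return longOrStructured ? "premium" : "cheap";
}
```

Even a crude split like this moves the bulk of short, tool-free traffic onto cheaper models, which is where most of the 40% savings would come from.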
The full architecture is more nuanced than what I've covered here — there's caching, A/B testing integration via GrowthBook, and a cost attribution pipeline — but the core pattern of registry, router, and failover manager has proven remarkably stable under production load.