
Building multi-provider AI chat systems at scale

How we architected a production AI chat platform handling 10K messages per second across 25+ model providers with automatic failover and cost-performance routing.

The problem

Most teams integrate a single LLM provider, wrap it in an API endpoint, and call it done. This works until it doesn't — and "doesn't" usually means a Thursday afternoon when OpenAI's API starts returning 503s, your latency triples, and your product team discovers that your entire chat feature has a single point of failure.

At Ruby Labs, we needed something fundamentally different. Our AI chat platform serves millions of users across multiple products, each with different latency requirements, cost constraints, and model preferences. A single-provider architecture was never going to work.

Architecture overview

The system is built around three core abstractions:

  1. Provider Registry — a typed catalog of every available model and its capabilities
  2. Router — decides which provider handles each request based on cost, latency, and availability
  3. Failover Manager — detects degradation and reroutes traffic in real-time
provider-registry.ts

```typescript
interface ProviderConfig {
  id: string;
  models: ModelCapability[];
  regions: Region[];
  costPer1kTokens: { input: number; output: number };
  avgLatencyMs: number;
  maxConcurrent: number;
}

class ProviderRegistry {
  private providers = new Map<string, ProviderConfig>();

  register(config: ProviderConfig): void {
    this.providers.set(config.id, config);
  }

  // Providers that support this capability in this region, fastest first
  getAvailable(capability: string, region: Region): ProviderConfig[] {
    return [...this.providers.values()]
      .filter(p => p.models.some(m => m.capability === capability))
      .filter(p => p.regions.includes(region))
      .sort((a, b) => a.avgLatencyMs - b.avgLatencyMs);
  }
}
```
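The Router consumes this ordered candidate list. Here is a minimal sketch of the selection step, assuming a per-request latency budget — the `Region` union, the `route` function name, and the budget values are illustrative stand-ins, not our production settings:

```typescript
// Hypothetical routing step: among providers serving the region that fit
// within the latency budget, pick the cheapest by total per-1k-token price.
// Field names mirror the ProviderConfig shape; values below are made up.
type Region = "eu-west" | "us-east";

interface RoutableProvider {
  id: string;
  regions: Region[];
  costPer1kTokens: { input: number; output: number };
  avgLatencyMs: number;
}

function route(
  candidates: RoutableProvider[],
  region: Region,
  maxLatencyMs: number
): RoutableProvider | undefined {
  return candidates
    .filter((p) => p.regions.includes(region))
    .filter((p) => p.avgLatencyMs <= maxLatencyMs)
    .sort(
      (a, b) =>
        a.costPer1kTokens.input + a.costPer1kTokens.output -
        (b.costPer1kTokens.input + b.costPer1kTokens.output)
    )[0];
}
```

The key design point is that the latency budget acts as a hard filter while cost is a soft preference — a cheap provider that blows the budget never wins.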

Provider abstraction

Every provider implements a common interface. This sounds obvious, but the devil is in the details — streaming behavior, token counting, rate limiting, and error shapes all differ wildly between providers.

provider-interface.ts

```typescript
interface AIProvider {
  readonly id: string;

  chat(params: ChatParams): Promise<ChatResponse>;
  stream(params: ChatParams): AsyncIterable<StreamChunk>;

  countTokens(messages: Message[]): number;
  healthCheck(): Promise<HealthStatus>;
}

type ChatParams = {
  messages: Message[];
  model: string;
  temperature?: number;
  maxTokens?: number;
  signal?: AbortSignal;
};
```

We normalize everything at the adapter level. Each provider adapter handles its own quirks — Anthropic's message format, OpenAI's function calling schema, Cohere's chat vs. generate distinction — and exposes a clean, unified API upstream.
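As a concrete illustration of that normalization step, here is a toy adapter for a made-up provider whose wire format differs from the unified one — the `acmeai` name, the `RawAcmeResponse` shape, and its field names are all hypothetical:

```typescript
// Sketch of the adapter pattern: a fictional provider returns text in
// chunks and snake_case token counts; the adapter maps both into the
// unified ChatResponse shape the rest of the system expects.
type ChatResponse = {
  text: string;
  usage: { inputTokens: number; outputTokens: number };
};

interface RawAcmeResponse {
  output: { text: string }[];
  meta: { tokens_in: number; tokens_out: number };
}

class AcmeAdapter {
  readonly id = "acmeai";

  // Normalize the provider-specific response into the shared shape
  normalize(raw: RawAcmeResponse): ChatResponse {
    return {
      text: raw.output.map((o) => o.text).join(""),
      usage: {
        inputTokens: raw.meta.tokens_in,
        outputTokens: raw.meta.tokens_out,
      },
    };
  }
}
```

Real adapters also normalize streaming events and error shapes, but the principle is the same: every quirk is absorbed at this layer so nothing upstream branches on provider identity.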

Token counting

Token counting is provider-specific. We maintain a local tokenizer cache per model family to avoid round-trips:

tokenizer.ts

```typescript
const tokenizerCache = new Map<string, Tokenizer>();

function getTokenizer(modelFamily: string): Tokenizer {
  if (!tokenizerCache.has(modelFamily)) {
    tokenizerCache.set(modelFamily, loadTokenizer(modelFamily));
  }
  return tokenizerCache.get(modelFamily)!;
}
```
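With local token counts in hand, per-request cost estimation becomes simple arithmetic against the registry's `costPer1kTokens` shape. A minimal sketch — the function name and the sample prices are illustrative:

```typescript
// Hypothetical cost estimator: token counts come from the local
// tokenizer cache; prices follow the per-1k-token registry shape.
function estimateCostUsd(
  inputTokens: number,
  outputTokens: number,
  costPer1kTokens: { input: number; output: number }
): number {
  return (
    (inputTokens / 1000) * costPer1kTokens.input +
    (outputTokens / 1000) * costPer1kTokens.output
  );
}
```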

Failover strategy

Failover is the hardest part to get right. You need to detect degradation before users notice, reroute traffic smoothly, and recover gracefully when the original provider comes back.

Our failover system uses three signals:

  • Error rate — EWMA over 30-second windows
  • Latency percentiles — p95 latency compared to baseline
  • Health checks — active probing every 10 seconds
failover-manager.ts

```typescript
class FailoverManager {
  private errorRates = new Map<string, EWMA>();
  private readonly THRESHOLD = 0.15; // 15% error rate triggers failover

  recordOutcome(providerId: string, success: boolean, latencyMs: number): void {
    const ewma = this.errorRates.get(providerId) ?? new EWMA(0.2);
    ewma.add(success ? 0 : 1);
    this.errorRates.set(providerId, ewma);

    if (ewma.value > this.THRESHOLD) {
      this.triggerFailover(providerId);
    }
  }

  private triggerFailover(providerId: string): void {
    // Move traffic to next-best provider
    // Log alert to ops channel
    // Start recovery probe cycle
  }
}
```
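The `EWMA` helper referenced above is small enough to show in full. A minimal version — the smoothing factor is passed in, matching the `new EWMA(0.2)` call, though the exact alpha is a tuning choice:

```typescript
// Minimal exponentially weighted moving average. A higher alpha weights
// recent samples more heavily, so error spikes are detected faster but
// the signal is noisier.
class EWMA {
  private current: number | undefined;

  constructor(private readonly alpha: number) {}

  add(sample: number): void {
    this.current =
      this.current === undefined
        ? sample // first sample seeds the average
        : this.alpha * sample + (1 - this.alpha) * this.current;
  }

  get value(): number {
    return this.current ?? 0;
  }
}
```

Feeding it 0 for successes and 1 for failures, as `recordOutcome` does, makes `value` a smoothed error rate directly comparable to the 0.15 threshold.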

Latency benchmarks

After six months in production, here are our p50/p95 latency numbers across regions:

Region        p50 (ms)  p95 (ms)  Primary Provider
EU-West       142       310       Anthropic
US-East       98        245       OpenAI
AP-Southeast  187       420       Google
EU-Central    155       335       Anthropic

The multi-provider approach actually improved our overall latency because we can route to the geographically closest provider with the best current performance.

Lessons learned

After running this system in production for over a year, here are the key takeaways:

  • Provider diversity is reliability. Having 25+ providers isn't about using them all — it's about always having 3-4 excellent options for any given request.
  • Cost optimization is a routing problem. By routing cheaper models for simpler queries and premium models for complex ones, we reduced costs by 40% without impacting quality.
  • Monitor token economics, not just latency. Cost per conversation matters more than cost per request.
  • Test failover constantly. We run chaos engineering exercises weekly, randomly degrading providers to verify our failover paths work.
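The cost-routing lesson reduces to a classification step that runs before the router. A deliberately simplified sketch — the 500-token threshold, the tier names, and the tool-use heuristic are illustrative; real complexity scoring uses more signals:

```typescript
// Illustrative query tiering: short prompts with no tool use go to a
// cheap model, everything else to a premium one. All values are made up.
function pickModelTier(
  promptTokens: number,
  needsTools: boolean
): "cheap" | "premium" {
  return !needsTools && promptTokens < 500 ? "cheap" : "premium";
}
```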

The full architecture is more nuanced than what I've covered here — there's caching, A/B testing integration via GrowthBook, and a cost attribution pipeline — but the core pattern of registry, router, and failover manager has proven remarkably stable under production load.

#ai #llm #architecture #typescript #redis