Why 9 regions?
When your users span from Tokyo to São Paulo, every millisecond of cache latency compounds. Our AI chat platform serves users globally, and the difference between a local Redis read (sub-1ms) and a cross-region one (50-200ms) is the difference between a snappy UX and a sluggish one.
We started with 3 regions, scaled to 6, and eventually landed on 9. Each expansion taught us something about the limits of distributed caching.
Replication topology
We use a hub-and-spoke replication model rather than mesh replication. Each region has a local Redis cluster that handles reads. Writes flow through a primary region and replicate outward.
```
                ┌──────────────┐
                │   EU-West    │
                │  (Primary)   │
                └──────┬───────┘
                       │
       ┌───────────────┼───────────────┐
       │               │               │
┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐
│   US-East   │ │  AP-South   │ │ EU-Central  │
│  (Replica)  │ │  (Replica)  │ │  (Replica)  │
└──────┬──────┘ └──────┬──────┘ └─────────────┘
       │               │
┌──────┴──────┐ ┌──────┴──────┐
│   US-West   │ │   AP-East   │
│  (Replica)  │ │  (Replica)  │
└─────────────┘ └─────────────┘
```
The consistency problem
With asynchronous replication, you get eventual consistency. For a chat application, this creates a specific problem: a user sends a message in one region, but if they're load-balanced to a different region on the next request, they might not see their own message.
Our solution: read-your-writes consistency via a sticky session layer.
```typescript
interface SessionContext {
  userId: string;
  primaryRegion: string;
  lastWriteTimestamp: number;
  writeRegion: string;
}

function resolveReadRegion(ctx: SessionContext, localRegion: string): string {
  const timeSinceWrite = Date.now() - ctx.lastWriteTimestamp;
  const replicationLag = getEstimatedLag(ctx.writeRegion, localRegion);
  // If the write hasn't had time to replicate, read from the write region
  if (timeSinceWrite < replicationLag * 1.5) {
    return ctx.writeRegion;
  }
  return localRegion;
}
```
This gives us the performance benefits of local reads in the common case (reads far outnumber writes) while guaranteeing users always see their own writes.
Memory management at scale
Redis is an in-memory store, and memory is expensive at scale. Across 9 regions, every byte stored is multiplied by 9. We use several strategies to keep memory in check:
Key expiration policies
```typescript
const CACHE_POLICIES = {
  // Hot conversation data: keep for 24h
  conversation: { ttl: 86400, maxMemory: "2gb" },
  // User session data: keep for 1h
  session: { ttl: 3600, maxMemory: "512mb" },
  // AI response cache: keep for 6h
  aiCache: { ttl: 21600, maxMemory: "4gb" },
  // Rate limiting counters: keep for 1 minute
  rateLimit: { ttl: 60, maxMemory: "256mb" },
} as const;
```
Compression
For larger values (conversation histories, cached AI responses), we compress with LZ4 before storing. The CPU overhead is negligible compared to the memory savings:
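The write path is shown below; the read side has to honor the same `z:` prefix convention and undo the compression. A self-contained sketch of that read path (the `getCompressed` helper is hypothetical; a `Map` stands in for Redis and Node's built-in zlib for LZ4 so the example runs without external dependencies):

```typescript
import { gzipSync, gunzipSync } from "node:zlib";

// Stand-in for the Redis client so this sketch is self-contained.
type Store = Map<string, Buffer>;

// Hypothetical read-side counterpart to setCompressed: check for the
// z:-prefixed (compressed) key first, then fall back to the plain key
// used for small values stored uncompressed.
function getCompressed(store: Store, key: string): string | null {
  const compressed = store.get(`z:${key}`);
  if (compressed !== undefined) {
    return gunzipSync(compressed).toString();
  }
  const plain = store.get(key);
  return plain === undefined ? null : plain.toString();
}
```

The important part is the key convention: the prefix tells the reader which branch to take without a second round trip to check metadata.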
```typescript
import { compress, decompress } from "lz4-napi";

async function setCompressed(
  redis: Redis,
  key: string,
  value: string,
  ttl: number
): Promise<void> {
  const buffer = Buffer.from(value);
  // Only compress values large enough to benefit; the matching read
  // path (not shown) calls decompress on keys carrying the z: prefix.
  if (buffer.byteLength > 1024) {
    const compressed = await compress(buffer);
    await redis.setex(`z:${key}`, ttl, compressed);
  } else {
    await redis.setex(key, ttl, value);
  }
}
```
Monitoring and alerting
With 9 regions, observability becomes critical. We monitor three key metrics per region:
- Replication lag — how far behind each replica is from the primary
- Memory utilization — approaching max triggers eviction warnings
- Command latency — p99 latency per command type
```typescript
interface RegionHealth {
  region: string;
  replicationLagMs: number;
  memoryUsagePercent: number;
  commandLatencyP99Ms: number;
  evictionsPerSecond: number;
  status: "healthy" | "degraded" | "critical";
}

function evaluateHealth(metrics: RegionHealth): void {
  if (metrics.replicationLagMs > 5000) {
    alert(`High replication lag in ${metrics.region}: ${metrics.replicationLagMs}ms`);
  }
  if (metrics.evictionsPerSecond > 100) {
    alert(`High eviction rate in ${metrics.region}: ${metrics.evictionsPerSecond}/s`);
  }
}
```
Operational patterns
Rolling deployments
Never upgrade all 9 regions simultaneously. We deploy to a canary region first, monitor for 30 minutes, then roll out to remaining regions in batches of 2:
- Canary (EU-West staging) — 30 min soak
- EU-West, EU-Central — 15 min soak
- US-East, US-West — 15 min soak
- AP regions — 15 min soak
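The batching above is mechanical enough to encode. A sketch of how the rollout waves could be computed (the `planRollout` helper is hypothetical and the region list illustrative):

```typescript
// Hypothetical helper: turn an ordered region list into staged rollout
// waves — the canary alone first, then the rest in batches of 2.
function planRollout(regions: string[], canary: string, batchSize = 2): string[][] {
  const rest = regions.filter((r) => r !== canary);
  const waves: string[][] = [[canary]];
  for (let i = 0; i < rest.length; i += batchSize) {
    waves.push(rest.slice(i, i + batchSize));
  }
  return waves;
}

// Example: canary first, then three batches covering the other regions.
const waves = planRollout(
  ["eu-west", "eu-central", "us-east", "us-west", "ap-south", "ap-east"],
  "eu-west"
);
// waves: [["eu-west"], ["eu-central", "us-east"], ["us-west", "ap-south"], ["ap-east"]]
```

The deploy orchestrator then iterates the waves, soaking between each before proceeding.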
Disaster recovery
Each region can operate independently if the primary goes down. We maintain a promotion runbook that can elevate any replica to primary in under 60 seconds. We drill this quarterly.
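The runbook's first decision is which replica to promote. A sketch of that selection step (the `pickPromotionCandidate` helper and the 1000 ms lag ceiling are illustrative assumptions, not our actual runbook):

```typescript
interface ReplicaStatus {
  region: string;
  replicationLagMs: number;
  reachable: boolean;
}

// Hypothetical promotion preflight: among reachable replicas within
// the lag ceiling, pick the least-lagged one (it loses the fewest
// writes); return null if no replica is safe to promote.
function pickPromotionCandidate(
  replicas: ReplicaStatus[],
  maxLagMs = 1000
): string | null {
  const eligible = replicas
    .filter((r) => r.reachable && r.replicationLagMs <= maxLagMs)
    .sort((a, b) => a.replicationLagMs - b.replicationLagMs);
  return eligible.length > 0 ? eligible[0].region : null;
}
```

Automating the candidate choice is what makes a sub-60-second promotion realistic; the human in the loop confirms rather than computes.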
Results
After optimizing over 18 months:
| Metric | Before (3 regions) | After (9 regions) |
|---|---|---|
| Global p50 read latency | 23ms | 0.8ms |
| Global p95 read latency | 89ms | 4.2ms |
| Cache hit rate | 78% | 94% |
| Monthly Redis cost | $12,400 | $8,900 |
The cost actually went down because better cache locality meant fewer cache misses, which meant fewer expensive origin reads.
Key takeaways
- Start with hub-and-spoke. Mesh replication doesn't scale operationally beyond 3-4 nodes.
- Read-your-writes consistency is non-negotiable for user-facing applications.
- Compress early. The memory savings at 9x replication are massive.
- Monitor evictions, not just memory. Evictions are the canary in the coal mine.
- Drill your failover. A disaster recovery plan you haven't tested is a disaster recovery wish.
The distributed caching layer is one of those infrastructure investments that's invisible when it works well and catastrophic when it doesn't. Take the time to get it right.