A high-performance inference gateway in C++17 that routes client requests across a cluster of LLM serving replicas. The gateway provides prompt-prefix affinity via consistent hashing, weighted load balancing, fault tolerance through mid-stream failover and request hedging, a circuit breaker that detects degraded replicas, streaming token delivery, backpressure management, and zero-downtime rolling updates. Replicas participate in a SWIM gossip protocol for decentralized membership and failure detection.
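To illustrate the prefix-affinity idea, below is a minimal sketch of a consistent-hash ring that maps prompt prefixes to replicas, assuming `std::hash` as the hash function and a fixed number of virtual nodes per replica; the names (`HashRing`, `RouteByPrefix`, the 32-character prefix length) are illustrative assumptions, not the project's actual implementation.

```cpp
// Sketch: consistent-hash ring keyed on a prompt prefix, so requests that
// share a prefix route to the same replica (keeping its KV cache warm).
// All names and parameters here are hypothetical.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

class HashRing {
public:
    explicit HashRing(int virtual_nodes = 64) : virtual_nodes_(virtual_nodes) {}

    // Place a replica on the ring under several virtual-node keys so load
    // spreads evenly when replicas join or leave.
    void AddReplica(const std::string& replica_id) {
        for (int v = 0; v < virtual_nodes_; ++v)
            ring_[Hash(replica_id + "#" + std::to_string(v))] = replica_id;
    }

    void RemoveReplica(const std::string& replica_id) {
        for (int v = 0; v < virtual_nodes_; ++v)
            ring_.erase(Hash(replica_id + "#" + std::to_string(v)));
    }

    // Route by prompt prefix: hash the first prefix_len characters and walk
    // clockwise to the next virtual node. Assumes the ring is non-empty.
    const std::string& RouteByPrefix(const std::string& prompt,
                                     std::size_t prefix_len) const {
        uint64_t h = Hash(prompt.substr(0, std::min(prefix_len, prompt.size())));
        auto it = ring_.lower_bound(h);             // first node at or after h
        if (it == ring_.end()) it = ring_.begin();  // wrap around the ring
        return it->second;
    }

private:
    static uint64_t Hash(const std::string& key) {
        return std::hash<std::string>{}(key);
    }

    int virtual_nodes_;
    std::map<uint64_t, std::string> ring_;  // hash point -> replica id
};

int main() {
    HashRing ring;
    ring.AddReplica("replica-a");
    ring.AddReplica("replica-b");
    ring.AddReplica("replica-c");

    // Two prompts sharing the same 32-character prefix land on one replica.
    std::cout << ring.RouteByPrefix("Translate to French: Hello world, how are you?", 32) << "\n";
    std::cout << ring.RouteByPrefix("Translate to French: Hello world, nice to meet you.", 32) << "\n";
}
```

The virtual nodes smooth out the key distribution so that adding or removing a replica remaps only a small fraction of prefixes, which is what makes consistent hashing suitable for affinity routing under membership churn.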

Final Project Report