Handling rolling restarts without dropping active WebSocket connections
Our team runs a real-time event pipeline where clients maintain persistent WebSocket connections to ingest streaming metrics. During routine infrastructure maintenance (kernel updates, node rotation), we've been struggling with graceful handoff.

Current approach: drain connections over 30s, reconnect clients to the next available node via DNS TTL. Problem is that 30s isn't enough for long-lived analytical queries, so clients see partial data gaps and have to replay from checkpoints.

What's worked for you:

- Proxy-level connection pinning (HAProxy/Envoy sticky sessions with health-aware failover)?
- Application-level session state replication between nodes (expensive but clean)?
- Client-side buffering with gap detection and backfill?
- Something else entirely?

We're on Kubernetes 1.30, Envoy as edge proxy, Python/FastAPI backend. ~2k concurrent connections at peak.

Looking for battle-tested patterns rather than theoretical approaches, especially what broke in prod and how you fixed it.
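
To make the client-side option concrete, this is roughly the shape of gap detection I have in mind. It's a minimal sketch, not what we run today: it assumes every message carries a monotonically increasing integer `seq` starting at 1 (an assumption about the protocol), and `GapDetector` is a hypothetical name, not something from a library.

```python
from dataclasses import dataclass, field


@dataclass
class GapDetector:
    """Tracks the highest sequence number seen and the ranges still missing.

    Assumption: the server stamps each message with an integer `seq`
    that increases by exactly 1 per message, starting at 1.
    """

    last_seq: int = 0  # highest seq seen so far; 0 means nothing received yet
    gaps: list[tuple[int, int]] = field(default_factory=list)  # inclusive ranges

    def observe(self, seq: int) -> bool:
        """Record a message. Returns True if it is new, False if a duplicate."""
        if seq > self.last_seq:
            if seq > self.last_seq + 1:
                # Everything between the last message and this one is missing.
                self.gaps.append((self.last_seq + 1, seq - 1))
            self.last_seq = seq
            return True

        # seq <= last_seq: either a backfilled message or a replayed duplicate.
        for i, (lo, hi) in enumerate(self.gaps):
            if lo <= seq <= hi:
                # Shrink or split the gap around the backfilled message.
                remaining = []
                if lo < seq:
                    remaining.append((lo, seq - 1))
                if seq < hi:
                    remaining.append((seq + 1, hi))
                self.gaps[i:i + 1] = remaining
                return True
        return False


detector = GapDetector()
for seq in (1, 2, 5):  # messages 3 and 4 were lost during a node drain
    detector.observe(seq)
missing_after_reconnect = list(detector.gaps)
```

After a reconnect, the client would request each range in `gaps` from whatever backfill or checkpoint mechanism exists, and feed the replayed messages back through `observe` so duplicates are dropped and the gaps close.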