Handling rolling restarts without dropping active WebSocket connections

Question

Our team runs a real-time event pipeline where clients maintain persistent WebSocket connections to ingest streaming metrics. During routine infrastructure maintenance (kernel updates, node rotation), we've been struggling with graceful handoff.

Current approach: we drain connections over 30s, and clients reconnect to the next available node via DNS TTL expiry. The problem is that 30s isn't enough for long-lived analytical queries: clients see partial data gaps and have to replay from checkpoints.
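
For context, the drain step looks roughly like the sketch below. This is a simplified, stdlib-only illustration rather than the actual service code: the `drain` function, the `handlers` set of per-connection tasks, and the return shape are hypothetical names for this example. A real handler would also send a WebSocket close frame (code 1001, Going Away) in its cleanup path so clients know to reconnect.

```python
import asyncio


async def drain(handlers: set[asyncio.Task], timeout: float = 30.0) -> tuple[int, int]:
    """Wait up to `timeout` seconds for in-flight handlers, then cancel the rest.

    Returns (finished, cancelled) counts. `handlers` is assumed to hold one
    task per open connection (hypothetical bookkeeping for this sketch).
    """
    if not handlers:
        # asyncio.wait() rejects an empty collection, so short-circuit here.
        return (0, 0)

    done, pending = await asyncio.wait(handlers, timeout=timeout)

    # Anything still running after the window is cut off; this is where
    # long-lived queries lose data in the setup described above.
    for task in pending:
        task.cancel()

    # Give cancelled tasks a chance to run their cleanup code.
    await asyncio.gather(*pending, return_exceptions=True)
    return (len(done), len(pending))
```

The cancelled count is worth exporting as a metric: it says how many clients were forced into checkpoint replay on each restart.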

What's worked for you:
- Proxy-level connection pinning (HAProxy/Envoy sticky sessions with health-aware failover)?
- Application-level session state replication between nodes (expensive but clean)?
- Client-side buffering with gap detection and backfill?
- Something else entirely?
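
To clarify what the third option means here, a minimal sketch of client-side gap detection follows. It assumes every event carries a per-stream, monotonically increasing integer sequence number, which is an assumption for illustration and not something the current protocol is confirmed to provide. `GapDetector` and `observe` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class GapDetector:
    """Tracks the next expected sequence number and records missing ranges."""

    next_seq: int = 0
    gaps: list[tuple[int, int]] = field(default_factory=list)

    def observe(self, seq: int) -> bool:
        """Return True if `seq` advances the stream, False otherwise.

        When sequence numbers are skipped, the inclusive missing range is
        appended to `gaps` so the client can request a backfill for it.
        """
        if seq < self.next_seq:
            # Older than expected: a duplicate or a backfilled event.
            # This sketch does not reconcile it against recorded gaps.
            return False
        if seq > self.next_seq:
            self.gaps.append((self.next_seq, seq - 1))
        self.next_seq = seq + 1
        return True


detector = GapDetector()
for seq in [0, 1, 2, 5, 6, 6, 9]:  # 3-4 and 7-8 lost during a reconnect
    detector.observe(seq)
print(detector.gaps)
```

With this shape the client would only replay the recorded ranges after a reconnect instead of replaying from the last checkpoint, provided the server can serve events by sequence range.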

We're on Kubernetes 1.30, with Envoy as the edge proxy and a Python/FastAPI backend, at ~2k concurrent connections at peak. We're looking for battle-tested patterns rather than theoretical approaches, especially what broke in prod and how you fixed it.

Direct answers and proposed approaches

Risks, gaps, and constructive pushback