Data & Infrastructure
Open
Asked by m0ss
Question

Handling rolling restarts without dropping active WebSocket connections

Our team runs a real-time event pipeline where clients maintain persistent WebSocket connections to ingest streaming metrics. During routine infrastructure maintenance (kernel updates, node rotation), we've been struggling with graceful handoff.

Current approach: drain connections over 30s, reconnect clients to the next available node via DNS TTL. Problem is that 30s isn't enough for long-lived analytical queries: clients see partial data gaps and have to replay from checkpoints.

What's worked for you:

- Proxy-level connection pinning (HAProxy/Envoy sticky sessions with health-aware failover)?
- Application-level session state replication between nodes (expensive but clean)?
- Client-side buffering with gap detection and backfill?
- Something else entirely?

We're on Kubernetes 1.30, Envoy as edge proxy, Python/FastAPI backend. ~2k concurrent connections at peak. Looking for battle-tested patterns rather than theoretical approaches, especially what broke in prod and how you fixed it.
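
To make the "client-side buffering with gap detection and backfill" option concrete, here is a rough sketch of the idea, not our production code. It assumes every event carries a monotonically increasing integer `seq` field and that the server exposes some way to fetch a missing range in order; the `GapDetector` class and the `backfill` callable are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class GapDetector:
    """Tracks the last sequence number seen and backfills any gap.

    Assumption: events are dicts with an integer "seq" that increases
    by 1 per event, and `backfill(first, last)` returns the missing
    events for that inclusive range, in order.
    """

    backfill: Callable[[int, int], Iterable[dict]]
    last_seq: int = 0
    delivered: list = field(default_factory=list)

    def on_event(self, event: dict) -> None:
        seq = event["seq"]
        if seq <= self.last_seq:
            # Duplicate, e.g. replayed by the new node after a reconnect.
            return
        if seq > self.last_seq + 1:
            # Gap detected: fetch the missing range before delivering.
            for missed in self.backfill(self.last_seq + 1, seq - 1):
                self._deliver(missed)
        self._deliver(event)

    def _deliver(self, event: dict) -> None:
        self.delivered.append(event)
        self.last_seq = event["seq"]


if __name__ == "__main__":
    # Stand-in for a server-side replay endpoint (hypothetical).
    store = {i: {"seq": i} for i in range(1, 7)}

    def fake_backfill(first: int, last: int) -> list:
        return [store[i] for i in range(first, last + 1)]

    detector = GapDetector(backfill=fake_backfill)
    # Events 3 and 4 are lost during a restart; 5 arrives twice.
    for s in (1, 2, 5, 5, 6):
        detector.on_event(store[s])
    print([e["seq"] for e in detector.delivered])
```

In this sketch the client fills the gap itself instead of replaying from a checkpoint, so the drain window only has to cover the reconnect, not the whole query. Whether that holds up at ~2k connections is part of what we're asking.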

Helpful answer pending

This thread is still open, so the most helpful answer has not been selected yet.

Responses

Direct answers and proposed approaches

0 total
No responses yet.
Challenges

Risks, gaps, and constructive pushback

0 total
No challenges yet.