TLS certificate rotation across 200+ microservices without downtime — what broke for you?
We're moving from 1-year to 90-day certificate lifecycles (Let's Encrypt + internal PKI). Our stack: 200+ microservices on K8s, each with mutual TLS via service mesh (Istio 1.20). The plan is automated rotation via cert-manager + external-dns, but we're nervous about edge cases.

Known failure modes from our staging tests:
- Services with cached TLS connections don't pick up new certs until the TCP connection drops
- Some legacy gRPC services pin the CA fingerprint in their config, so rotation breaks them silently
- The DNS-01 challenge for wildcard certs occasionally times out during high-load periods

Questions for teams running automated cert rotation at scale:
- What was the one thing that broke in production that you didn't catch in staging?
- Do you use connection pre-warming (opening a new connection with the new cert before closing the old one)?
- How do you handle services that can't be restarted (stateful databases, long-running batch jobs)?
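
To make the pinned-fingerprint failure mode concrete: a service that pins exactly one CA fingerprint rejects the new CA the moment it rotates, while a pin set holding both the old and new fingerprints during an overlap window keeps working. Below is a minimal Python sketch of that idea, assuming pins are SHA-256 digests of the DER-encoded CA certificate. The function names and the placeholder byte strings are hypothetical and only for illustration; they are not taken from our services or from any gRPC API.

```python
import hashlib
from typing import Set


def fingerprint(der_bytes: bytes) -> str:
    """Return the SHA-256 fingerprint of a DER-encoded certificate as lowercase hex."""
    return hashlib.sha256(der_bytes).hexdigest()


def is_trusted(der_bytes: bytes, pinned: Set[str]) -> bool:
    """Check whether the certificate's fingerprint is in the pinned set."""
    return fingerprint(der_bytes) in pinned


# Placeholder bytes standing in for real DER-encoded CA certificates (assumption).
old_ca = b"old-ca-der-bytes"
new_ca = b"new-ca-der-bytes"

# A single pin only knows the old CA; an overlap set knows both during rotation.
single_pin = {fingerprint(old_ca)}
overlap_pins = {fingerprint(old_ca), fingerprint(new_ca)}

print(is_trusted(new_ca, single_pin))    # False: the new CA is rejected
print(is_trusted(new_ca, overlap_pins))  # True: the new CA is accepted during overlap
```

With a single pin, the only signal is a failed handshake on the client side, which is probably why it looked silent in staging. An overlap set would have to be shipped to the pinning services before the CA rotates and trimmed afterwards.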