TLS certificate rotation across 200+ microservices without downtime — what broke for you?

Question

We're moving from 1-year to 90-day certificate lifecycles (Let's Encrypt + internal PKI). Our stack: 200+ microservices on K8s, each with mutual TLS via service mesh (Istio 1.20). The plan is automated rotation via cert-manager + external-dns, but we're nervous about edge cases.

Known failure modes from our staging tests:
- Services holding long-lived (pooled or keep-alive) TLS connections don't pick up new certs until the underlying TCP connection drops
- Some legacy gRPC services pin the CA fingerprint in their config, so rotation breaks them silently
- The DNS-01 challenge for wildcard certs occasionally times out during high-load periods
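
One way to make the pinned-fingerprint failure loud instead of silent is a pre-rotation check that compares the incoming CA's fingerprint against what each service pins. The sketch below is only illustrative and rests on assumptions not stated above: pins are taken to be hex SHA-256 digests of the DER-encoded CA certificate, and `find_pin_mismatches` plus the service names in the usage note are hypothetical.

```python
import hashlib
import ssl


def sha256_fingerprint(pem_cert: str) -> str:
    """Return the hex SHA-256 digest of the DER encoding of a PEM certificate."""
    der = ssl.PEM_cert_to_DER_cert(pem_cert.strip())
    return hashlib.sha256(der).hexdigest()


def normalize_pin(pin: str) -> str:
    """Accept 'AA:BB:...' or 'aabb...' style pins and return lowercase hex."""
    return pin.replace(":", "").strip().lower()


def find_pin_mismatches(pins_by_service: dict[str, list[str]], new_ca_pem: str) -> list[str]:
    """List services (hypothetical names) whose pins do not include the new CA.

    Assumption: each service may pin several fingerprints, so the old and new
    CA can both be trusted during the overlap window.
    """
    new_fp = sha256_fingerprint(new_ca_pem)
    return sorted(
        service
        for service, pins in pins_by_service.items()
        if new_fp not in {normalize_pin(p) for p in pins}
    )
```

Run against something like `{"billing-grpc": [old_pin], "orders-grpc": [old_pin, new_pin]}`, it would return `["billing-grpc"]`, i.e. the services that still need the new pin added before the CA is rotated.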

Questions for teams running automated cert rotation at scale:
- What was the one thing that broke in production that you didn't catch in staging?
- Do you use connection pre-warming (opening a new connection with the new cert before closing the old one)?
- How do you handle services that can't be restarted (stateful databases, long-running batch jobs)?
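
For concreteness, the pre-warming pattern in the second question can be sketched as a make-before-break swap. This is a minimal illustration, not a description of how Istio or any client library does it: `MakeBeforeBreak`, `connect`, and `close` are hypothetical names, and `connect` is assumed to perform a fresh TLS handshake that picks up the current cert.

```python
import threading
from typing import Callable, Generic, TypeVar

C = TypeVar("C")


class MakeBeforeBreak(Generic[C]):
    """Holds one live connection and replaces it without a gap."""

    def __init__(self, connect: Callable[[], C], close: Callable[[C], None]) -> None:
        self._connect = connect
        self._close = close
        self._lock = threading.Lock()
        self._current = connect()

    def get(self) -> C:
        """Return the connection callers should use right now."""
        with self._lock:
            return self._current

    def rotate(self) -> None:
        """Open a replacement first, swap it in, then close the old one."""
        # If the new handshake fails, this raises and the old connection
        # stays in service untouched.
        fresh = self._connect()
        with self._lock:
            old, self._current = self._current, fresh
        # The old connection is closed only after the swap is visible.
        self._close(old)
```

A real implementation would also need to drain in-flight requests on the old connection before closing it; this sketch closes immediately after the swap.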


Direct answers and proposed approaches

Risks, gaps, and constructive pushback