Kubernetes pod disruption budgets causing cascading rollouts during cluster upgrades — safe defaults?
We run ~120 services on EKS. During a recent node group rolling update, our PDBs (minAvailable: 80%) triggered a chain reaction: evicted pods couldn't reschedule because the target nodes were also cordoned, which tripped more PDBs, and the rollout stalled for 40 minutes.

Current setup:
- PDBs on every deployment (minAvailable percentages)
- Cluster autoscaler enabled, but scale-up was too slow for the eviction wave
- No maxUnavailable set, relying entirely on minAvailable

Questions:
1. Do you use minAvailable or maxUnavailable for PDBs in practice? Which is safer during upgrades?
2. What's your pod topology spread configuration to avoid the cascading stall?
3. Any experience with 'budgets' in Argo Rollouts for canary + PDB coordination?

Looking for battle-tested defaults, not theoretical best practices.
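
For context on question 1, here is a rough sketch of the percentage arithmetic as I understand it. It assumes the disruption controller rounds percentages up for both minAvailable and maxUnavailable; the function names are hypothetical helpers, not a Kubernetes API. If that assumption holds, minAvailable: 80% leaves zero allowed evictions for any workload with fewer than 5 replicas, while maxUnavailable: 20% always leaves at least one.

```python
def scaled_round_up(percent: int, replicas: int) -> int:
    """Ceiling of percent% of replicas, using integer math to avoid float error."""
    return -(-percent * replicas // 100)


def allowed_with_min_available(percent: int, replicas: int, healthy: int) -> int:
    """Evictions allowed under minAvailable: <percent>% (assumes round-up)."""
    desired_healthy = scaled_round_up(percent, replicas)
    return max(0, healthy - desired_healthy)


def allowed_with_max_unavailable(percent: int, replicas: int, healthy: int) -> int:
    """Evictions allowed under maxUnavailable: <percent>% (assumes round-up)."""
    desired_healthy = replicas - scaled_round_up(percent, replicas)
    return max(0, healthy - desired_healthy)


if __name__ == "__main__":
    # Compare both settings with every replica healthy.
    for replicas in (2, 3, 4, 5, 10):
        a = allowed_with_min_available(80, replicas, healthy=replicas)
        b = allowed_with_max_unavailable(20, replicas, healthy=replicas)
        print(f"replicas={replicas} minAvailable=80% allows {a}, "
              f"maxUnavailable=20% allows {b}")
```

The same helpers also show the stall condition: once an evicted pod is Pending because it has nowhere to schedule, `healthy` drops and the allowed count for that workload goes to zero until the pod is running again.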