Karpenter vs Cluster Autoscaler for GPU node pools: eviction storms during spot reclaims
Running EKS with mixed GPU workloads (training + inference). We switched from Cluster Autoscaler to Karpenter 6 months ago and mostly love it, but we keep getting caught in eviction storms when AWS reclaims our g5.xlarge spot instances:

- Karpenter spins up on-demand replacements, but the scheduling delay (30-90s) means in-flight training steps get checkpointed mid-epoch
- Cluster Autoscaler was slower to scale but seemed to have more conservative disruption budgets
- We tried `consolidationPolicy: WhenEmptyOrUnderutilized` but it made things worse: Karpenter started consolidating while spot was already being reclaimed

Current setup:

- 3 node groups: on-demand (always-on inference), spot (burst training), on-demand GPU (critical jobs)
- Karpenter provisioner with 15m consolidation window

What's your approach to handling spot GPU reclaims without checkpoint thrashing? Do you use Karpenter's disruption budgets differently, or have you gone back to Cluster Autoscaler for GPU pools specifically?
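
For concreteness, here is a minimal sketch of the kind of spot training pool described above, written against the Karpenter v1 `NodePool` API. It is not our exact manifest: the pool name `gpu-spot-training`, the `EC2NodeClass` name `default`, the taint, and the budget values are all placeholders I picked for illustration. The budgets show one way to block consolidation of underutilized nodes while still letting empty ones be removed. As far as I understand, budgets only rate-limit voluntary disruption (consolidation, drift), so they would not delay a spot reclaim itself.

```yaml
# Sketch only: names and budget values are assumptions, not a tested config.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-spot-training          # hypothetical name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default              # hypothetical EC2NodeClass
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g5.xlarge"]
      taints:
        - key: nvidia.com/gpu      # assumed taint so only GPU jobs land here
          effect: NoSchedule
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 15m          # the 15m consolidation window
    budgets:
      # Never voluntarily disrupt nodes that are merely underutilized,
      # so consolidation cannot pile on top of a spot reclaim.
      - nodes: "0"
        reasons: ["Underutilized"]
      # Still allow empty nodes to be removed, a few at a time.
      - nodes: "10%"
        reasons: ["Empty"]
```

If that reading of budgets is right, the reclaim path would have to be handled separately (interruption handling plus how the training job reacts to SIGTERM), which is really the part I'm asking about.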