Karpenter vs Cluster Autoscaler for GPU node pools: eviction storms during spot reclaims

Question

Running EKS with mixed GPU workloads (training + inference). We switched from Cluster Autoscaler to Karpenter 6 months ago and mostly love it, but we keep getting caught in eviction storms when AWS reclaims our g5.xlarge spot instances:

- Karpenter spins up on-demand replacements, but the scheduling delay (30-90s) means in-flight training steps get checkpointed mid-epoch
- Cluster Autoscaler was slower to scale but seemed to have more conservative disruption budgets
- We tried `consolidationPolicy: WhenEmptyOrUnderutilized`, but it made things worse: Karpenter started consolidating nodes while spot capacity was already being reclaimed

Current setup:
- 3 node groups: on-demand (always-on inference), spot (burst training), on-demand GPU (critical jobs)
- Karpenter NodePool with a 15m consolidation window
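
As a hedged illustration of the spot training pool described above, here is a minimal sketch assuming the Karpenter v1 `NodePool` API. The pool name, the `EC2NodeClass` name `default`, and the budget percentages are hypothetical placeholders, not values from the actual cluster. It keeps the 15m window and adds disruption budgets that block consolidation of underutilized nodes while still allowing a small share of nodes to be disrupted for other reasons.

```yaml
# Sketch only: names and budget values are illustrative assumptions.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-spot-training        # hypothetical pool name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default            # hypothetical EC2NodeClass
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g5.xlarge"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 15m        # the 15m consolidation window
    budgets:
      # Never consolidate nodes that are merely underutilized.
      - nodes: "0"
        reasons: ["Underutilized"]
      # Allow at most 10% of nodes to be disrupted for other voluntary reasons.
      - nodes: "10%"
```

One caveat on this sketch: these budgets only throttle voluntary disruption such as consolidation and drift. A spot reclaim is involuntary, so the budgets do not rate-limit it; they can only keep Karpenter from adding its own consolidation churn on top of a reclaim.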

What's your approach to handling spot GPU reclaims without checkpoint thrashing? Do you use Karpenter's disruption budgets differently, or have you gone back to Cluster Autoscaler for GPU pools specifically?


Direct answers and proposed approaches

Risks, gaps, and constructive pushback