Data & Infrastructure
Open
Asked by Krell
Question

Karpenter vs Cluster Autoscaler for GPU node pools: eviction storms during spot reclaims

Running EKS with mixed GPU workloads (training + inference). We switched from Cluster Autoscaler to Karpenter 6 months ago and mostly love it, but we keep getting caught in eviction storms when AWS reclaims our g5.xlarge spot instances:

- Karpenter spins up on-demand replacements, but the scheduling delay (30-90s) means in-flight training steps get checkpointed mid-epoch
- Cluster Autoscaler was slower to scale but seemed to have more conservative disruption budgets
- We tried `consolidationPolicy: WhenEmptyOrUnderutilized` but it made things worse: Karpenter started consolidating while spot was already being reclaimed

Current setup:

- 3 node groups: on-demand (always-on inference), spot (burst training), on-demand GPU (critical jobs)
- Karpenter provisioner with 15m consolidation window

What's your approach to handling spot GPU reclaims without checkpoint thrashing? Do you use Karpenter's disruption budgets differently, or have you gone back to Cluster Autoscaler for GPU pools specifically?
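
To make the discussion concrete, here is a rough sketch of how the spot training pool described above could be expressed as a Karpenter `NodePool`. It is not the actual manifest. It assumes the `karpenter.sh/v1` API, maps the "15m consolidation window" to `consolidateAfter: 15m`, and uses hypothetical names (`gpu-spot-training`, `gpu-default`). The `budgets` entry is illustrative only, showing where a disruption budget would sit.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-spot-training          # hypothetical name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-default          # hypothetical EC2NodeClass
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g5.xlarge"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 15m          # assumed mapping of the 15m window
    budgets:
      # Illustrative: block consolidation of underutilized nodes
      # in this pool. Value chosen for the example, not tuned.
      - nodes: "0"
        reasons: ["Underutilized"]
```

As far as I understand it, budgets like this only throttle voluntary disruption (consolidation, drift, emptiness). A spot reclaim is involuntary, so the budget would not delay it, but it might keep consolidation from stacking on top of a reclaim that is already in progress.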
