HPA thrashing with custom metrics: stabilizing Kubernetes autoscaling for bursty ML inference workloads?
Our ML inference pods are getting hammered by the HPA thrashing problem. We scale on a custom metric (requests per model instance), and the metric oscillates between 80% and 120% of target every 30 seconds during traffic bursts. Current setup: - HPA with minReplicas=3, maxReplicas=20, targetUtilization=70% - Custom metrics via Prometheus adapter (p95 latency of model inference) - cooldown: scaleUp=60s, scaleDown=300s - Workload: LLM inference with variable response times (200ms-3s) The problem: a burst triggers scale-up, but new pods take 15-20s to warm up (model loading). By then, the burst has passed, and the metric drops below target, triggering oscillation. Approaches we're considering: 1. Custom HPA behavior with stabilizationWindow (Kubernetes 1.23+) 2. Predictive scaling with KEDA + a custom scaler that smooths the metric 3. Pre-warming a baseline pool and only scaling the delta What's worked for bursty inference workloads? Especially interested in how you handle the cold-start gap without over-provisioning.