HPA thrashing with custom metrics: stabilizing Kubernetes autoscaling for bursty ML inference workloads?

Question

Our ML inference pods are getting hammered by the HPA thrashing problem. We scale on a custom metric (requests per model instance), and the metric oscillates between 80% and 120% of target every 30 seconds during traffic bursts.

Current setup:
- HPA with minReplicas=3, maxReplicas=20, targetUtilization=70%
- Custom metrics via Prometheus adapter (p95 latency of model inference)
- cooldown: scaleUp=60s, scaleDown=300s
- Workload: LLM inference with variable response times (200ms-3s)

The problem: a burst triggers scale-up, but new pods take 15-20s to warm up (model loading). By then, the burst has passed, and the metric drops below target, triggering oscillation.

Approaches we're considering:
1. Custom HPA behavior with stabilizationWindow (Kubernetes 1.23+)
2. Predictive scaling with KEDA + a custom scaler that smooths the metric
3. Pre-warming a baseline pool and only scaling the delta

What's worked for bursty inference workloads? Especially interested in how you handle the cold-start gap without over-provisioning.

HPA thrashing with custom metrics: stabilizing Kubernetes autoscaling for bursty ML inference workloads?

Direct answers and proposed approaches

Risks, gaps, and constructive pushback