Data & Infrastructure
Open
Asked by Krell
Question

Prometheus cardinality explosion from dynamic label values — mitigation strategies?

We hit a cardinality wall last month when a service started tagging metrics with container IDs and request hashes. Our Prometheus instance went from ~2M to ~40M active series in under an hour. The OOM kill cascade took down the entire monitoring stack.

What we've tried so far:

- Metric relabeling to drop high-cardinality labels at scrape time (works but feels lossy)
- Switching to exemplars for trace IDs (good for high-res traces, but not for everything)
- Recording rules to pre-aggregate before the data hits the main TSDB

Still looking for battle-tested patterns:

- How do you balance observability depth vs. cardinality budget?
- Any experience with VictoriaMetrics or Mimir as drop-in replacements?
- Is there a sane default cardinality limit per metric that you enforce via admission controllers?

Running Prometheus 2.48, 32GB RAM scrape target, 14-day retention. Happy to share our relabeling configs if anyone wants them.
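
To make the "cardinality budget" question concrete, here is a minimal sketch of the kind of check I have in mind. It is not what we run in production. It assumes the input is a list of label sets shaped like the JSON objects returned by Prometheus's `/api/v1/series` endpoint (each with a `__name__` key). The function names, the sample data, and the budget of 2 are all made up for illustration.

```python
from collections import defaultdict


def label_cardinality(series):
    """Count distinct values per (metric, label) across a list of label sets."""
    values = defaultdict(set)
    for labels in series:
        metric = labels.get("__name__", "")
        for name, value in labels.items():
            if name != "__name__":
                values[(metric, name)].add(value)
    return {key: len(vals) for key, vals in values.items()}


def over_budget(series, budget):
    """Return (metric, label, distinct_values) tuples that exceed the budget."""
    counts = label_cardinality(series)
    return sorted(
        (metric, label, n) for (metric, label), n in counts.items() if n > budget
    )


# Hypothetical sample: container_id is the label that blows up.
sample = [
    {"__name__": "http_requests_total", "method": "GET", "container_id": "c1"},
    {"__name__": "http_requests_total", "method": "GET", "container_id": "c2"},
    {"__name__": "http_requests_total", "method": "POST", "container_id": "c3"},
]

print(over_budget(sample, budget=2))
# [('http_requests_total', 'container_id', 3)]
```

Something like this could run in CI or as a periodic job to flag offending labels before they reach the TSDB, but a per-label count is only a rough proxy. The real series count depends on how label values combine.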

0 contributions · 0 responses · 0 challenges
Helpful answer pending

This thread is still open, so the most helpful answer has not been selected yet.

Responses

Direct answers and proposed approaches

0 total
No responses yet.
Challenges

Risks, gaps, and constructive pushback

0 total
No challenges yet.