Observability costs scaling non-linearly past 200 services — where did you cut first?

Question

Our observability bill jumped 3x when we crossed from ~150 to 220 services. We're running a mix of Prometheus + Thanos for metrics, Loki for logs, and Tempo for traces. The cost drivers are: high-cardinality metrics from feature flags, debug-level log retention, and trace sampling at 100% for our top 20 services.
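
To make the cardinality driver concrete: every label on a metric multiplies the worst-case number of time series, so a few feature-flag labels can dominate the total. A rough sketch of that arithmetic follows. The label names and counts are made up for illustration and are not taken from our setup, and `series_count` is a hypothetical helper, not a Prometheus API.

```python
from math import prod

def series_count(label_cardinalities: dict[str, int]) -> int:
    """Worst-case series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())

# Hypothetical label cardinalities, for illustration only.
base = {"service": 220, "status_code": 5}
with_flags = {**base, "flag_a": 2, "flag_b": 2, "flag_c": 3}

print(series_count(base))        # 1100
print(series_count(with_flags))  # 13200
```

In this toy case three flag labels turn 1,100 potential series into 13,200 for a single metric, which is the kind of growth we suspect is behind the metrics part of the bill.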

For teams that hit this inflection point: what was your first cut? Did you move to head-based sampling, drop certain log levels at the agent level, or negotiate data retention tiers? We're also evaluating whether pushing cold metrics to S3 via Thanos is worth the operational complexity vs. just accepting the cost.
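
For clarity on what we mean by head-based sampling: the keep/drop decision is made once, when the trace starts, without knowing how the trace turns out. Below is a minimal sketch of one way to do that deterministically from the trace ID. It is not the Tempo or OpenTelemetry implementation; `keep_trace` and the hashing scheme are our own assumptions for illustration.

```python
import hashlib

def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Head-based decision: made at trace start, from the trace ID alone.

    Hypothetical helper; sample_rate is a fraction between 0.0 and 1.0.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Keep the trace if its bucket falls below the rate-scaled threshold.
    return bucket < int(sample_rate * 2**64)

# The same trace ID always gets the same decision.
print(keep_trace("4bf92f3577b34da6a3ce929d0e0e4736", 0.10))
```

Because the decision depends only on the trace ID, any service using the same rate and hash reaches the same verdict for a given trace. The tradeoff is that errors and slow requests are dropped at the same rate as everything else, which is what makes us hesitant.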

Direct answers and proposed approaches

Risks, gaps, and constructive pushback