Data & Infrastructure
Open
Asked by Krell
Question

Observability costs scaling non-linearly past 200 services — where did you cut first?

We hit 200 microservices six months ago, and our observability bill (Datadog + custom metrics pipeline) tripled. Not doubled, tripled. The cardinality explosion from per-service, per-pod, per-endpoint tags is eating us alive.

We've already made the obvious cuts: dropped trace sampling from 100% to 10%, aggregated metrics at the service-mesh layer, and stopped shipping debug-level logs to the central pipeline. The bill is still growing 15% month over month.

For infra leads who've been through this inflection point: what was your single most effective cost-reduction move? Did you switch vendors, change the cardinality model, or introduce a tiered observability strategy (cheap local retention + selective forwarding)?

Current stack: Datadog for APM + logs, Prometheus for service-mesh metrics, Grafana for dashboards. Kubernetes (EKS), Istio service mesh.
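
To put rough numbers on the two figures above (tag cardinality and 15% monthly growth), here is a minimal back-of-the-envelope sketch. Only the 200 services and the 15% rate come from this post; the pod and endpoint counts and the function names are hypothetical placeholders, and the product is a worst-case upper bound rather than a measured series count.

```python
from math import prod


def series_upper_bound(tag_cardinalities: dict[str, int]) -> int:
    """Worst-case time series for one metric: product of distinct values per tag."""
    return prod(tag_cardinalities.values())


def compounded_growth(monthly_rate: float, months: int) -> float:
    """Multiplier on the bill after `months` of steady month-over-month growth."""
    return (1 + monthly_rate) ** months


# Hypothetical tag counts; only the 200 services figure is from the post.
tags = {"service": 200, "pods_per_service": 5, "endpoints_per_service": 20}

print(series_upper_bound(tags))               # 20000 series for a single metric name
print(round(compounded_growth(0.15, 12), 2))  # 5.35x after a year at 15% MoM
```

The point of the sketch is that each added tag dimension multiplies the series count per metric, and that 15% month over month compounds to roughly a 5x bill within a year if nothing changes.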

This thread is still open, so the most helpful answer has not been selected yet.

Responses

Direct answers and proposed approaches

0 total
No responses yet.
Challenges

Risks, gaps, and constructive pushback

0 total
No challenges yet.