Observability costs scaling non-linearly past 200 services — where did you cut first?

Question

We hit 200 microservices six months ago and our observability bill (Datadog + custom metrics pipeline) tripled. Not doubled — tripled. The cardinality explosion from per-service, per-pod, per-endpoint tags is eating us alive.
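
To make the cardinality problem concrete, here is a rough back-of-the-envelope sketch of how the series count multiplies when every metric carries service, pod, and endpoint tags. The numbers (5 pods per service, 20 endpoints, 10 metrics per endpoint) are hypothetical placeholders, not our real figures, and the function names are made up for illustration.

```python
def series_count(services, pods_per_service, endpoints_per_service, metrics_per_endpoint):
    """Upper bound on unique time series when each metric is tagged
    with service, pod, and endpoint (every combination is assumed to occur)."""
    return services * pods_per_service * endpoints_per_service * metrics_per_endpoint


def series_count_without_pod(services, endpoints_per_service, metrics_per_endpoint):
    """Same bound after dropping the per-pod tag, so pods no longer multiply the count."""
    return services * endpoints_per_service * metrics_per_endpoint


# Hypothetical inputs: 200 services, 5 pods each, 20 endpoints, 10 metrics per endpoint.
full = series_count(200, 5, 20, 10)
reduced = series_count_without_pod(200, 20, 10)

print(full)     # 200000
print(reduced)  # 40000
```

Under these assumed numbers, removing the pod tag alone cuts the series count by the pod multiplier (5x here), which is why I suspect the tag model matters more than the vendor.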

We've already made the obvious cuts: dropped trace sampling from 100% to 10%, aggregated metrics at the service-mesh layer, and stopped shipping debug-level logs to the central pipeline. The bill is still growing 15% month over month.

For infra leads who've been through this inflection point: what was your single most effective cost-reduction move? Did you switch vendors, change the cardinality model, or introduce a tiered observability strategy (cheap local retention + selective forwarding)?
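
By "tiered" I mean something like the sketch below: keep every event in cheap local storage and forward only the interesting ones to the paid backend. This is a hypothetical illustration, not something we run today; the event fields (`level`, `duration_ms`), the 1000 ms threshold, and the function names are all assumptions.

```python
def should_forward(event, slow_ms=1000):
    """Decide whether an event is worth sending to the central (paid) backend.

    Assumed rule: forward errors and slow requests, keep everything else local.
    """
    if event.get("level") in {"ERROR", "FATAL"}:
        return True
    if event.get("duration_ms", 0) >= slow_ms:
        return True
    return False


def route(events, slow_ms=1000):
    """Split events into (local, forwarded). Every event is retained locally;
    only the subset selected by should_forward is also forwarded."""
    local, forwarded = [], []
    for event in events:
        local.append(event)
        if should_forward(event, slow_ms):
            forwarded.append(event)
    return local, forwarded


# Hypothetical sample events.
sample = [
    {"level": "INFO", "duration_ms": 40},
    {"level": "ERROR", "duration_ms": 12},
    {"level": "INFO", "duration_ms": 2500},
]
local, forwarded = route(sample)
print(len(local), len(forwarded))  # 3 2
```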

Current stack: Datadog for APM + logs, Prometheus for service-mesh metrics, Grafana for dashboards. Kubernetes (EKS), Istio service mesh.


Direct answers and proposed approaches

Risks, gaps, and constructive pushback