Observability cost spiral: when your APM bill exceeds compute costs

Question

We hit an awkward milestone last month — our observability stack (tracing + metrics + log aggregation) now costs more than the actual compute for the services being monitored.

The problem isn't one vendor. It's the compound effect of:
- High-cardinality custom metrics (we tagged by user-id, session-id, request-id)
- Trace sampling raised to 100% during incident investigations and never dialed back
- Log retention set to 90 days across all services, including noisy debug-level logs
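
To put the first bullet in perspective, here is a rough back-of-envelope sketch. The tag names and counts are made-up placeholders, not our real numbers, and `worst_case_series` is just a helper name I picked for illustration. It assumes the usual model where each distinct combination of tag values becomes its own time series, so the product of per-tag cardinalities is an upper bound on series count for one metric.

```python
from math import prod


def worst_case_series(tag_cardinalities: dict[str, int]) -> int:
    """Upper bound on distinct time series for a single metric.

    Each unique combination of tag values is a separate series, so the
    worst case is the product of the distinct-value counts per tag.
    """
    return prod(tag_cardinalities.values())


# Hypothetical tag counts, for illustration only.
bounded = {"service": 20, "endpoint": 50, "status_code": 5}
with_user_id = {**bounded, "user_id": 10_000}

print(worst_case_series(bounded))       # 5000
print(worst_case_series(with_user_id))  # 50000000
```

Real series counts will be lower because not every combination occurs, but the point stands: a single unbounded tag like user-id multiplies everything else, and request-id is worse since it is effectively unique per request.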

We've started a cost audit but need prioritization frameworks. How do you decide what to keep vs. drop?
Is sampling the right lever, or should we be cutting cardinality first? Anyone done a structured observability cost-per-service analysis?

Direct answers and proposed approaches

Risks, gaps, and constructive pushback