Observability cost spiral: when your APM bill exceeds compute costs
We hit an awkward milestone last month: our observability stack (tracing + metrics + log aggregation) now costs more than the actual compute for the services being monitored. The problem isn't one vendor. It's the compound effect of:

- High-cardinality custom metrics (we tagged by user-id, session-id, request-id)
- Trace sampling raised to 100% during incident investigations and never dialed back
- Log retention set to 90 days across all services, including noisy debug-level logs

We've started a cost audit but need a prioritization framework. How do you decide what to keep vs. drop? Is sampling the right lever, or should we be cutting cardinality first? Has anyone done a structured observability cost-per-service analysis?
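
To make the question concrete, here is one possible shape for a cost-per-service analysis, as a rough sketch rather than a finished tool. Everything in it is an assumption for illustration: the unit prices, the `ServiceUsage` fields, and the service names are hypothetical, and real vendor pricing rarely reduces to three flat rates. It shows two things: how per-label cardinality multiplies into a series count, and how services could be ranked by observability spend relative to compute spend.

```python
from dataclasses import dataclass
from math import prod

# Hypothetical flat unit prices; substitute the rates from your own invoices.
PRICE_PER_ACTIVE_SERIES = 0.05    # per metric time series, per month
PRICE_PER_MILLION_SPANS = 2.00    # per million ingested trace spans
PRICE_PER_STORED_LOG_GB = 0.50    # per GB of logs held, per month


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric.

    Each label multiplies the series count by its number of distinct
    values, which is why per-user or per-request labels are so expensive.
    """
    return prod(label_cardinalities.values())


@dataclass
class ServiceUsage:
    name: str
    active_series: int       # metric time series emitted by the service
    spans_per_month: int     # trace spans ingested after sampling
    log_gb_per_day: float    # log volume ingested per day
    retention_days: int      # how long logs are kept
    compute_cost: float      # monthly compute bill for the service


def observability_cost(u: ServiceUsage) -> dict[str, float]:
    """Monthly observability cost for one service, split by signal type."""
    # At steady state, stored volume is roughly daily volume times retention.
    stored_log_gb = u.log_gb_per_day * u.retention_days
    return {
        "metrics": u.active_series * PRICE_PER_ACTIVE_SERIES,
        "traces": u.spans_per_month / 1_000_000 * PRICE_PER_MILLION_SPANS,
        "logs": stored_log_gb * PRICE_PER_STORED_LOG_GB,
    }


def rank_by_ratio(services: list[ServiceUsage]) -> list[tuple[str, float, str]]:
    """Return (name, observability/compute ratio, dominant signal), worst first."""
    rows = []
    for u in services:
        costs = observability_cost(u)
        ratio = sum(costs.values()) / u.compute_cost
        dominant = max(costs, key=costs.get)
        rows.append((u.name, ratio, dominant))
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # 20 endpoints x 5 status codes is 100 series; adding a user_id label
    # with 10,000 distinct values turns that into 1,000,000.
    print(series_count({"endpoint": 20, "status": 5}))
    print(series_count({"endpoint": 20, "status": 5, "user_id": 10_000}))

    fleet = [
        ServiceUsage("checkout", 1_000_000, 10_000_000, 1.0, 90, 1000.0),
        ServiceUsage("batch-export", 100, 1_000_000, 10.0, 90, 5000.0),
    ]
    for name, ratio, dominant in rank_by_ratio(fleet):
        print(f"{name}: ratio={ratio:.2f}, dominant signal={dominant}")
```

The dominant-signal column is the part that speaks to the sampling-vs-cardinality question: a service dominated by "metrics" points at label cleanup, one dominated by "traces" points at sampling, and one dominated by "logs" points at retention or log level.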