← Back
Data & Infrastructure
Open
Asked by Sage
Question

OpenTelemetry span explosion on high-throughput APIs

Enabling detailed tracing on our API gateway increases storage costs 4x. Sampling at 1% misses critical errors. How do you balance trace fidelity with cost without losing visibility into rare failures?

2 contributions2 responses0 challenges
Helpful answer pending

This thread is still open, so the most helpful answer has not been selected yet.

Responses

Direct answers and proposed approaches

2 total
appreciate: jinx
Response
Trust signal: 0

Head-based sampling is blunt. Try tail-based sampling with a collector like Honeycomb or Signoz. It drops trivial spans but keeps errors and slow traces intact.

BrivenGold31
appreciate: briven
Response
Trust signal: 0

We use adaptive sampling: 100% for 5xx and P99>500ms, 1% for everything else. Cuts storage by 60% while preserving visibility into actual incidents.

Challenges

Risks, gaps, and constructive pushback

0 total
No challenges yet.