How do you handle flaky integration tests without just adding retries?

Question

We have a growing suite of integration tests that hit real services (databases, message queues, third-party APIs). About 8-12% fail intermittently due to network timing, container cold starts, or transient resource contention.

Our current approach is naive retry-with-backoff, but that masks real bugs and bloats CI times. Curious how other teams handle this:
- Do you use Testcontainers with health-check gates?
- Any patterns for deterministic ordering in async integration tests?
- How do you distinguish 'flaky infrastructure' from 'actual race condition bug'?
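
For concreteness, here is a minimal sketch of what a health-check gate could look like, using only the Python standard library. It is an illustration under stated assumptions, not the Testcontainers API: `wait_until_ready`, `tcp_probe`, and `NotReadyError` are hypothetical names, and the timeout values are placeholders. The idea is that a dependency that never becomes ready raises a distinct error, so the CI report can separate "infrastructure was not up" from "the test assertion failed".

```python
import socket
import time
from typing import Callable


class NotReadyError(RuntimeError):
    """Hypothetical error type: the dependency never became ready."""


def tcp_probe(host: str, port: int, connect_timeout: float = 1.0) -> Callable[[], bool]:
    """Build a probe that succeeds once a TCP connection can be opened."""
    def probe() -> bool:
        try:
            with socket.create_connection((host, port), timeout=connect_timeout):
                return True
        except OSError:
            return False
    return probe


def wait_until_ready(
    probe: Callable[[], bool],
    timeout: float = 30.0,
    interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll `probe` until it returns True or `timeout` seconds elapse.

    Returns the seconds spent waiting, which can be logged to spot slow
    cold starts. Raises NotReadyError on timeout so the failure is not
    reported as an ordinary test failure.
    """
    start = time.monotonic()
    deadline = start + timeout
    last_error = None
    while True:
        try:
            if probe():
                return time.monotonic() - start
        except Exception as exc:  # a probe that raises counts as "not ready"
            last_error = exc
        if time.monotonic() >= deadline:
            raise NotReadyError(f"dependency not ready after {timeout}s") from last_error
        sleep(interval)


# Example use in test setup (host and port are placeholders):
#   wait_until_ready(tcp_probe("localhost", 5432), timeout=60.0)
```

An open TCP port is a weak readiness signal; a probe that runs a trivial query or fetches queue metadata is closer to what the tests actually need. Because the gate runs once in setup rather than wrapping each test, it does not hide intermittent failures inside the tests themselves.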

Looking for operational experience, not theory. What actually works in a CI pipeline that runs 200+ integration tests?

Direct answers and proposed approaches

Risks, gaps, and constructive pushback