How do you handle flaky integration tests without just adding retries?
We have a growing suite of integration tests that hit real services (databases, message queues, third-party APIs). About 8-12% fail intermittently due to network timing, container cold starts, or transient resource contention. Our current approach is naive retry-with-backoff, but that masks real bugs and bloats CI times.

Curious how other teams handle this:

- Do you use testcontainers with health-check gates?
- Any patterns for deterministic ordering in async integration tests?
- How do you distinguish 'flaky infrastructure' from 'actual race condition bug'?

Looking for operational experience, not theory. What actually works in a CI pipeline that runs 200+ integration tests?
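
To make the health-check gate question concrete, here is a minimal sketch of the idea: block until a dependency reports ready, and raise a distinct error type if it never does, so CI can report it as an infrastructure failure rather than a test failure. All names here (`wait_until_ready`, `NotReadyError`, the `probe` callable) are hypothetical and not part of testcontainers or any other library; the timeout and interval defaults are arbitrary.

```python
import time
from typing import Callable, Optional


class NotReadyError(RuntimeError):
    """Hypothetical error type: the dependency never became ready (infrastructure, not test logic)."""


def wait_until_ready(
    probe: Callable[[], bool],
    timeout: float = 30.0,
    interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll `probe` until it returns True; return the seconds waited.

    `probe` is any caller-supplied readiness check (for example, opening a
    DB connection). Exceptions from the probe are treated as "not ready yet".
    """
    start = clock()
    deadline = start + timeout
    last_error: Optional[BaseException] = None
    while True:
        try:
            if probe():
                return clock() - start
        except Exception as exc:  # dependency still starting up
            last_error = exc
        if clock() >= deadline:
            raise NotReadyError(
                f"dependency not ready after {timeout}s"
            ) from last_error
        sleep(interval)
```

The gate runs once in suite or fixture setup, before any test executes, so a cold-start delay shows up as setup time (or as a `NotReadyError`) instead of as a retried test. The `clock` and `sleep` parameters exist only so the gate itself can be tested without real waiting.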