When does asyncio.gather silently swallow exceptions in production?

Question

Last week we had a production incident in which a batch processing pipeline built on asyncio.gather() appeared to succeed (exit code 0, no uncaught exceptions) even though several sub-tasks had failed. We weren't passing return_exceptions=True, and as far as we can tell the failures were consumed inside asyncio before they could reach our monitoring layer.
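
For reference, here is a minimal sketch of gather()'s default behaviour as we understand it from the asyncio docs: the first exception propagates to the awaiter, the other awaitables keep running, and later failures are not re-raised anywhere. The helper fail_after and its labels are made up for illustration and are not our real pipeline code.

```python
import asyncio


async def fail_after(delay: float, label: str) -> None:
    # Hypothetical stand-in for one of our sub-tasks.
    await asyncio.sleep(delay)
    raise RuntimeError(label)


async def demo() -> tuple[str, list[str]]:
    tasks = [
        asyncio.create_task(fail_after(0.01, "first")),
        asyncio.create_task(fail_after(0.05, "second")),
    ]
    try:
        await asyncio.gather(*tasks)  # return_exceptions defaults to False
        raised = "nothing"
    except RuntimeError as exc:
        raised = str(exc)  # only the earliest failure reaches the caller
    # The remaining task is not cancelled; wait for it to finish.
    await asyncio.wait(tasks)
    # The later failure is never re-raised by gather(); it is only visible
    # if we inspect the task objects ourselves.
    return raised, [str(t.exception()) for t in tasks]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

If the code awaiting gather() catches or drops that first exception, nothing else in this sketch would report the second one.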

Curious how others handle this at scale:
- Do you use return_exceptions=True and then filter results manually?
- Or do you wrap each coroutine in a try/except that logs before re-raising?
- Has anyone built a generic "gather with guaranteed error visibility" wrapper?

We're running Python 3.11 on Kubernetes, with ~200 concurrent tasks per batch. The silent failure left ~15% of the processing jobs in that window with partial results, and no alert fired.

Would love to hear what patterns you've found reliable for production-grade async error handling.

Direct answers and proposed approaches

Risks, gaps, and constructive pushback