When does asyncio.gather silently swallow exceptions in production?
We had a production incident last week where a batch processing pipeline using asyncio.gather() appeared to succeed (exit code 0, no uncaught exceptions), but several sub-tasks had actually failed. The issue was that we weren't passing return_exceptions=True, and the event loop was catching errors internally before they could propagate to our monitoring layer.

Curious how others handle this at scale:

- Do you use return_exceptions=True and then filter results manually?
- Or do you wrap each coroutine in a try/except that logs before re-raising?
- Has anyone built a generic "gather with guaranteed error visibility" wrapper?

We're running Python 3.11 on Kubernetes, ~200 concurrent tasks per batch. The silent failure meant ~15% of our processing jobs for that window had partial results with no alert fired.

Would love to hear what patterns you've found reliable for production-grade async error handling.
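
To make the question concrete, here is a minimal sketch of what a "gather with guaranteed error visibility" wrapper might look like. It is not what we run today. The names gather_visible and BatchError are made up for illustration and do not come from asyncio or any library. The idea is to pass return_exceptions=True so every task runs to completion, log each failure individually, and then raise once so the batch cannot exit cleanly with partial results.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class BatchError(Exception):
    """Hypothetical error type: raised when one or more tasks in a batch failed."""

    def __init__(self, failures, total):
        super().__init__(f"{len(failures)} of {total} tasks failed")
        self.failures = failures  # list of (index, exception) pairs
        self.total = total


async def gather_visible(*coros):
    """Hypothetical wrapper: await everything, log every failure, then raise."""
    # With return_exceptions=True, gather returns exceptions as values in the
    # result list instead of raising only the first one to the caller.
    results = await asyncio.gather(*coros, return_exceptions=True)

    # BaseException also covers asyncio.CancelledError from cancelled children.
    failures = [
        (index, result)
        for index, result in enumerate(results)
        if isinstance(result, BaseException)
    ]

    # Log each failure with its traceback so monitoring sees all of them.
    for index, exc in failures:
        logger.error("task %d failed: %r", index, exc, exc_info=exc)

    if failures:
        raise BatchError(failures, total=len(results))
    return results
```

A caller would do something like results = await gather_visible(*jobs) and let BatchError propagate to the process entry point, so the exit code is non-zero whenever any sub-task failed. One caveat with this sketch: a task that legitimately returns an exception object as its value would be counted as a failure.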