Debugging race conditions in async Python when aiohttp sessions leak

Question

We've been tracking down a subtle memory leak in our async worker pool that only surfaces after roughly 12 hours of continuous operation. The pattern: aiohttp.ClientSession objects aren't being garbage-collected when tasks are cancelled mid-request, and the underlying TCP connections linger in the CLOSE_WAIT state (the remote end has closed, but our process has not yet closed its socket).

Current stack: Python 3.12, aiohttp 3.11, running on the asyncio event loop with asyncio.TaskGroup. We manage session lifecycle with async context managers (async with), but cancelled tasks appear to skip __aexit__.
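To check the suspicion that cancellation skips __aexit__, a minimal probe along these lines may help. It is only a sketch: FakeSession, worker, and run_probe are hypothetical names, and FakeSession is a stdlib-only stand-in that records lifecycle events, not aiohttp.ClientSession itself.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for aiohttp.ClientSession; records lifecycle events."""

    def __init__(self, events):
        self.events = events

    async def __aenter__(self):
        self.events.append("enter")
        return self

    async def __aexit__(self, exc_type, exc, tb):
        # Record which exception (if any) unwound the async with block.
        self.events.append(f"exit:{exc_type.__name__ if exc_type else None}")
        return False  # do not swallow the cancellation


async def worker(events, started):
    async with FakeSession(events):
        started.set()
        await asyncio.sleep(3600)  # stands in for an in-flight request


async def run_probe():
    events = []
    started = asyncio.Event()
    task = asyncio.create_task(worker(events, started))
    await started.wait()  # make sure the task is inside the async with block
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return events


if __name__ == "__main__":
    print(asyncio.run(run_probe()))
```

In this sketch the exit event is recorded with CancelledError, so a plain async with block is unwound when the task is cancelled. If the real workers behave differently, the places worth checking are probably sessions created outside an async with block, or a second cancellation arriving while __aexit__ is still awaiting the close.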

Questions:
1. Has anyone instrumented aiohttp session lifecycle with tracemalloc or objgraph in production? Which approach actually catches the leak?
2. Is wrapping every request in asyncio.shield() the right pattern here, or does that just hide the problem?
3. Any experience with httpx as a drop-in replacement for this specific failure mode?
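On question 2, a small stdlib-only experiment may show what asyncio.shield() actually does when the surrounding task is cancelled. This is a sketch under assumptions: request, worker, and demo are hypothetical names, and asyncio.sleep stands in for a real HTTP call.

```python
import asyncio


async def request(log):
    await asyncio.sleep(0.05)  # stands in for an HTTP round trip
    log.append("request finished")


async def worker(log, started):
    inner = asyncio.create_task(request(log))
    started.set()
    # shield() protects `inner` from cancellation, but not this await itself.
    await asyncio.shield(inner)
    log.append("worker continued")  # not reached if the worker is cancelled


async def demo():
    log = []
    started = asyncio.Event()
    task = asyncio.create_task(worker(log, started))
    await started.wait()
    await asyncio.sleep(0)  # let the worker settle on the shielded await
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        log.append("worker cancelled")
    await asyncio.sleep(0.1)  # give the shielded request time to complete
    return log


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Here the worker still receives CancelledError, while the shielded request keeps running with nothing awaiting its result. Shielding every request would therefore change who owns the request rather than guarantee that the session is closed. Shielding only the cleanup step may be a narrower option to evaluate.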

We can reproduce with a synthetic load test, but the production environment has additional complexity (TLS termination, proxy layers) that might mask the root cause.
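One way to instrument the synthetic load test is to combine a live-instance count from gc with a tracemalloc snapshot diff. The sketch below uses only the standard library. TrackedSession, count_live, and leak_report are hypothetical names, and TrackedSession stands in for whichever type is suspected of leaking.

```python
import gc
import tracemalloc


class TrackedSession:
    """Hypothetical stand-in for the object type suspected of leaking."""

    def __init__(self):
        self.buffer = bytearray(64 * 1024)  # make each instance visibly large


def count_live(cls):
    """Count instances of exactly `cls` that the garbage collector still tracks."""
    gc.collect()  # drop anything that is only kept alive by reference cycles
    return sum(1 for obj in gc.get_objects() if type(obj) is cls)


def leak_report(workload, cls, top=3):
    """Run `workload` and report instance growth plus the largest allocation diffs."""
    tracemalloc.start(10)  # keep up to 10 frames per allocation traceback
    before_count = count_live(cls)
    before = tracemalloc.take_snapshot()
    workload()
    after_count = count_live(cls)
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # Group by source line; results are sorted with the biggest changes first.
    diffs = after.compare_to(before, "lineno")
    return after_count - before_count, diffs[:top]


if __name__ == "__main__":
    leaked = []  # simulates references that are never released
    growth, top_diffs = leak_report(
        lambda: leaked.extend(TrackedSession() for _ in range(5)),
        TrackedSession,
    )
    print("instance growth:", growth)
    for diff in top_diffs:
        print(diff)
```

The instance count answers whether objects are accumulating, and the snapshot diff points at the lines that allocated them. Neither shows what is holding the references, which is where a tool such as objgraph could complement this.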


Direct answers and proposed approaches

Risks, gaps, and constructive pushback