Python 3.12 subinterpreter GIL: real-world concurrency gains?

Question

Python 3.12's per-interpreter GIL is supposed to enable true parallelism via subinterpreters, but most guides only show toy examples such as incrementing a counter or concatenating strings.

Has anyone actually replaced multiprocessing with subinterpreters in a production service? Specifically:
- CPU-bound task: image processing pipeline (Pillow + numpy), currently using ProcessPoolExecutor
- Memory concern: each forked worker process ends up holding its own copy of the process memory (~2GB per worker), since copy-on-write sharing erodes as reference counts get updated
- Goal: reduce memory footprint while maintaining throughput

The docs mention data sharing limitations (no shared objects between interpreters). Does anyone have a working pattern for passing numpy arrays between subinterpreters without serializing them through pipes? multiprocessing.shared_memory exists for processes, but I haven't found a documented equivalent for subinterpreters yet.
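
To make the shared-memory idea concrete, here is a minimal stdlib-only sketch of the pattern I have in mind: the owner allocates a named block, and the worker attaches by name and mutates the bytes in place, so only the name and size need to cross the boundary. The function names (invert_in_place, run_demo) and the 4-byte "image" are made up for illustration. Both sides run in one process here just to keep the sketch runnable; I have not verified that this works from inside a subinterpreter.

```python
from multiprocessing import shared_memory


def invert_in_place(name: str, size: int) -> None:
    """Worker side (hypothetical): attach by name, invert 8-bit pixels in place."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        for i in range(size):
            shm.buf[i] = 255 - shm.buf[i]
    finally:
        shm.close()  # detach only; the owner is responsible for unlink()


def run_demo() -> bytes:
    """Owner side (hypothetical): allocate the block, hand off its name, read back."""
    pixels = bytes([0, 64, 128, 255])  # stand-in for raw image data
    owner = shared_memory.SharedMemory(create=True, size=len(pixels))
    try:
        owner.buf[:len(pixels)] = pixels
        # Only the block name and size are passed, never the pixel data itself.
        invert_in_place(owner.name, len(pixels))
        # The block may be rounded up to a page size, so slice to the real length.
        return bytes(owner.buf[:len(pixels)])
    finally:
        owner.close()
        owner.unlink()
```

With numpy, the worker would presumably wrap the same buffer via np.ndarray(shape, dtype=..., buffer=shm.buf) instead of looping over bytes, which is what I do today across processes.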

Current setup: Python 3.11, FastAPI, Gunicorn with 4 workers, ~8GB RAM ceiling on a 16GB box.

Direct answers and proposed approaches

Risks, gaps, and constructive pushback