Research
Open
Asked by Jules
Question
Measuring actual GPU utilization in batch inference pipelines
Our batch inference jobs show high GPU memory usage but low compute utilization on A100s. Profiling suggests we're memory-bandwidth bound with small batch sizes, but increasing batch size hurts tail latency. What metrics actually correlate with good GPU efficiency in production?
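For context, here is a rough roofline estimate of why small batches tend to land in the bandwidth-bound regime. It is a sketch, not our profiler output. The layer shape (4096 x 4096), FP16 storage, and the A100 peak figures (312 TFLOP/s dense FP16 tensor core, 1.555 TB/s HBM bandwidth for the 40 GB part) are assumptions, and the function names are hypothetical.

```python
def arithmetic_intensity(batch, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte moved for y = x @ W, with x: (batch, d_in), W: (d_in, d_out).

    Assumes each operand is read or written exactly once (no cache reuse
    modeled), which is the usual simplification for a roofline estimate.
    """
    flops = 2 * batch * d_in * d_out  # one multiply + one add per MAC
    elems_moved = batch * d_in + d_in * d_out + batch * d_out
    return flops / (bytes_per_elem * elems_moved)


def attainable_tflops(intensity, peak_tflops, bandwidth_tb_s):
    """Roofline bound: min(compute peak, intensity * memory bandwidth)."""
    return min(peak_tflops, intensity * bandwidth_tb_s)


# Assumed hardware figures; adjust for the actual SKU and precision.
PEAK_TFLOPS = 312.0      # assumed A100 dense FP16 tensor-core peak
BANDWIDTH_TB_S = 1.555   # assumed A100 40 GB HBM bandwidth

# Intensity at which the kernel stops being bandwidth bound.
ridge = PEAK_TFLOPS / BANDWIDTH_TB_S

if __name__ == "__main__":
    print(f"ridge point: {ridge:.0f} FLOPs/byte")
    for batch in (1, 8, 64, 512, 4096):
        ai = arithmetic_intensity(batch, 4096, 4096)
        bound = attainable_tflops(ai, PEAK_TFLOPS, BANDWIDTH_TB_S)
        regime = "bandwidth bound" if ai < ridge else "compute bound"
        print(f"batch={batch:5d}  intensity={ai:8.1f}  "
              f"attainable={bound:6.1f} TFLOP/s  ({regime})")
```

Under these assumptions, the weight matrix dominates memory traffic at small batch sizes, so intensity grows roughly linearly with batch size until it crosses the ridge point. Comparing achieved FLOP/s against this bound, rather than against the raw peak, may be a more meaningful efficiency signal than the utilization percentage alone.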