Measuring actual GPU utilization in batch inference pipelines

Question

Our batch inference jobs show high GPU memory usage but low compute utilization on A100s. Profiling suggests we're memory-bandwidth bound with small batch sizes, but increasing batch size hurts tail latency. What metrics actually correlate with good GPU efficiency in production?

Direct answers and proposed approaches
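
One way to sanity-check the "memory-bandwidth bound at small batch sizes" reading from the question is a back-of-the-envelope roofline estimate. The sketch below is only illustrative: it assumes A100 40GB datasheet peaks (312 TFLOP/s dense FP16 tensor-core throughput, 1555 GB/s HBM bandwidth), models a single FP16 linear layer, and ignores caches, kernel launch overhead, and operator fusion. The function names and the 4096 x 4096 layer shape are hypothetical, chosen for illustration.

```python
# Roofline sketch: is a batched linear layer compute-bound or bandwidth-bound?
# ASSUMPTION: A100 40GB datasheet peaks; substitute the figures for your SKU.
PEAK_FLOPS = 312e12        # dense FP16 tensor-core FLOP/s (assumed peak)
PEAK_BYTES_PER_S = 1555e9  # HBM bandwidth in bytes/s (assumed peak)
BYTES_PER_ELEM = 2         # FP16

# Arithmetic intensity at which the two roofs meet (FLOP per byte).
RIDGE = PEAK_FLOPS / PEAK_BYTES_PER_S


def arithmetic_intensity(batch, d_in, d_out, bytes_per_elem=BYTES_PER_ELEM):
    """FLOPs per byte of memory traffic for y = x @ W, with no cache reuse."""
    flops = 2 * batch * d_in * d_out  # one multiply and one add per MAC
    # Read x (batch x d_in) and W (d_in x d_out), write y (batch x d_out).
    elems = batch * d_in + d_in * d_out + batch * d_out
    return flops / (bytes_per_elem * elems)


def attainable_flops(intensity):
    """Upper bound on FLOP/s under the simple two-roof model."""
    return min(PEAK_FLOPS, intensity * PEAK_BYTES_PER_S)


if __name__ == "__main__":
    for batch in (1, 8, 64, 512):
        ai = arithmetic_intensity(batch, 4096, 4096)
        bound = "compute" if ai >= RIDGE else "bandwidth"
        share = attainable_flops(ai) / PEAK_FLOPS
        print(f"batch={batch:4d}  AI={ai:7.1f} FLOP/B  "
              f"{bound}-bound  ceiling={share:.1%} of peak")
```

Under these assumed peaks the ridge point is roughly 200 FLOP/byte, while the modeled layer at batch size 1 sits near 1 FLOP/byte, so weight reads dominate and the compute units mostly wait on memory. If the model roughly holds for your workload, achieved throughput relative to this ceiling is likely a more informative efficiency signal than a raw utilization percentage.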

Risks, gaps, and constructive pushback