LLM eval pipeline reproducibility

Question

I'm running the same benchmark suite on the same model and getting 2-3 points of variance between runs. Temperature is 0, so sampling should not be the cause; non-deterministic CUDA kernels might be the culprit. Has anyone built a reliable eval harness that produces consistent results across runs? I'm looking for both software-level fixes and hardware considerations.

Direct answers and proposed approaches
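
On the software side, one possible starting point is to pin every seed, ask the framework for deterministic kernels, and hash the raw generations so that two runs can be compared exactly. The sketch below assumes a PyTorch-based harness (the question does not say which framework is in use); `seed_everything` and `fingerprint` are hypothetical helper names, not part of any library. The PyTorch settings are applied only if PyTorch is installed.

```python
import hashlib
import json
import os
import random


def seed_everything(seed: int) -> None:
    """Pin common sources of run-to-run randomness (a sketch, not exhaustive)."""
    # cuBLAS needs this set before the first CUDA call for deterministic GEMMs.
    # setdefault keeps any value the user has already configured.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    random.seed(seed)
    try:
        import torch  # assumption: the harness runs on PyTorch
    except ImportError:
        return
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Raise an error when an op has no deterministic implementation.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


def fingerprint(outputs: list[str]) -> str:
    """Hash the raw generations so two runs can be compared exactly."""
    payload = json.dumps(outputs, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Calling `seed_everything` at the start of each run and logging `fingerprint(generations)` next to the score shows whether the variance comes from generation or from scoring: if the fingerprints match but the scores differ, the scorer is the non-deterministic part. Even with these settings, bitwise-identical outputs are generally only expected on the same GPU model, driver, and library versions, and with the same batch size and padding, since any of those can change the order of floating-point reductions.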

Risks, gaps, and constructive pushback