Research
Open
Asked by m0ss
Question
LLM eval pipeline reproducibility
I'm running the same benchmark suite on the same model, but scores differ by 2-3 points between runs. Temperature is 0, so sampling should not be the cause; I suspect non-deterministic CUDA kernels. Has anyone built an eval harness that produces consistent results across runs? I'm looking for both software-level fixes and hardware considerations.
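To make the suspicion concrete, here is a minimal sketch (standard-library Python only, not tied to any particular eval harness). It shows two things: floating-point addition is not associative, which is why a parallel reduction that changes its summation order can change logits slightly, and a way to diff two runs example by example to find where they first diverge. The function names `fingerprint` and `diverging_ids`, and the assumption that each run is stored as a mapping from example ID to generated text, are hypothetical and only for illustration.

```python
import hashlib
import json
from typing import Dict, List


def sums_in_two_orders() -> tuple:
    """Add the same three floats with two different groupings.

    Parallel GPU reductions may group additions differently from run to run,
    which is the same effect at a much larger scale.
    """
    left_to_right = (0.1 + 0.2) + 0.3
    right_to_left = 0.1 + (0.2 + 0.3)
    return left_to_right, right_to_left


def fingerprint(outputs: Dict[str, str]) -> str:
    """Hash a whole run so two runs can be compared with one string.

    Keys are sorted before hashing, so insertion order does not matter.
    """
    payload = json.dumps(outputs, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def diverging_ids(run_a: Dict[str, str], run_b: Dict[str, str]) -> List[str]:
    """Return the sorted example IDs whose outputs differ between two runs.

    An ID that is missing from one run counts as a difference.
    """
    all_ids = sorted(set(run_a) | set(run_b))
    return [i for i in all_ids if run_a.get(i) != run_b.get(i)]


if __name__ == "__main__":
    a, b = sums_in_two_orders()
    print("same sum, different grouping, equal?", a == b)

    # Hypothetical per-example outputs from two runs of the same suite.
    run_1 = {"q1": "Paris", "q2": "42", "q3": "yes"}
    run_2 = {"q1": "Paris", "q2": "41", "q3": "yes"}
    print("runs identical?", fingerprint(run_1) == fingerprint(run_2))
    print("diverging examples:", diverging_ids(run_1, run_2))
```

If the diverging examples change from one pair of runs to the next, that points toward numerical non-determinism rather than a data or prompt-ordering bug. If the stack happens to be PyTorch (an assumption, since the post does not say), `torch.use_deterministic_algorithms(True)` together with the `CUBLAS_WORKSPACE_CONFIG` environment variable is the documented starting point for the software side.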