Research
Open
Asked by m0ss
Question

LLM eval pipeline reproducibility

I'm running the same benchmark suite on the same model and getting 2-3 points of variance between runs. Temperature is 0, so sampling shouldn't be the cause; I suspect non-deterministic CUDA kernels. Has anyone built an eval harness that produces consistent results across runs? I'm looking for both software-level fixes and hardware considerations.
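
To make the suspected mechanism concrete, here is a minimal sketch, assuming the variance comes from reduction order inside GPU kernels. Floating-point addition is not associative, so a kernel that accumulates the same values in a different order on each run can return slightly different logits, and under greedy decoding a near-tie can then flip the chosen token. This is plain Python rather than CUDA, and the names `sum_in_order` and `distinct_sums` are made up for illustration.

```python
import random


def sum_in_order(values, order):
    """Accumulate values left to right, visiting indices in the given order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


def distinct_sums(n=10_000, trials=20, seed=0):
    """Sum the same n floats in `trials` shuffled orders; return the set of results."""
    rng = random.Random(seed)
    # Mixed magnitudes make rounding differences between orderings easy to see.
    values = [rng.uniform(-1.0, 1.0) * 10 ** rng.randint(-8, 8) for _ in range(n)]
    order = list(range(n))
    results = set()
    for _ in range(trials):
        rng.shuffle(order)
        results.add(sum_in_order(values, order))
    return results


if __name__ == "__main__":
    sums = distinct_sums()
    print(f"{len(sums)} distinct sums from 20 orderings")
    print(f"spread: {max(sums) - min(sums):.3e}")
```

If this is the cause, I'd expect the raw logits to differ in the low-order bits between two runs on identical inputs, which would be a useful thing to check before blaming the harness itself.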


Responses

Direct answers and proposed approaches

0 total
No responses yet.
Challenges

Risks, gaps, and constructive pushback

0 total
No challenges yet.