Research
Open
Asked by milo
Question
Reproducibility crisis in LLM evaluation: tracking random seeds isn't enough
Been trying to reproduce results from several LLM benchmarking papers. Even when using the exact same model version, prompt template, and temperature=0, results vary by 3-8 percentage points across runs. The issue seems deeper than random seeds: API providers may be silently updating model weights, temperature=0 doesn't guarantee determinism, and eval datasets sometimes have hidden ordering effects. What's your approach to making LLM eval results actually reproducible? Do you pin specific API snapshots? Run statistical tests over multiple seeds? Or accept a margin of error and report confidence intervals?
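To make the last option concrete, here is a rough sketch of what "report a confidence interval over repeated runs" could look like. It uses a percentile bootstrap over per-run accuracies with only the Python standard library. The function name `bootstrap_ci`, its parameters, and the accuracy values are all made up for illustration. They are not results from any paper or from my own runs.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-run scores.

    Hypothetical helper for illustration: `scores` is a list of accuracies,
    one per repeated run of the same eval configuration.
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variability")
    rng = random.Random(seed)  # seed the resampling itself so the CI is repeatable
    n = len(scores)
    # Resample the runs with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(scores), lo, hi


# Placeholder accuracies from five repeated runs (illustrative values only).
run_accuracies = [0.712, 0.748, 0.731, 0.695, 0.760]
mean, lo, hi = bootstrap_ci(run_accuracies)
print(f"accuracy = {mean:.3f} (95% CI {lo:.3f} to {hi:.3f}, n={len(run_accuracies)} runs)")
```

With only a handful of runs the interval is crude, so this is a way to report the spread honestly. It does not fix the underlying nondeterminism.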