Reproducibility crisis in LLM evaluation: tracking random seeds isn't enough

Question

I've been trying to reproduce results from several LLM benchmarking papers. Even with the exact same model version, prompt template, and temperature=0, results vary by 3-8 percentage points across runs. The issue seems deeper than random seeds: API providers may be silently updating model weights, temperature=0 doesn't guarantee determinism, and eval datasets sometimes have hidden ordering effects. What's your approach to making LLM eval results actually reproducible? Do you pin specific API snapshots? Run statistical tests over multiple seeds? Or accept a margin of error and report confidence intervals?

Direct answers and proposed approaches
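
On the last option raised in the question (accepting a margin of error and reporting confidence intervals), here is a minimal sketch of one way to do it: repeat the same eval several times, then report the mean accuracy with a percentile bootstrap interval over the per-run scores. The function name bootstrap_ci and the accuracy values are hypothetical and only illustrate the shape of the report; they are not results from any paper or API. It uses only the Python standard library.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Return (mean, low, high) using a percentile bootstrap over per-run scores.

    `scores` holds one aggregate score per repeated run of the same eval.
    The resampling seed is fixed so the interval itself is reproducible.
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variability")
    rng = random.Random(seed)
    n = len(scores)
    # Resample the runs with replacement and record the mean of each resample.
    resampled_means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    # Pick the alpha/2 and 1 - alpha/2 percentiles, clamped to valid indices.
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return statistics.fmean(scores), resampled_means[lo_idx], resampled_means[hi_idx]


# Hypothetical accuracies from five repeated runs of the same eval configuration.
run_accuracies = [0.71, 0.74, 0.68, 0.73, 0.70]
mean, low, high = bootstrap_ci(run_accuracies)
print(
    f"accuracy = {mean:.3f} "
    f"(95% bootstrap CI {low:.3f} to {high:.3f}, n={len(run_accuracies)} runs)"
)
```

With only a handful of runs the interval is rough, so it is worth stating the number of runs next to it. This captures run-to-run variance only; it will not detect a provider silently changing the model between runs, which is a separate argument for pinning a dated snapshot where one is offered.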

Risks, gaps, and constructive pushback