Reproducibility crisis in LLM eval benchmarks — how much of MMLU variance is prompt-order noise?

Question

We ran the same model (Llama-3-70B-Instruct) through lm-eval-harness 5 times with an identical config. MMLU scores ranged from 68.2 to 69.7, a 1.5-point swing with zero code changes; the only differences between runs were batch scheduling and the random seed used for shuffling.

Digging deeper, the variance is heavily concentrated in humanities categories (philosophy, history, law) where the model's output is more sensitive to the order of multiple-choice options. STEM categories are stable within ±0.3 points.
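
For anyone who wants to check the same breakdown on their own runs, here is a rough sketch of how the per-group spread can be computed. It assumes you have already collected one {subject: score} dict per run; the `group_of` mapping and the function name are hypothetical, and the group score is an unweighted mean over subjects, which may differ from how your harness aggregates.

```python
from collections import defaultdict


def per_group_half_range(runs, group_of):
    """Half of (max - min) of each group's mean score across runs.

    runs:     list of {subject: score} dicts, one dict per eval run.
    group_of: {subject: group_name} mapping (hypothetical; supply your own).
    """
    group_means = defaultdict(list)
    for run in runs:
        # Collect this run's subject scores by group.
        buckets = defaultdict(list)
        for subject, score in run.items():
            buckets[group_of[subject]].append(score)
        # Unweighted (macro) mean per group for this run.
        for group, scores in buckets.items():
            group_means[group].append(sum(scores) / len(scores))
    # A half-range of 0.3 corresponds to "stable within +/-0.3 points".
    return {g: (max(v) - min(v)) / 2 for g, v in group_means.items()}
```

The half-range is only a crude summary; with 5 runs it is sensitive to a single outlier run.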

Questions for the community:
1. What's your standard practice for reporting eval scores? Single run, average of N, or full distribution?
2. Has anyone implemented option-order randomization as a standard part of the eval pipeline? It at least doubles compute cost (one extra full pass per additional ordering) but gives you error bars.
3. The OpenCompass paper suggests temperature=0 doesn't eliminate this variance. Has anyone verified that with greedy decoding (temperature=0, top_p=1.0, do_sample=False)?
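
To make question 2 concrete, this is roughly what I mean by option-order randomization. It is a minimal sketch, not lm-eval-harness code: `predict` is a hypothetical callable that wraps the model and returns a letter, the prompt format is an assumption, and items are assumed to be (question, choices, answer_index) tuples with four choices.

```python
import random
import statistics

LETTERS = "ABCD"


def permute_options(choices, answer_idx, rng):
    """Shuffle the choices and return (new_choices, new_answer_idx)."""
    order = list(range(len(choices)))
    rng.shuffle(order)
    new_choices = [choices[i] for i in order]
    # The gold answer moves to wherever its original index landed.
    new_answer_idx = order.index(answer_idx)
    return new_choices, new_answer_idx


def format_prompt(question, choices):
    """Assumed zero-shot prompt layout; adapt to your harness template."""
    lines = [question]
    lines += [f"{LETTERS[i]}. {c}" for i, c in enumerate(choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def accuracy_over_orderings(items, predict, n_orderings=4, seed=0):
    """Score the same items under several option orderings.

    predict(prompt) -> letter is a hypothetical wrapper around the model.
    Returns (mean accuracy in %, sample stdev across orderings, per-ordering list).
    """
    per_ordering = []
    for k in range(n_orderings):
        rng = random.Random(seed + k)  # one reproducible shuffle stream per ordering
        correct = 0
        for question, choices, answer_idx in items:
            shuffled, gold = permute_options(choices, answer_idx, rng)
            pred = predict(format_prompt(question, shuffled))
            correct += int(pred == LETTERS[gold])
        per_ordering.append(100.0 * correct / len(items))
    spread = statistics.stdev(per_ordering) if len(per_ordering) > 1 else 0.0
    return statistics.mean(per_ordering), spread, per_ordering
```

Cost scales linearly with `n_orderings`, since every ordering is a full pass over the benchmark. The spread across orderings is the error bar I have in mind.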

This matters because we're seeing vendors claim 0.5-point improvements, which is well within the 1.5-point run-to-run spread we observed.
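
As a rough sanity check for claims like this, something along these lines is what we have been using informally. It is a heuristic sketch under stated assumptions, not a formal significance test: it treats repeated runs as independent draws, assumes the competing score has the same run-to-run noise, and the threshold `k=2.0` and both function names are my own choices.

```python
import math
import statistics


def noise_summary(scores):
    """Mean, sample stdev, and range of repeated eval scores (needs >= 2 runs)."""
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "range": max(scores) - min(scores),
    }


def delta_exceeds_noise(delta, scores, k=2.0):
    """True if a claimed single-run improvement clears k noise standard deviations.

    The difference of two single-run scores with the same per-run stdev s
    has stdev sqrt(2) * s, assuming the runs are independent.
    """
    sd_diff = math.sqrt(2) * statistics.stdev(scores)
    return abs(delta) > k * sd_diff
```

Pass in the per-run MMLU scores from your own repeated runs; with only 5 runs the stdev estimate is itself quite noisy, so treat a borderline result as inconclusive.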

Direct answers and proposed approaches

Risks, gaps, and constructive pushback