Asked by milo
Question

Reproducibility crisis in LLM eval benchmarks — how much of MMLU variance is prompt-order noise?

We ran the same model (Llama-3-70B-Instruct) through lm-eval-harness 5 times with identical config. MMLU scores varied between 68.2 and 69.7, a 1.5-point swing with zero code changes, just different batch scheduling and random seed initialization for the shuffling.

Digging deeper, the variance is heavily concentrated in humanities categories (philosophy, history, law), where the model's output is more sensitive to the order of the multiple-choice options. STEM categories are stable within ±0.3 points.

Questions for the community:

1. What's your standard practice for reporting eval scores? Single run, average of N, or full distribution?
2. Has anyone implemented option-order randomization as a standard part of the eval pipeline? It doubles compute cost but gives you error bars.
3. The OpenCompass paper suggests temperature=0 doesn't eliminate this variance. Has anyone verified that with greedy decoding (temperature=0, top_p=1.0, do_sample=False)?

This matters because we're seeing vendors claim 0.5-point improvements that are well within our observed noise floor.
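To make questions 1 and 2 concrete, here is a minimal Python sketch of what option-order randomization and a per-run summary could look like. It is not lm-eval-harness code: `permute_options`, `format_prompt`, and `summarize` are hypothetical helper names, the prompt layout is an assumed generic MMLU-style format, and the example question is made up. The only real numbers used are the two extremes reported above (68.2 and 69.7).

```python
import random
import statistics

LETTERS = "ABCD"  # assumes four-option, MMLU-style items


def permute_options(choices, answer_idx, rng):
    """Shuffle the answer options and track where the gold answer lands.

    Returns (shuffled_choices, new_answer_idx).
    """
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    # order[k] is the original index now shown at position k
    return shuffled, order.index(answer_idx)


def format_prompt(question, choices):
    """Render one item in an assumed 'A. ... / Answer:' layout (hypothetical)."""
    lines = [question]
    lines += [f"{LETTERS[i]}. {c}" for i, c in enumerate(choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def summarize(scores):
    """Report the spread of per-run scores instead of a single number."""
    n = len(scores)
    sd = statistics.stdev(scores) if n > 1 else 0.0
    return {
        "n": n,
        "mean": statistics.mean(scores),
        "stdev": sd,
        "sem": sd / n ** 0.5,  # standard error of the mean
        "min": min(scores),
        "max": max(scores),
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the permutation itself is reproducible
    question = "Which philosopher wrote 'A Treatise of Human Nature'?"  # made-up item
    choices = ["Kant", "Hume", "Locke", "Mill"]
    shuffled, gold = permute_options(choices, 1, rng)
    print(format_prompt(question, shuffled))
    print("gold:", LETTERS[gold])
    # Only the two extremes from the post; a real report would include all runs.
    print(summarize([68.2, 69.7]))
```

Seeding a dedicated `random.Random` per permutation keeps the option order itself reproducible, so any remaining run-to-run variance can be attributed to the inference stack rather than to the shuffling.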

