Reproducibility crisis in LLM eval benchmarks — how much is prompt leakage?
We ran a replication study on 12 widely-cited LLM benchmarks (MMLU variants, GSM8K, HumanEval, etc.) and found that 6 of them show score variance of 8-15% depending on seemingly minor prompt formatting choices:

- Adding "Let's think step by step" vs. "Think carefully" vs. nothing
- Temperature set to 0.0 vs. 0.1
- Whether the few-shot examples use the exact spacing from the original paper

This suggests a significant portion of the "progress" we see on leaderboards may be prompt engineering rather than model capability improvements.

Key questions:

- Has anyone published a systematic study on prompt sensitivity across benchmark suites?
- Do you use locked prompt templates (frozen strings) or allow dynamic few-shot selection?
- How do you handle the temperature question for "deterministic" benchmarking?

We're drafting a paper and would love collaboration or prior art references. Our dataset covers 8 model families, 3 parameter scales each, 5 prompt variants per benchmark. Results are consistent enough that we think this is a systemic issue, not noise.
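
To make the "locked prompt templates" question concrete, here is a minimal Python sketch of what we mean: templates stored as frozen strings, fingerprinted by hash so that a whitespace change is detectable, plus a max-minus-min spread across prompt variants. Everything here is an assumption for illustration: the template strings, the names `PROMPT_VARIANTS`, `template_fingerprint`, and `score_spread`, and the scores, which are placeholder numbers and not results from our study.

```python
import hashlib

# Hypothetical frozen templates. The exact strings are illustrative only
# and are not taken from any benchmark paper.
PROMPT_VARIANTS = {
    "plain": "Question: {question}\nAnswer:",
    "step_by_step": "Question: {question}\nLet's think step by step.\nAnswer:",
    "think_carefully": "Question: {question}\nThink carefully.\nAnswer:",
}


def template_fingerprint(template: str) -> str:
    """Return a short SHA-256 digest of the exact template bytes.

    Any change to the template, including a single extra space,
    produces a different fingerprint.
    """
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


def score_spread(scores_by_variant: dict[str, float]) -> float:
    """Return max minus min score across prompt variants (same units as input)."""
    values = list(scores_by_variant.values())
    return max(values) - min(values)


# Placeholder accuracies in percent, invented for this example only.
example_scores = {"plain": 61.0, "step_by_step": 70.5, "think_carefully": 66.0}

if __name__ == "__main__":
    # Log the fingerprint next to each result so runs can be compared later.
    for name, template in PROMPT_VARIANTS.items():
        print(name, template_fingerprint(template))
    print("spread:", score_spread(example_scores))
```

Logging the fingerprint alongside every reported score would let a reader check that two runs used byte-identical prompts. Whether max-minus-min is the right spread statistic is an open choice; standard deviation across variants is an equally reasonable alternative.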