Reproducibility crisis in LLM eval benchmarks — how much is prompt leakage?

Question

We ran a replication study on 12 widely cited LLM benchmarks (MMLU variants, GSM8K, HumanEval, etc.) and found that on 6 of them scores shift by 8-15% depending on seemingly minor prompt-formatting and decoding choices:
- Adding "Let's think step by step" vs. "Think carefully" vs. nothing
- Temperature set to 0.0 vs. 0.1
- Whether the few-shot examples use the exact spacing from the original paper

This suggests that a significant portion of the "progress" we see on leaderboards may reflect prompt engineering rather than improvements in model capability.

Key questions:
- Has anyone published a systematic study on prompt sensitivity across benchmark suites?
- Do you use locked prompt templates (frozen strings) or allow dynamic few-shot selection?
- How do you handle the temperature question for "deterministic" benchmarking?
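To make the "locked template" option concrete, here is a minimal sketch of what we mean by a frozen string: the exact prompt text, the few-shot examples, and the decoding settings are bundled into one immutable config and fingerprinted, so any change (even a trailing space) produces a different ID that can be reported next to the score. The `EvalConfig` class, its fields, and the 12-character fingerprint are our own illustrative assumptions, not part of any existing harness.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical schema: everything that can move a score without changing the model."""
    template: str            # exact prompt string, whitespace included
    few_shot: tuple          # fixed few-shot examples, in a fixed order
    temperature: float = 0.0
    max_new_tokens: int = 256

    def render(self, question: str) -> str:
        # The template must contain only the {shots} and {question} placeholders.
        return self.template.format(shots="".join(self.few_shot), question=question)

    def fingerprint(self) -> str:
        # Canonical JSON so the hash depends only on the config contents.
        payload = json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Two configs that differ only by one trailing space after "A:".
base = EvalConfig(template="{shots}Q: {question}\nA:", few_shot=("Q: 2+2?\nA: 4\n\n",))
spaced = EvalConfig(template="{shots}Q: {question}\nA: ", few_shot=base.few_shot)

print(base.fingerprint(), spaced.fingerprint())  # two different IDs
```

Putting `temperature` inside the fingerprint means a 0.0 vs. 0.1 run can never be reported under the same ID. It does not guarantee determinism by itself: as far as we know, some serving stacks are not bit-for-bit reproducible even with greedy decoding, so repeated runs per config may still be worth logging.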

We're drafting a paper and would love collaboration or prior art references. Our dataset covers 8 model families, 3 parameter scales each, 5 prompt variants per benchmark. Results are consistent enough that we think this is a systemic issue, not noise.
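For anyone who wants to compare numbers, the sketch below shows one way to summarize prompt sensitivity from a results table: take the max-minus-min score across prompt variants for each (benchmark, model) pair, then average that spread per benchmark. The row layout, the `prompt_spread` name, and the example scores are placeholders made up for illustration; they are not our data, and max-minus-min is only one possible choice of statistic.

```python
from collections import defaultdict
from statistics import mean


def prompt_spread(rows):
    """rows: iterable of (benchmark, model, variant, score) tuples.

    Returns {benchmark: mean over models of (max - min) score across variants}.
    """
    by_cell = defaultdict(list)
    for benchmark, model, _variant, score in rows:
        by_cell[(benchmark, model)].append(score)

    per_benchmark = defaultdict(list)
    for (benchmark, _model), scores in by_cell.items():
        per_benchmark[benchmark].append(max(scores) - min(scores))

    return {benchmark: mean(spreads) for benchmark, spreads in per_benchmark.items()}


# Made-up rows, only to show the expected input shape.
example_rows = [
    ("GSM8K", "model-a", "v1", 0.50),
    ("GSM8K", "model-a", "v2", 0.62),
    ("GSM8K", "model-b", "v1", 0.70),
    ("GSM8K", "model-b", "v2", 0.74),
]
spread = prompt_spread(example_rows)
print(spread)
```

Reporting the spread per model before averaging keeps a single unusually brittle model from being hidden inside a benchmark-level mean.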


Direct answers and proposed approaches

Risks, gaps, and constructive pushback