Reproducibility crisis in eval benchmarks: are we measuring capability or prompt sensitivity?

Question

I'm running evals across multiple open-weight models and hitting a reproducibility problem that makes me question how much of a published benchmark score reflects actual capability versus prompt-specific behavior.

Specifically:
- MMLU scores vary by 3-5 points across different prompt templates for the same model
- Few-shot examples, even when 'semantically equivalent,' shift results significantly
- Temperature=0.0 doesn't eliminate variance; repeated runs on identical prompts still show some nondeterminism during decoding

Has anyone built a prompt-robustness evaluation pipeline? Something that tests the same capability across 10+ template variations and reports a distribution rather than a single score?

Also interested in whether the community is converging on any standardization for eval templates, or if everyone's still rolling their own.

Direct answers and proposed approaches
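
As a starting point for the pipeline asked about above, here is a minimal sketch that scores the same items under several prompt templates and reports a distribution instead of a single number. Everything in it is an assumption made for illustration: `TEMPLATES`, `ITEMS`, `toy_model`, and the `answer_fn(prompt) -> str` interface are hypothetical stand-ins, not part of any existing harness. A real run would plug in an actual model call, 10+ templates, and a real benchmark split.

```python
import statistics
from typing import Callable, Dict, List, Sequence

# Hypothetical template variants; a real run would use 10+ of these.
TEMPLATES: List[str] = [
    "Question: {question}\nChoices:\n{choices}\nAnswer:",
    "{question}\n{choices}\nThe correct option is",
    "Answer the following multiple-choice question.\n{question}\n{choices}\nAnswer:",
]

# Tiny made-up dataset, only here so the sketch runs end to end.
ITEMS: List[Dict] = [
    {"question": "2 + 2 = ?", "choices": ["4", "5", "6", "7"], "answer": "A"},
    {"question": "3 * 3 = ?", "choices": ["9", "6", "3", "1"], "answer": "A"},
    {"question": "10 - 4 = ?", "choices": ["5", "6", "7", "8"], "answer": "B"},
    {"question": "8 / 2 = ?", "choices": ["4", "2", "8", "16"], "answer": "A"},
]


def format_choices(choices: Sequence[str]) -> str:
    """Render choices as lettered lines (A., B., ...)."""
    letters = "ABCD"
    return "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))


def accuracy_for_template(
    answer_fn: Callable[[str], str], template: str, items: Sequence[Dict]
) -> float:
    """Accuracy of answer_fn on items when prompted with one template."""
    correct = 0
    for item in items:
        prompt = template.format(
            question=item["question"], choices=format_choices(item["choices"])
        )
        # Naive answer extraction: first character of the reply, upper-cased.
        prediction = answer_fn(prompt).strip().upper()[:1]
        correct += prediction == item["answer"]
    return correct / len(items)


def robustness_report(
    answer_fn: Callable[[str], str], templates: Sequence[str], items: Sequence[Dict]
) -> Dict:
    """Per-template accuracies plus summary statistics across templates."""
    scores = [accuracy_for_template(answer_fn, t, items) for t in templates]
    return {
        "per_template": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
        "spread": max(scores) - min(scores),
    }


def toy_model(prompt: str) -> str:
    """Stand-in for a real model: its answer depends only on the prompt ending."""
    return "A" if prompt.endswith("Answer:") else "B"


if __name__ == "__main__":
    report = robustness_report(toy_model, TEMPLATES, ITEMS)
    for key, value in report.items():
        print(f"{key}: {value}")
```

The toy model is deliberately template-sensitive so the spread is visible. With a real model, the same report structure lets you quote mean and standard deviation across templates, and a large `spread` flags results that depend heavily on prompt wording. The answer-extraction step is itself a source of variance and would need more care than the first-character heuristic used here.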

Risks, gaps, and constructive pushback