Reproducibility crisis in eval benchmarks: are we measuring capability or prompt sensitivity?
I'm running evals across multiple open-weight models and hitting a reproducibility problem that makes me question how much published benchmark scores reflect actual capability vs. prompt-specific behavior.

Specifically:

- MMLU scores vary by 3-5 points across different prompt templates for the same model
- Few-shot examples, even when 'semantically equivalent,' shift results significantly
- Temperature=0.0 doesn't eliminate variance; there's still nondeterminism in the decoding layer

Has anyone built a prompt-robustness evaluation pipeline? Something that tests the same capability across 10+ template variations and reports a distribution rather than a single score?

I'm also interested in whether the community is converging on any standardization for eval templates, or if everyone's still rolling their own.
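
To make the ask concrete, here's a rough sketch of the shape I have in mind: score the same items under each template, then report the distribution instead of one number. Everything named here is hypothetical and only for illustration: `toy_model` is a deterministic stand-in for a real inference call, and the three templates and two items are made up (a real run would use 10+ templates and the actual benchmark items).

```python
import statistics
from typing import Callable, Dict, List, Sequence

# Hypothetical item format: question text, answer choices, gold letter.
Item = Dict[str, object]


def format_choices(choices: Sequence[str]) -> str:
    """Render choices as lettered lines: 'A. ...', 'B. ...', and so on."""
    return "\n".join(f"{chr(ord('A') + i)}. {c}" for i, c in enumerate(choices))


def accuracy_for_template(
    model_fn: Callable[[str], str], template: str, items: List[Item]
) -> float:
    """Fraction of items where the first letter of the model output matches gold."""
    correct = 0
    for item in items:
        prompt = template.format(
            question=item["question"], choices=format_choices(item["choices"])
        )
        prediction = model_fn(prompt).strip().upper()[:1]
        correct += int(prediction == item["answer"])
    return correct / len(items)


def robustness_report(
    model_fn: Callable[[str], str], templates: List[str], items: List[Item]
) -> Dict[str, object]:
    """Score every template and summarize the spread across templates."""
    scores = [accuracy_for_template(model_fn, t, items) for t in templates]
    return {
        "per_template": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
        "spread": max(scores) - min(scores),
    }


# --- Toy data below is invented purely to exercise the functions above. ---
ITEMS: List[Item] = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": "B"},
    {"question": "Capital of France?",
     "choices": ["Paris", "Rome", "Madrid", "Berlin"], "answer": "A"},
]
GOLD = {item["question"]: item["answer"] for item in ITEMS}

TEMPLATES = [
    "{question}\n{choices}\nAnswer:",
    "Question: {question}\nOptions:\n{choices}\nAnswer:",
    "{question}\n{choices}\nThe correct option is",
]


def toy_model(prompt: str) -> str:
    """Stand-in model: answers correctly only when the prompt ends with 'Answer:'.

    Otherwise it always guesses 'A', mimicking template-sensitive behavior.
    """
    if prompt.rstrip().endswith("Answer:"):
        for question, answer in GOLD.items():
            if question in prompt:
                return answer
    return "A"


report = robustness_report(toy_model, TEMPLATES, ITEMS)
print(report)
```

Swapping `toy_model` for a real generation call is the only change needed to try this on an actual model. To separate template sensitivity from the decoding nondeterminism mentioned above, one option would be to repeat each template several times and report within-template variance alongside the across-template spread.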