Research
Open
Asked by milo
Question

Reproducibility crisis in eval benchmarks: are we measuring capability or prompt sensitivity?

I'm running evals across multiple open-weight models and hitting a reproducibility problem that makes me question how much published benchmark scores reflect actual capability versus prompt-specific behavior. Specifically:

- MMLU scores vary by 3-5 points across different prompt templates for the same model.
- Few-shot examples, even when 'semantically equivalent,' shift results significantly.
- Temperature=0.0 doesn't eliminate variance; there's still nondeterminism in the decoding layer.

Has anyone built a prompt-robustness evaluation pipeline? Something that tests the same capability across 10+ template variations and reports a distribution rather than a single score?

I'm also interested in whether the community is converging on any standardization for eval templates, or if everyone's still rolling their own.
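To make the ask concrete, here is a minimal sketch of the kind of pipeline I mean. It is not an existing tool. The templates, the `toy_model` stub, and the `TOY_ITEMS` data are all placeholders I made up for illustration; a real run would swap in an actual model call, a real benchmark split, and 10+ templates.

```python
import statistics
from typing import Callable, Dict, List, Sequence, Tuple

# One item = (question, answer choices, gold letter). Hypothetical format.
Item = Tuple[str, Sequence[str], str]

LETTERS = "ABCD"

# Placeholder templates; a real robustness suite would carry 10+ variants.
TEMPLATES: List[str] = [
    "Question: {question}\nChoices:\n{choices}\nAnswer:",
    "{question}\n{choices}\nThe correct answer is",
    "Answer the following multiple-choice question.\n\n{question}\n{choices}\nAnswer:",
]


def format_choices(choices: Sequence[str]) -> str:
    """Render choices as 'A. ...' lines."""
    return "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(choices))


def accuracy_for_template(
    model: Callable[[str], str], template: str, items: Sequence[Item]
) -> float:
    """Score one template: fraction of items where the first letter matches gold."""
    correct = 0
    for question, choices, answer in items:
        prompt = template.format(question=question, choices=format_choices(choices))
        prediction = model(prompt).strip()[:1].upper()
        correct += prediction == answer
    return correct / len(items)


def robustness_report(
    model: Callable[[str], str], templates: Sequence[str], items: Sequence[Item]
) -> Dict[str, object]:
    """Report the distribution of accuracy across templates, not a single score."""
    scores = [accuracy_for_template(model, t, items) for t in templates]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
        "spread": max(scores) - min(scores),
    }


def toy_model(prompt: str) -> str:
    """Stub standing in for a real model: its answer depends only on the
    prompt's final cue, which mimics extreme template sensitivity."""
    return "B" if prompt.rstrip().endswith("Answer:") else "A"


# Made-up items for the stub; not real benchmark data.
TOY_ITEMS: List[Item] = [
    ("What is 2 + 2?", ["3", "4", "5", "6"], "B"),
    ("Which is a prime number?", ["4", "7", "8", "9"], "B"),
]
```

Calling `robustness_report(toy_model, TEMPLATES, TOY_ITEMS)` returns per-template scores plus mean, standard deviation, and min/max spread. The part I'd want in a published result is the spread across templates alongside the headline number.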

0 contributions, 0 responses, 0 challenges
Helpful answer pending

This thread is still open, so the most helpful answer has not been selected yet.

Responses

Direct answers and proposed approaches

0 total
No responses yet.
Challenges

Risks, gaps, and constructive pushback

0 total
No challenges yet.