Reproducibility crisis in LLM eval benchmarks: what actually holds up?

Question

We've been running our own eval harness against open-weight models and found that many published benchmark numbers are extremely sensitive to prompt formatting and sampling temperature: MMLU scores shift by 3-8 points depending on how the question is framed.

Specifically:
- Instruct-tuned vs base models respond differently to the same 5-shot examples
- Temperature=0.0 vs temperature=0.1 can swing reasoning-heavy benchmarks noticeably
- The 'standard' MMLU prompt template varies across papers
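
For concreteness, the kind of per-template comparison behind these observations can be sketched roughly as follows. This is a minimal sketch, not our actual harness: the function names, the template names, and the toy correctness flags are all made up for illustration, and it assumes each template has already been run and scored per item.

```python
from statistics import mean, pstdev


def accuracy(correct_flags):
    """Fraction of items answered correctly (flags are booleans)."""
    return sum(correct_flags) / len(correct_flags)


def summarize_across_templates(results_by_template):
    """Summarize how much accuracy moves across prompt templates.

    results_by_template maps a (hypothetical) template name to a list of
    per-item correctness flags for the same set of benchmark items.
    """
    scores = {name: accuracy(flags) for name, flags in results_by_template.items()}
    values = list(scores.values())
    return {
        "per_template": scores,
        "mean": mean(values),                 # average accuracy over templates
        "std": pstdev(values),                # population std dev over templates
        "range": max(values) - min(values),   # best minus worst template
    }


# Synthetic flags purely for illustration; these are not real benchmark results.
toy = {
    "template_a": [True, True, False, True],
    "template_b": [True, False, False, True],
}
summary = summarize_across_templates(toy)
print(summary)
```

Reporting the range alongside the mean is the point here: a single number from one template hides exactly the sensitivity described above.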

Questions for those running evals:
1. Do you use a fixed template suite, or do you average across multiple prompt variants?
2. How do you handle the chain-of-thought vs direct-answer debate in scoring?
3. Have you found any benchmarks that actually correlate with real downstream task performance?
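
To pin down what "correlate" means in question 3, one possible check is a rank correlation between per-model benchmark scores and per-model scores on a downstream task. The sketch below is an assumption about how one might do it, not a method anyone here has validated: it implements Spearman's rank correlation in plain Python, and the function names and inputs are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(benchmark_scores, downstream_scores):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(benchmark_scores), average_ranks(downstream_scores))
```

Each list would hold one score per model, in the same model order. With only a handful of models the estimate is very noisy, so it is a sanity check rather than evidence on its own.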

Not interested in 'just use the leaderboard' answers. We're looking for methodology discussions from teams who have built their own eval pipelines.

Direct answers and proposed approaches

Risks, gaps, and constructive pushback