Reproducibility crisis in LLM eval benchmarks: what actually holds up?
We've been running our own eval harness against open-weight models and found that many published benchmark numbers are extremely sensitive to prompt formatting and sampling temperature: differences of 3-8 points on MMLU depending on how you frame the question.

Specifically:

- Instruct-tuned vs base models respond differently to the same 5-shot examples
- Temperature=0.0 vs temperature=0.1 can swing reasoning-heavy benchmarks noticeably
- The 'standard' MMLU prompt template varies across papers

Questions for those running evals:

1. Do you use a fixed template suite, or do you average across multiple prompt variants?
2. How do you handle the chain-of-thought vs direct-answer debate in scoring?
3. Have you found any benchmarks that actually correlate with real downstream task performance?

Not interested in 'just use the leaderboard'. Looking for methodology discussions from teams who built their own eval pipelines.
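
To make question 1 concrete, here is a minimal sketch of the "average across multiple prompt variants" option: score the same items under several templates and report the mean together with the spread, rather than a single number. Everything named here is an assumption for illustration, not any paper's or harness's actual setup: the three entries in `TEMPLATES`, the item dict layout (`question`, `choices`, `answer`), and `model_fn`, a hypothetical callable that takes a prompt string and returns the model's text with sampling settings (e.g. temperature) already pinned.

```python
from statistics import mean, stdev

# Hypothetical prompt variants; none of these is "the" standard MMLU template.
TEMPLATES = {
    "plain": "{question}\n{choices}\nAnswer:",
    "qa": "Question: {question}\n{choices}\nAnswer:",
    "instruct": (
        "Choose the correct option and reply with a single letter.\n\n"
        "{question}\n{choices}\nAnswer:"
    ),
}


def format_choices(choices):
    """Render up to four answer options as 'A. ...' lines."""
    return "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", choices))


def accuracy(model_fn, items, template):
    """Fraction of items where the first letter of the output matches the key."""
    correct = 0
    for item in items:
        prompt = template.format(
            question=item["question"],
            choices=format_choices(item["choices"]),
        )
        # Direct-answer scoring: take only the first non-space character.
        prediction = model_fn(prompt).strip().upper()[:1]
        correct += prediction == item["answer"]
    return correct / len(items)


def evaluate_across_templates(model_fn, items, templates=TEMPLATES):
    """Score the same items under every template and summarize the spread."""
    per_template = {
        name: accuracy(model_fn, items, template)
        for name, template in templates.items()
    }
    scores = list(per_template.values())
    return {
        "per_template": per_template,
        "mean": mean(scores),
        # Sample standard deviation needs at least two templates.
        "stdev": stdev(scores) if len(scores) > 1 else 0.0,
        "range": max(scores) - min(scores),
    }
```

Reporting `range` and `stdev` next to `mean` is the point of the exercise: if the gap between two models is smaller than the spread across templates, the comparison probably isn't telling you much. The first-letter scoring here only covers the direct-answer case; chain-of-thought outputs would need a separate answer-extraction step.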