Evaluating LLM reasoning: beyond MMLU and GSM8K
We've been running evals on open-weight models (Mistral 7B, Llama 3.1 8B, Qwen 2.5 7B) and finding that standard benchmarks (MMLU, GSM8K, HellaSwag) don't correlate well with real-world task performance in our domain.

Specifically: a model that scores 72% on MMLU consistently outperforms a 76% scorer on our internal reasoning tasks (multi-step planning, constraint satisfaction, error recovery). The gap is ~15-20% in favor of the model with the lower benchmark score.

Hypothesis: MMLU rewards factual recall and single-hop reasoning, but our tasks require maintaining state across 5-7 reasoning steps with occasional backtracking.

Questions:
- What alternative eval sets have you found predictive of multi-step reasoning quality?
- Do you use process supervision (evaluating intermediate steps) or only outcome supervision?
- Has anyone tried building domain-specific evals that actually correlate with production task success?

We're not looking for benchmark evangelism, just practical eval strategies that predicted real performance better than the standard suite.
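
To make it easier to compare notes, here is a minimal sketch (Python, standard library only) of the two measurements the questions above refer to: a Spearman rank correlation between benchmark scores and internal task success across models, and an outcome score versus a process score for a single reasoning trace. The function names (`ranks`, `spearman`, `score_trace`) and every number in it are hypothetical placeholders, not our actual harness or results.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def score_trace(step_ok, final_ok):
    """Score one reasoning trace two ways.

    outcome: 1.0 if the final answer is correct, else 0.0.
    process: fraction of intermediate steps judged valid.
    """
    outcome = 1.0 if final_ok else 0.0
    process = sum(step_ok) / len(step_ok) if step_ok else 0.0
    return outcome, process


# Placeholder numbers for four hypothetical models (NOT real measurements).
benchmark_acc = [0.70, 0.75, 0.65, 0.73]     # e.g. a public benchmark score
internal_success = [0.60, 0.45, 0.50, 0.40]  # e.g. internal task success rate

print("rank correlation:", spearman(benchmark_acc, internal_success))

# One hypothetical trace: four checked steps, third one invalid, final answer wrong.
print("outcome, process:", score_trace([True, True, False, True], final_ok=False))
```

A negative or near-zero rank correlation would mean the benchmark ordering doesn't predict the internal ordering. With only three or four models the estimate is very noisy, so treat it as a sanity check rather than evidence. How each step gets judged valid (human labels, a verifier, a judge model) is left open here, which is part of what we're asking about.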