Evaluating LLM reasoning: beyond MMLU and GSM8K

Question

We've been running evals on open-weight models (Mistral 7B, Llama 3.1 8B, Qwen 2.5 7B) and finding that standard benchmarks (MMLU, GSM8K, HellaSwag) don't correlate well with real-world task performance in our domain.

Specifically: a model that scores 72% on MMLU consistently outperforms a 76% scorer on our internal reasoning tasks (multi-step planning, constraint satisfaction, error recovery). The gap is ~15-20% in favor of the model with the lower MMLU score.
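
For anyone who wants to check the same thing on their own models, here is a rough sketch of how the mismatch could be quantified: a Spearman rank correlation between benchmark scores and internal task scores across models. The function names (`average_ranks`, `spearman`) are ours for illustration, not from any library, and with only a handful of models the result is very noisy, so treat it as a sanity check rather than evidence.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over a run of tied values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one side is constant")
    return cov / (var_x * var_y) ** 0.5
```

Usage would be something like `spearman(benchmark_scores, internal_scores)`, with one entry per model in the same order in both lists. A value near 1 means the benchmark ranks models the way the internal tasks do; a value near 0 or below means it does not.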

Hypothesis: MMLU rewards factual recall and single-hop reasoning, but our tasks require maintaining state across 5-7 reasoning steps with occasional backtracking.

Questions:
- What alternative eval sets have you found predictive of multi-step reasoning quality?
- Do you use process supervision (evaluating intermediate steps) or only outcome supervision?
- Has anyone tried building domain-specific evals that actually correlate with production task success?
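
To make the process vs. outcome distinction in the second question concrete, here is a minimal sketch of the two scoring styles. Everything in it is an assumption for illustration: the `Trace` structure, the function names, and the idea that a per-step checker `step_ok` exists at all (in practice that checker is the hard part and might be a rule set, a verifier model, or a human label).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trace:
    """One model attempt: intermediate reasoning steps plus the final answer."""
    steps: List[str]
    final_answer: str


def outcome_score(trace: Trace, expected: str) -> float:
    """Outcome supervision: only the final answer is checked."""
    return 1.0 if trace.final_answer.strip() == expected.strip() else 0.0


def process_score(trace: Trace, step_ok: Callable[[int, str], bool]) -> float:
    """Process supervision: fraction of intermediate steps the checker accepts.

    step_ok is a hypothetical per-step checker taking (step index, step text).
    """
    if not trace.steps:
        return 0.0
    passed = sum(1 for i, step in enumerate(trace.steps) if step_ok(i, step))
    return passed / len(trace.steps)


# Toy example with a hand-written trace (not real model output).
toy = Trace(steps=["x = 3", "y = x + 2", "y = 6"], final_answer="6")
print(outcome_score(toy, expected="5"))                # final answer is wrong
print(process_score(toy, lambda i, s: s != "y = 6"))   # only the last step is rejected
```

The point of tracking both is that they can diverge: a trace can be mostly sound and still fail on outcome, or reach the right answer through steps a checker would reject.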

We're not looking for benchmark evangelism — just practical eval strategies that predicted real performance better than the standard suite.

Direct answers and proposed approaches

Risks, gaps, and constructive pushback