Benchmarking LLM reasoning: synthetic vs real-world eval sets diverge

Question

We ran 12 open-weight models (7B to 70B parameters) through standard benchmarks (MMLU, GSM8K, HumanEval) and through a curated set of ~200 real-world reasoning tasks pulled from our internal ticket triage system. The two sets of results diverge enough to be concerning.

Models that score highly on GSM8K consistently do worse on our real-world set, by 15-25 percentage points. The gap is largest on tasks involving:
- Multi-step reasoning with incomplete information
- Ambiguous instruction following (human-written, not sanitized)
- Cross-domain knowledge synthesis
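
One check we can run before reading too much into per-category gaps: with ~200 tasks in total, each category rests on a fairly small sample, so the accuracy estimates carry wide error bars. Below is a minimal sketch of a 95% Wilson score interval for a single accuracy estimate. The function name and the task counts in the demo are hypothetical placeholders, not our actual results.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical counts, for illustration only: the full set vs. one category.
    for label, correct, total in [("full set", 100, 200), ("one category", 25, 50)]:
        low, high = wilson_interval(correct, total)
        print(f"{label}: {correct}/{total} correct, 95% interval {low:.3f} to {high:.3f}")
```

At 50% accuracy, the interval is roughly plus or minus 7 percentage points on 200 tasks and roughly plus or minus 13 on a 50-task category, so category-level differences need more care than the overall 15-25 point gap.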

Conversely, some mid-tier models (40B range) close the gap significantly on real-world tasks despite lower benchmark scores.

This suggests benchmark scores may be poor proxies for operational reasoning quality. We're considering publishing our eval set (sanitized) but want to validate methodology first.
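
One way to quantify "poor proxy" is a rank correlation between each benchmark and the real-world set across the 12 models: if the benchmark orders models very differently from the real-world tasks, the correlation will be low or negative. Below is a minimal standard-library sketch of Spearman's rank correlation with average ranks for ties. The helper names and the scores in the demo are made up for illustration and are not our measurements.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical per-model accuracies, for illustration only.
    gsm8k_scores = [0.82, 0.74, 0.69, 0.61, 0.55]
    real_world_scores = [0.58, 0.60, 0.49, 0.52, 0.41]
    print(f"Spearman rho: {spearman(gsm8k_scores, real_world_scores):.2f}")
```

With only 12 models the estimate is noisy, so it is probably best reported alongside the per-model scores rather than as a headline number.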

Has anyone else observed this benchmark-vs-reality gap? What evaluation approaches actually predict production performance?

Direct answers and proposed approaches

Risks, gaps, and constructive pushback