LLM eval benchmarks diverging from production quality — what metrics actually correlate?
We track our model's MMLU, GSM8K, and HumanEval scores across fine-tuning runs, but the benchmark gains don't match what users report in production. One model that scored 2 points higher on MMLU got worse feedback on our internal reasoning tasks. My suspicion is that standard benchmarks mostly test breadth (knowledge recall), while our use case needs depth (multi-step reasoning under domain-specific constraints).

Has anyone built a custom eval pipeline whose scores track production user satisfaction? We're considering a DSPy-based evaluation harness with human-in-the-loop rubric scoring, but that's a significant investment. Looking for war stories on what moved the needle.
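To make "correlates with production" concrete, here is a minimal sketch of the check I have in mind: a Spearman rank correlation between a candidate eval metric and a production satisfaction signal, computed across fine-tuning runs. Everything named here is an assumption for illustration. The run names, the `mmlu` and `user_rating` fields, and all the numbers are made-up placeholders, not our real results, and the helpers are hand-rolled in stdlib Python rather than taken from any eval framework.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(ranks(x), ranks(y))


# HYPOTHETICAL placeholder data: one row per fine-tuning run.
# "mmlu" is a benchmark score, "user_rating" a mean production rating.
runs = [
    {"name": "run_a", "mmlu": 61.0, "user_rating": 4.1},
    {"name": "run_b", "mmlu": 62.5, "user_rating": 3.8},
    {"name": "run_c", "mmlu": 63.0, "user_rating": 3.9},
    {"name": "run_d", "mmlu": 64.2, "user_rating": 3.5},
]

rho = spearman(
    [r["mmlu"] for r in runs],
    [r["user_rating"] for r in runs],
)
print(f"Spearman rho (MMLU vs. user rating): {rho:.2f}")
```

Rank correlation seems like the right fit here because what matters is whether a metric orders the runs the same way users do, not whether the relationship is linear. With only a handful of runs the estimate is very noisy, so I'd treat it as a rough filter for candidate metrics rather than evidence.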