LLM benchmark design: are we measuring capability or prompt compliance?

Question

Recent papers on LLM evaluation increasingly suggest that many benchmarks conflate two different things: (1) the model's actual reasoning capability, and (2) its ability to follow the specific prompt format the benchmark uses.

When you rephrase the same logical problem with slightly different framing, scores for the same model can swing by 10-15%. This has been known for GSM8K for a while (answer format sensitivity), but it's showing up everywhere now, both in domain-specific evals like MedQA and in broad ones like MMLU-Pro.

The methodological question: should benchmarks be testing with multiple prompt variants per item and aggregating, or is there a way to design format-invariant evaluations?
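To make the first option concrete, here is a minimal sketch of what "multiple variants per item, then aggregate" could look like. The data layout (item id mapped to one correctness flag per prompt variant), the function name variant_report, and the toy results are all assumptions for illustration, not taken from any existing harness.

```python
from statistics import mean

def variant_report(results):
    """Summarize correctness across prompt variants.

    `results` maps item_id -> list of booleans, one per prompt variant,
    with the same variant order for every item (hypothetical layout).
    """
    n_variants = len(next(iter(results.values())))
    # Accuracy of each variant on its own: what a single-prompt benchmark would report.
    per_variant = [
        mean(1.0 if flags[v] else 0.0 for flags in results.values())
        for v in range(n_variants)
    ]
    # Strict score: an item counts only if every variant was answered correctly.
    all_correct = mean(1.0 if all(flags) else 0.0 for flags in results.values())
    return {
        "per_variant": per_variant,
        "mean_accuracy": mean(per_variant),
        "all_variants_correct": all_correct,
        # Gap between the best and worst variant: a rough format-sensitivity signal.
        "spread": max(per_variant) - min(per_variant),
    }

# Toy data: 4 items, 3 prompt variants each (made up for illustration).
results = {
    "q1": [True, True, True],
    "q2": [True, False, True],
    "q3": [False, False, True],
    "q4": [True, True, False],
}
report = variant_report(results)
print(report)
```

Reporting the spread and the all-variants-correct score next to the mean is one possible design choice: the mean alone hides exactly the variance this question is about.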

I'm particularly interested in work that separates format-following from actual task performance. Anyone running dual-prompt evals or using rubric-based graders instead of exact-match?
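For concreteness, the kind of separation I have in mind looks roughly like the sketch below: score format compliance and answer correctness independently, then compare both against plain exact-match. The "Answer: <number>" target format, the last-number-in-the-text heuristic, and the function name score_output are assumptions chosen for illustration. A real rubric-based grader would be considerably more involved.

```python
import re

# Hypothetical target format: a line of the form "Answer: <number>".
STRICT = re.compile(r"^Answer:\s*(-?\d+(?:\.\d+)?)\s*$", re.MULTILINE)
# Lenient fallback: any number anywhere in the output.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def score_output(output, gold):
    """Score one model output on format and content separately."""
    strict = STRICT.search(output)
    numbers = NUMBER.findall(output)
    # Heuristic (an assumption): treat the last number mentioned as the answer.
    lenient = numbers[-1] if numbers else None
    format_ok = strict is not None
    content_ok = lenient is not None and float(lenient) == float(gold)
    # Exact-match style scoring: right answer AND right format.
    exact_match = format_ok and float(strict.group(1)) == float(gold)
    return {"format_ok": format_ok, "content_ok": content_ok, "exact_match": exact_match}

compliant = score_output("Answer: 42", "42")
off_format = score_output("So the total is 42 apples.", "42")
wrong = score_output("Answer: 41", "42")
print(compliant, off_format, wrong)
```

The off-format case is the interesting one: exact-match counts it as a failure even though the answer is right, which is the conflation I'm asking about.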

Direct answers and proposed approaches

Risks, gaps, and constructive pushback