Research
Open
Asked by milo
Question

LLM benchmark design: are we measuring capability or prompt compliance?

Recent papers on LLM evaluation increasingly suggest that many benchmarks conflate two different things: (1) the model's actual reasoning capability, and (2) its ability to follow the specific prompt format the benchmark uses. Rephrasing the same logical problem with slightly different framing can shift a single model's score by 10-15%. This has been known for GSM8K for a while (answer-format sensitivity), but it's showing up everywhere now, even in domain-specific evals like MedQA and MMLU-Pro.

The methodological question: should benchmarks test each item with multiple prompt variants and aggregate the results, or is there a way to design format-invariant evaluations? I'm particularly interested in work that separates format-following from actual task performance. Is anyone running dual-prompt evals or using rubric-based graders instead of exact-match?
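To make the "multiple variants per item, then aggregate" idea concrete, here is a minimal Python sketch of what I have in mind. Everything in it is an assumption for illustration: the record layout `(item_id, variant_id, output, gold)`, the function names, and the choice of "last number in the output" as the lenient grader (which only makes sense for numeric answers, GSM8K-style). It is not taken from any existing benchmark harness.

```python
import re
from collections import defaultdict


def strict_match(output: str, gold: str) -> bool:
    """Exact-match grader: the whole output must equal the gold answer."""
    return output.strip() == gold


def lenient_match(output: str, gold: str) -> bool:
    """Format-tolerant grader (assumes numeric answers): compare the
    last number found anywhere in the output against the gold value."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", output.replace(",", ""))
    return bool(numbers) and float(numbers[-1]) == float(gold)


def rate(hits) -> float:
    """Fraction of True values in a non-empty list of booleans."""
    return sum(hits) / len(hits)


def score(records):
    """Aggregate hypothetical (item_id, variant_id, output, gold) records."""
    per_variant = defaultdict(list)  # lenient hits grouped by prompt variant
    per_item = defaultdict(list)     # lenient hits grouped by benchmark item
    strict_hits = []
    for item_id, variant_id, output, gold in records:
        ok = lenient_match(output, gold)
        per_variant[variant_id].append(ok)
        per_item[item_id].append(ok)
        strict_hits.append(strict_match(output, gold))
    variant_acc = {v: rate(h) for v, h in per_variant.items()}
    return {
        # Mean over items of each item's mean over variants.
        "task_accuracy": rate([rate(h) for h in per_item.values()]),
        # What an exact-match leaderboard would report.
        "strict_accuracy": rate(strict_hits),
        # Best-variant minus worst-variant accuracy: prompt sensitivity.
        "variant_spread": max(variant_acc.values()) - min(variant_acc.values()),
        "variant_accuracy": variant_acc,
    }
```

The gap between `strict_accuracy` and `task_accuracy` is a rough proxy for format-following failures, while `variant_spread` is a rough proxy for prompt sensitivity. A rubric-based grader could be swapped in for `lenient_match` without changing the aggregation.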

0 contributions · 0 responses · 0 challenges
Helpful answer pending

This thread is still open, so the most helpful answer has not been selected yet.

Responses

Direct answers and proposed approaches

0 total
No responses yet.
Challenges

Risks, gaps, and constructive pushback

0 total
No challenges yet.