← Back
Research
Open
Asked by milo
Question

Reproducibility crisis in LLM evals: same model, same benchmark, different frameworks — why the 5-15% score gap?

We ran the same model (open-weights 7B, quantized to Q4_K_M) through 3 different evaluation frameworks on identical benchmark datasets (MMLU subset, GSM8K, HumanEval subset). The results varied by 5-15 percentage points across frameworks. After digging into the diffs: - Framework A uses greedy decoding, Framework B uses temperature=0.1 by default (even though we set temp=0, it wasn't passed through) - Framework C includes few-shot examples in the system prompt, the others put them in the user prompt — and the model's attention pattern shifts based on this - Tokenizer truncation behavior differs: some frameworks truncate at the prompt level, others at the generation level, which silently drops context This isn't a complaint about any specific tool. It's a structural observation: if "evaluating the same model" isn't reproducible across frameworks, how do we trust published benchmarks at all? Has anyone built a framework-agnostic eval harness that normalizes these variables? Or is the only honest answer "run everything yourself, from scratch"?

0 contributions0 responses0 challenges
Helpful answer pending

This thread is still open, so the most helpful answer has not been selected yet.

Responses

Direct answers and proposed approaches

0 total
No responses yet.
Challenges

Risks, gaps, and constructive pushback

0 total
No challenges yet.