Reproducibility crisis in LLM evals: same model, same benchmark, different frameworks. Why the 5-15 point score gap?

Question

We ran the same model (open-weights 7B, quantized to Q4_K_M) through 3 different evaluation frameworks on identical benchmark datasets (MMLU subset, GSM8K, HumanEval subset). The results varied by 5-15 percentage points across frameworks.

After digging into the diffs:

- Framework A uses greedy decoding, while Framework B defaults to temperature=0.1. We set temp=0, but the setting wasn't passed through.
- Framework C puts the few-shot examples in the system prompt, while the others put them in the user prompt, so the model sees a differently structured prompt and its answers appear to shift as a result.
- Tokenizer truncation behavior differs: some frameworks truncate at the prompt level and others at the generation level, which can silently drop context.
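
To make the first and third differences concrete, here is a minimal toy sketch in plain Python. It is not the code of any of the three frameworks. The function names, the near-tied logits, and the left/right truncation sides are illustrative assumptions; the real frameworks may truncate in other ways.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Pick a token index: argmax when temperature == 0, else softmax sampling."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def truncate(tokens, max_len, side):
    """Keep at most max_len tokens, dropping from the given side ("left" or "right")."""
    if max_len <= 0:
        return []
    if len(tokens) <= max_len:
        return list(tokens)
    return tokens[-max_len:] if side == "left" else tokens[:max_len]


# Hypothetical near-tied logits: the kind of position where a small
# temperature can still change the chosen token.
logits = [2.0, 1.9, 0.0]
rng = random.Random(0)
greedy_picks = {sample_token(logits, 0.0, rng) for _ in range(1000)}
warm_picks = {sample_token(logits, 0.1, rng) for _ in range(1000)}
print("greedy picks:", greedy_picks)
print("temperature=0.1 picks:", warm_picks)

# Hypothetical 4-token prompt with a 2-token budget.
prompt = ["shot_1", "shot_2", "shot_3", "question"]
kept_left = truncate(prompt, 2, "left")    # keeps the end: the question survives
kept_right = truncate(prompt, 2, "right")  # keeps the start: the question is dropped
print(kept_left, kept_right)
```

Greedy decoding always returns the top-scoring token, while temperature=0.1 occasionally picks the runner-up when two logits are close. Likewise, the same length budget either preserves or discards the question depending on which side is cut, and neither case raises an error.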

This isn't a complaint about any specific tool. It's a structural observation: if "evaluating the same model" isn't reproducible across frameworks, how do we trust published benchmarks at all?

Has anyone built a framework-agnostic eval harness that normalizes these variables? Or is the only honest answer "run everything yourself, from scratch"?
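
Short of a full harness, one starting point might be to record the effective settings of every run and diff them before comparing scores. Below is a minimal sketch of that idea using only the standard library. `RunManifest`, its fields, and the example values (2048 tokens, left truncation, the prompt strings) are hypothetical and would have to be filled in by hand from whatever each framework actually sends to the model.

```python
import hashlib
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    """Effective settings of one eval run (all field names are hypothetical)."""
    framework: str
    temperature: float
    few_shot_role: str          # "system" or "user"
    truncation_side: str        # "left" or "right"
    max_prompt_tokens: int
    rendered_prompt_sha256: str


def prompt_digest(rendered_prompt: str) -> str:
    """Hash the fully rendered prompt so two runs can be compared byte for byte."""
    return hashlib.sha256(rendered_prompt.encode("utf-8")).hexdigest()


def diff_manifests(a: RunManifest, b: RunManifest) -> dict:
    """Return {field: (value_in_a, value_in_b)} for every setting that differs."""
    left, right = asdict(a), asdict(b)
    return {
        key: (left[key], right[key])
        for key in left
        if key != "framework" and left[key] != right[key]
    }


# Hypothetical rendered prompts: same few-shot example, different placement.
prompt_user = "SYSTEM: You are a helpful assistant.\nUSER: Q: 1+1? A: 2\nQ: 2+2?"
prompt_system = "SYSTEM: You are a helpful assistant. Q: 1+1? A: 2\nUSER: Q: 2+2?"

run_a = RunManifest("A", 0.0, "user", "left", 2048, prompt_digest(prompt_user))
run_b = RunManifest("B", 0.1, "user", "left", 2048, prompt_digest(prompt_user))
run_c = RunManifest("C", 0.0, "system", "left", 2048, prompt_digest(prompt_system))

diff_ab = diff_manifests(run_a, run_b)
diff_ac = diff_manifests(run_a, run_c)
print(diff_ab)
print(sorted(diff_ac))
```

If the diff between two runs is non-empty, the scores are not measuring the same thing, and the diff names which variable to fix first. This does not normalize anything by itself; it only makes silent defaults visible.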


Direct answers and proposed approaches

Risks, gaps, and constructive pushback