Reproducibility crisis in LLM eval benchmarks — your experience?

Question

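We ran MMLU, GSM8K, and HumanEval on the same model (Llama-3.1-70B) across three inference backends: vLLM, TGI, and llama.cpp (Q6_K quant). Same checkpoint and same prompts, but tokenizer implementations and settings differ per backend, and the llama.cpp run uses quantized weights, so it is not bit-identical to the other two.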

Accuracy varied by 4-7 percentage points across backends on MMLU alone. GSM8K was worse, with a 9-point spread.
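
A spread like this is easier to interpret per item than as an aggregate score. Here is a minimal sketch, assuming you have a correct/incorrect flag for every question from two backends, in the same question order. The function name and the toy data are made up for illustration and are not taken from our runs.

```python
def paired_disagreement(correct_a, correct_b):
    """Compare two runs item by item.

    correct_a, correct_b: lists of bools for the same questions, same order.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both runs must cover the same questions")
    n = len(correct_a)
    if n == 0:
        raise ValueError("no items to compare")
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    return {
        "n": n,
        "only_a_correct": only_a,
        "only_b_correct": only_b,
        # Net accuracy gap in percentage points (A minus B).
        "gap_points": 100.0 * (only_a - only_b) / n,
        # Share of items whose outcome flipped in either direction.
        "flip_rate": (only_a + only_b) / n,
    }


# Toy data, invented for illustration only.
run_a = [True, True, False, True, False, True, True, False]
run_b = [True, False, False, True, True, False, True, False]
stats = paired_disagreement(run_a, run_b)
print(stats)
```

If many items flip but the net gap is small, the backends are probably diverging in a noise-like way. If the flips mostly go in one direction, that suggests a systematic cause such as truncation or prompt formatting.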

Suspected causes so far:
- Sampling parameter differences (we set temperature and top_p to 0, but each backend interprets "0" slightly differently, so the runs may not all be doing true greedy decoding)
- Tokenizer implementation differences (some use fast tokenizers, some don't)
- Max-new-tokens clipping (GSM8K answers get cut off at different lengths)
- Few-shot prompt formatting (whitespace, separator tokens)
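
Two of these causes can be checked mechanically. Here is a rough sketch, assuming you can get the prompt's token IDs out of each backend and that each completion comes with a count of generated tokens. The helper names are hypothetical, and the `#### <number>` pattern assumes the prompt asks for the reference-style GSM8K final-answer format.

```python
import re

# Assumption: final answers look like "#### 42", as in the GSM8K references.
# Change this pattern if your prompt requests a different answer format.
FINAL_ANSWER = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def first_divergence(ids_a, ids_b):
    """Return the index of the first differing token ID, or None if identical."""
    for i, (a, b) in enumerate(zip(ids_a, ids_b)):
        if a != b:
            return i
    if len(ids_a) != len(ids_b):
        # One sequence is a strict prefix of the other.
        return min(len(ids_a), len(ids_b))
    return None


def classify_completion(text, n_generated, max_new_tokens):
    """Label a completion as 'answered', 'truncated', or 'no_answer'."""
    if FINAL_ANSWER.search(text):
        return "answered"
    if n_generated >= max_new_tokens:
        # Hit the token budget before producing a final answer.
        return "truncated"
    return "no_answer"
```

Running `first_divergence` on the same few-shot prompt for each pair of backends shows tokenizer or whitespace mismatches before any generation happens. Counting `truncated` labels per backend shows how much of the GSM8K gap is just clipping.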

Has anyone done a systematic cross-backend comparison? We're considering standardizing on a single eval harness that locks in tokenizer + sampling params at the harness level rather than delegating to the backend.
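
To make the harness-level idea concrete, here is a sketch of what we have in mind, not a finished design. `EvalSpec`, `run_record`, and all field names are hypothetical, and the placeholder values are deliberately left unfilled. The harness pins everything that should be identical across backends, hashes it, and stores the hash with each run together with the backend details.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EvalSpec:
    """Settings the harness pins, independent of the backend."""
    tokenizer_id: str        # placeholder identifier for the tokenizer files
    tokenizer_revision: str  # exact revision of those files
    temperature: float
    top_p: float
    max_new_tokens: int
    num_fewshot: int
    fewshot_separator: str
    stop_sequences: tuple

    def fingerprint(self) -> str:
        # Stable serialization so identical specs always hash the same way.
        payload = json.dumps(asdict(self), sort_keys=True, ensure_ascii=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def run_record(spec, backend, backend_version, weights_format):
    """Bundle the pinned spec with the backend details for one run."""
    return {
        "spec": asdict(spec),
        "spec_fingerprint": spec.fingerprint(),
        "backend": backend,
        "backend_version": backend_version,
        "weights_format": weights_format,
    }


# Illustrative values only; the placeholders are intentionally not filled in.
spec = EvalSpec(
    tokenizer_id="<tokenizer-repo>",
    tokenizer_revision="<commit-sha>",
    temperature=0.0,
    top_p=1.0,
    max_new_tokens=512,
    num_fewshot=8,
    fewshot_separator="\n\n",
    stop_sequences=("\n\nQuestion:",),
)
record = run_record(spec, "vllm", "<version>", "bf16")
print(record["spec_fingerprint"])
```

Two runs are only comparable if their fingerprints match. The backend and weights format are kept out of the hash on purpose, so the same spec gives the same fingerprint on every backend.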

Also interested in whether the academic community is aware of this issue — we found very few papers that report the inference backend they used for evaluation, which makes cross-paper comparison nearly meaningless.

Direct answers and proposed approaches

Risks, gaps, and constructive pushback