Reproducibility crisis in LLM eval benchmarks — your experience?
We ran MMLU, GSM8K, and HumanEval on the same model (Llama-3.1-70B) across three inference backends: vLLM, TGI, and llama.cpp (Q6_K quant). Same checkpoint and same prompts, but each backend used its own tokenizer implementation and settings. Note that the llama.cpp run is quantized, so its weights are not bit-identical to the other two.

Results varied by 4-7 percentage points across backends on MMLU alone. GSM8K was worse, with a 9-point spread.

Suspected causes so far:

- Temperature and top_p differences (we set them to 0, but each backend interprets "0" slightly differently)
- Tokenizer implementation differences (some use fast tokenizers, some don't)
- Max-new-tokens clipping (GSM8K answers get cut off at different lengths)
- Few-shot prompt formatting (whitespace, separator tokens)

Has anyone done a systematic cross-backend comparison? We're considering standardizing on a single eval harness that locks in the tokenizer and sampling params at the harness level rather than delegating them to the backend.

We'd also like to know whether the academic community is aware of this issue. We found very few papers that report which inference backend they used for evaluation, which makes cross-paper comparison nearly meaningless.
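To make the "lock it in at the harness level" idea concrete, here is a minimal sketch of what we have in mind. It is not taken from any existing harness: `EvalConfig`, its field names, the default values, and `diff_configs` are all hypothetical. The idea is that the harness owns one frozen config per run, stores a short fingerprint of it next to every score, and can show exactly which settings differ between two runs.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EvalConfig:
    # Hypothetical field names and defaults, chosen for illustration only.
    backend: str                    # e.g. "vllm", "tgi", "llama.cpp"
    tokenizer_id: str               # identifier of the tokenizer the harness loads
    temperature: float = 0.0
    top_p: float = 1.0
    max_new_tokens: int = 512
    num_fewshot: int = 5
    fewshot_separator: str = "\n\n"

    def fingerprint(self) -> str:
        """Short, stable hash of every setting, to store alongside each score."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def diff_configs(a: EvalConfig, b: EvalConfig) -> dict:
    """Return {field: (value_in_a, value_in_b)} for every setting that differs."""
    da, db = asdict(a), asdict(b)
    return {key: (da[key], db[key]) for key in da if da[key] != db[key]}


if __name__ == "__main__":
    run_a = EvalConfig(backend="vllm", tokenizer_id="example-tokenizer")
    run_b = EvalConfig(backend="tgi", tokenizer_id="example-tokenizer",
                       max_new_tokens=256)
    print(run_a.fingerprint(), run_b.fingerprint())
    print(diff_configs(run_a, run_b))
```

Two runs with different fingerprints are not directly comparable, and `diff_configs` tells you why. This only records what the harness asked for. It does not prove that the backend honored those settings, so it would need to be paired with per-example output comparison to catch things like truncated GSM8K answers.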