Reproducibility crisis in ML benchmarking: same model, same dataset, different accuracy across runs

Question

Observation from a meta-study I'm compiling: running the same transformer model (Llama-2-7B) on MMLU with the same prompt template yields accuracy that swings by ±2.3% from run to run on identical hardware.

Sources of variance I've tracked so far:
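- Floating-point nondeterminism in cuBLAS, which persists even with CUBLAS_WORKSPACE_CONFIG=:4096:8 set (I also set CUDA_LAUNCH_BLOCKING=1, though that only makes kernel launches synchronous and is not itself a determinism control)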
- Tokenization edge cases: different minor versions of tokenizers handle whitespace normalization differently
- Decoding: even at temp=0 (greedy), some backends use different argmax implementations, so near-tied logits can resolve to different tokens
- Batch size effects: inference with batch_size=1 vs batch_size=32 shows up to 0.8% accuracy delta on reasoning tasks
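
To make the floating-point and argmax bullets concrete, here is a toy sketch in plain Python. It is not a model of what cuBLAS actually does internally: the magnitudes are deliberately exaggerated, and `sum_in_order` and `greedy_pick` are hypothetical helpers. It only shows that float addition is not associative, so a different reduction order (which a different kernel or batch size can cause) may flip a greedy pick between two close logits.

```python
def sum_in_order(values, order):
    """Accumulate values left to right, visiting indices in the given order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


def greedy_pick(logits):
    """Plain argmax over a dict of option -> logit."""
    return max(logits, key=logits.get)


# The same four terms, summed in two different orders.
# Exaggerated on purpose: 1e16 + 1.0 rounds back to 1e16 in float64.
terms = [1e16, 1.0, -1e16, 1.0]
logit_a_run1 = sum_in_order(terms, [0, 1, 2, 3])  # the first 1.0 is lost to rounding
logit_a_run2 = sum_in_order(terms, [0, 2, 1, 3])  # big terms cancel first, nothing lost

logit_b = 1.5  # a competing answer option with a nearby logit

pick_run1 = greedy_pick({"A": logit_a_run1, "B": logit_b})
pick_run2 = greedy_pick({"A": logit_a_run2, "B": logit_b})

print(logit_a_run1, logit_a_run2)  # 1.0 2.0
print(pick_run1, pick_run2)        # B A
```

In a real model the differences are many orders of magnitude smaller, so only questions where two answer logits are nearly tied would be affected.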

The community routinely reports 'SOTA' improvements of 0.5-1.5%. If the noise floor is ±2.3%, most of these gains sit inside run-to-run noise and cannot be called statistically significant on the basis of a single run.
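
A rough back-of-the-envelope check of that claim, under assumptions that are mine and not measured: I treat ±2.3 as the standard deviation of per-run accuracy (in accuracy points), assume both models being compared have the same spread, and use a normal approximation (z = 1.96 for about 95%). The function names are hypothetical.

```python
import math


def min_detectable_gain(run_std, n_runs, z=1.96):
    """Smallest difference in mean accuracy distinguishable from noise when
    each model is run n_runs times (two-sample, equal std, normal approx)."""
    return z * run_std * math.sqrt(2.0 / n_runs)


def runs_needed(run_std, gain, z=1.96):
    """Runs per model needed before a given gain clears the noise."""
    return math.ceil(2.0 * (z * run_std / gain) ** 2)


RUN_STD = 2.3  # ASSUMPTION: the reported ±2.3 read as a standard deviation

for n in (1, 5, 20):
    print(n, "runs per model ->", round(min_detectable_gain(RUN_STD, n), 2), "points")

for gain in (0.5, 1.0, 1.5):
    print("gain of", gain, "->", runs_needed(RUN_STD, gain), "runs per model")
```

If ±2.3 is actually a min-max range and not a standard deviation, the real requirement is lower, but the qualitative point stands: a single-run comparison cannot resolve a gain of about 1 point.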

Has anyone implemented a rigorous benchmarking pipeline that actually controls for this? Looking for: fixed seeds + deterministic cuDNN + pinned tokenizer version + batch_size=1 inference + multiple runs with confidence intervals.
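
For reference, this is roughly the shape I have in mind: a minimal sketch, not a tested pipeline. `evaluate` is a hypothetical callable standing in for the actual MMLU harness (it takes a seed and returns an accuracy). The PyTorch settings are applied only if torch is importable. The interval uses a normal approximation, which is too narrow for a handful of runs; a t-interval or bootstrap would be safer.

```python
import os
import random
import statistics
from typing import Callable, Sequence, Tuple


def configure_determinism(seed: int) -> None:
    """Apply the usual determinism knobs. Call before any CUDA work starts."""
    # cuBLAS reads this at initialization, so set it as early as possible.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    random.seed(seed)
    try:
        import torch
    except ImportError:
        return  # sketch still works without PyTorch installed
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True)


def confidence_interval(
    accuracies: Sequence[float], confidence: float = 0.95
) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a normal approximation."""
    mean = statistics.fmean(accuracies)
    sem = statistics.stdev(accuracies) / len(accuracies) ** 0.5
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, mean - z * sem, mean + z * sem


def benchmark(
    evaluate: Callable[[int], float], seeds: Sequence[int]
) -> Tuple[float, float, float]:
    """Run the (hypothetical) evaluate callable once per seed and summarize."""
    accuracies = []
    for seed in seeds:
        configure_determinism(seed)
        accuracies.append(evaluate(seed))
    return confidence_interval(accuracies)
```

Usage would look like `mean, lo, hi = benchmark(run_mmlu, seeds=range(10))`, where `run_mmlu` is whatever function loads the pinned tokenizer and model and runs batch_size=1 inference.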

Bonus question: does anyone track this systematically across model families, or is every lab running their own informal variance checks?
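
If nobody tracks this yet, one low-effort starting point might be a shared per-run record that captures the variables listed above. The sketch below is only an illustration: `RunRecord`, its fields, and the helper names are hypothetical and not an existing schema, and the accuracy in the usage line is a placeholder, not a measured result.

```python
import json
import platform
from dataclasses import asdict, dataclass
from importlib import metadata


def installed_version(package: str) -> str:
    """Look up an installed package version, or report that it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


@dataclass
class RunRecord:
    model: str
    benchmark: str
    seed: int
    batch_size: int
    accuracy: float
    tokenizers_version: str
    torch_version: str
    python_version: str


def make_record(model, benchmark, seed, batch_size, accuracy) -> RunRecord:
    """Attach environment details to one run's result."""
    return RunRecord(
        model=model,
        benchmark=benchmark,
        seed=seed,
        batch_size=batch_size,
        accuracy=accuracy,
        tokenizers_version=installed_version("tokenizers"),
        torch_version=installed_version("torch"),
        python_version=platform.python_version(),
    )


def to_jsonl_line(record: RunRecord) -> str:
    """Serialize one record as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


# Placeholder values for illustration only.
print(to_jsonl_line(make_record("Llama-2-7B", "MMLU", seed=0, batch_size=1, accuracy=0.0)))
```

Appending one such line per run would make it possible to group by tokenizer version or batch size later and see which factor accounts for the spread.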

Direct answers and proposed approaches

Risks, gaps, and constructive pushback