Reproducibility crisis in ML benchmarking: same model, same dataset, different accuracy across runs
Observation from a meta-study I'm compiling: running the same transformer model (Llama-2-7B) on MMLU with the same prompt template yields accuracy that varies by ±2.3% across runs on identical hardware.

Sources of variance I've tracked so far:

- Floating-point nondeterminism in cuBLAS (even with CUDA_LAUNCH_BLOCKING=1 and CUBLAS_WORKSPACE_CONFIG=:4096:8)
- Tokenization edge cases: different minor versions of tokenizers handle whitespace normalization differently
- Greedy decoding: even at temp=0, some backends use different argmax implementations
- Batch size effects: inference with batch_size=1 vs batch_size=32 shows up to a 0.8% accuracy delta on reasoning tasks

The community routinely reports 'SOTA' improvements of 0.5-1.5%. If the noise floor is ±2.3%, most of these claims cannot be distinguished from run-to-run noise.

Has anyone implemented a rigorous benchmarking pipeline that actually controls for this? Looking for: fixed seeds + deterministic cuDNN + pinned tokenizer version + single-batch inference + multiple runs with confidence intervals.

Bonus question: does anyone track this systematically across model families, or is every lab running their own informal variance checks?
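To make the "multiple runs with confidence intervals" item concrete, here is a minimal sketch of the kind of check I mean. It uses only the Python standard library. The helper names (`mean_ci`, `intervals_overlap`) and the accuracy values are hypothetical placeholders, not results from the meta-study; the only real data in it is the standard table of two-sided 95% Student-t critical values.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, keyed by degrees of freedom (n - 1).
T_CRIT_95 = {
    1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
    6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262,
}


def mean_ci(accuracies):
    """Return (mean, lower, upper) for a 95% CI over per-run accuracies."""
    n = len(accuracies)
    if n < 2:
        raise ValueError("need at least two runs to estimate a spread")
    # Beyond the table, fall back to the normal quantile (about 1.96),
    # which slightly understates the interval for moderate n.
    t = T_CRIT_95.get(n - 1, statistics.NormalDist().inv_cdf(0.975))
    mean = statistics.fmean(accuracies)
    half_width = t * statistics.stdev(accuracies) / math.sqrt(n)
    return mean, mean - half_width, mean + half_width


def intervals_overlap(a, b):
    """True if the (mean, lower, upper) intervals a and b overlap."""
    return a[1] <= b[2] and b[1] <= a[2]


# Hypothetical per-run accuracies for two models, five runs each.
baseline_runs = [0.452, 0.461, 0.448, 0.457, 0.455]
candidate_runs = [0.459, 0.466, 0.453, 0.462, 0.468]

baseline = mean_ci(baseline_runs)
candidate = mean_ci(candidate_runs)
print("baseline  mean=%.4f  CI=[%.4f, %.4f]" % baseline)
print("candidate mean=%.4f  CI=[%.4f, %.4f]" % candidate)
print("intervals overlap:", intervals_overlap(baseline, candidate))
```

Requiring non-overlapping intervals is a deliberately conservative criterion. A paired comparison on per-question correctness would be tighter, since both models see the same MMLU items, but it needs per-example outputs rather than just the headline accuracy.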