Reproducibility crisis in ML papers: what's the actual barrier to running someone else's code?
I've been trying to reproduce results from 3 recent papers (2024-2025) in the NLP fine-tuning space. The experience has been... frustrating.

Paper 1: Code repo exists but is pinned to CUDA 11.8 + PyTorch 2.0. Our cluster runs CUDA 12.1 / PyTorch 2.2. Downgrading broke half the dependencies. The authors didn't provide a Dockerfile.

Paper 2: Code repo is a dead link. The authors responded to email saying 'will fix soon', but that was 4 months ago. The supplementary PDF has hyperparameter tables but no seed values.

Paper 3: Full Docker image provided, but the training script hardcodes paths to the authors' internal S3 bucket. No instructions for substituting your own data.

This isn't about blaming individual authors. The incentive structure rewards novelty over reproducibility. But I want to understand: what are the actual most common blockers? From my (very small) sample: missing environment specs (2 of 3 papers), dead links (1 of 3), hardcoded infrastructure (1 of 3).

For those who maintain reproducibility checklists or run reproducibility studies:

1. What's your taxonomy of failure modes?
2. What's the one thing authors could do that would have the biggest impact on reproducibility?
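To make that last question concrete, here is a rough sketch of the kind of thing I'd have hoped to find in each repo. It isn't taken from any of the three papers. It seeds the RNGs, reads the data location from an environment variable instead of a hardcoded bucket, and writes out the environment it ran in. The names `set_seed`, `resolve_data_dir`, `build_manifest`, and the `REPRO_DATA_DIR` variable are all made up for illustration, and the package list is just a guess at what a fine-tuning repo would care about.

```python
import json
import os
import platform
import random
import sys
from importlib import metadata


def set_seed(seed: int) -> None:
    """Seed the RNGs that are available; NumPy and PyTorch are optional here."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass


def resolve_data_dir(default: str = "./data") -> str:
    """Take the data location from an env var (hypothetical name), not a hardcoded path."""
    return os.environ.get("REPRO_DATA_DIR", default)


def build_manifest(seed: int, packages=("torch", "numpy", "transformers")) -> dict:
    """Collect the seed, Python/platform info, and installed package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "seed": seed,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "data_dir": resolve_data_dir(),
        "packages": versions,
    }


if __name__ == "__main__":
    seed = 1234  # arbitrary example value
    set_seed(seed)
    print(json.dumps(build_manifest(seed), indent=2))
```

Saving that JSON next to every reported result would have covered the missing seeds and version info I ran into. Seeding alone doesn't guarantee bit-identical GPU runs, and none of this helps with a dead repo link, so treat it as a floor rather than a fix.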