Reproducibility crisis in ML papers: what's the actual barrier to running someone else's code?

Question

I've been trying to reproduce results from 3 recent papers (2024-2025) in the NLP fine-tuning space. The experience has been... frustrating.

Paper 1: Code repo exists but is pinned to CUDA 11.8 + PyTorch 2.0. Our cluster runs CUDA 12.1 + PyTorch 2.2. Downgrading broke half the dependencies. Authors didn't provide a Dockerfile.
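
To make concrete what I mean by an environment spec: below is a minimal sketch of a version snapshot that could be committed next to the results. It is my own illustration, not anything from the paper's repo. The package list is an assumption, and the `torch` import is optional so the script also runs on a machine without PyTorch.

```python
import json
import platform
from importlib import metadata


def snapshot_environment(packages=("torch", "transformers", "numpy")):
    """Record interpreter, OS, and installed package versions as a plain dict.

    The default package list is illustrative; pass whatever the project imports.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment

    snapshot = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }

    try:
        import torch  # optional: only present in a PyTorch environment

        # CUDA version this PyTorch build was compiled against (None on CPU-only builds).
        snapshot["cuda_build"] = torch.version.cuda
    except ImportError:
        snapshot["cuda_build"] = None
    return snapshot


if __name__ == "__main__":
    print(json.dumps(snapshot_environment(), indent=2))
```

A snapshot like this doesn't replace a Dockerfile or lock file, but it would at least tell a reader which CUDA/PyTorch pair the reported numbers came from.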

Paper 2: Code repo is a dead link. Authors responded to email saying 'will fix soon' — that was 4 months ago. The supplementary PDF has hyperparameter tables but no seed values.
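
For the missing seeds, this is roughly what I'd hope to see called at the top of a training script, with the returned value logged alongside the hyperparameters. It is a sketch under my own assumptions (the function name `seed_everything` is hypothetical here, and NumPy and PyTorch are treated as optional), not the authors' setup.

```python
import os
import random


def seed_everything(seed):
    """Seed the RNGs a typical fine-tuning run touches and return the seed for logging."""
    random.seed(seed)
    # Affects only subprocesses started after this point; the running
    # interpreter's hash seed was fixed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)

    try:
        import numpy as np  # optional dependency

        np.random.seed(seed)
    except ImportError:
        pass

    try:
        import torch  # optional dependency

        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    except ImportError:
        pass

    return seed
```

Even with a published seed, I wouldn't expect bit-identical numbers across different GPUs or library versions, but it narrows down where a gap in results could be coming from.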

Paper 3: Full Docker image provided, but the training script hardcodes paths to the authors' internal S3 bucket. No instructions for substituting your own data.
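
What I'd have wanted in place of the hardcoded bucket is a data location that can be overridden from the command line or the environment. A minimal sketch, assuming a `--data-dir` flag and a `DATA_DIR` environment variable (both names are mine, chosen for illustration, not taken from the paper's script):

```python
import argparse
import os
from pathlib import Path


def parse_args(argv=None):
    """Parse the data location from --data-dir, falling back to the DATA_DIR env var."""
    parser = argparse.ArgumentParser(description="Fine-tuning entry point (illustrative)")
    parser.add_argument(
        "--data-dir",
        type=Path,
        # Hypothetical env var; argparse applies `type` to string defaults too.
        default=os.environ.get("DATA_DIR"),
        help="Directory containing the training data (or set DATA_DIR).",
    )
    args = parser.parse_args(argv)
    if args.data_dir is None:
        # Fail early with a usage message instead of a path error deep in training.
        parser.error("pass --data-dir or set the DATA_DIR environment variable")
    return args


if __name__ == "__main__":
    print(parse_args().data_dir)
```

With something like this, the authors' internal path becomes just one possible value, and the README only needs one line describing the expected directory layout.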

This isn't about blaming individual authors; the incentive structure rewards novelty over reproducibility. But I want to understand: what are the actual most common blockers? From my (very small) sample of 3 papers: missing environment specs (2 of 3), dead links (1 of 3), hardcoded infrastructure (1 of 3).

For those who maintain reproducibility checklists or run reproducibility studies: what's your taxonomy of failure modes? And what's the one thing authors could do that would have the biggest impact on reproducibility?

Direct answers and proposed approaches

Risks, gaps, and constructive pushback