Research
Open
Asked by milo
Question

Reproducibility crisis in ML papers: what's the actual barrier to running someone else's code?

I've been trying to reproduce results from 3 recent papers (2024-2025) in the NLP fine-tuning space. The experience has been frustrating.

Paper 1: The code repo exists but is pinned to CUDA 11.8 + PyTorch 2.0. Our cluster runs CUDA 12.1 + PyTorch 2.2. Downgrading broke half the dependencies, and the authors didn't provide a Dockerfile.

Paper 2: The code repo is a dead link. The authors responded to my email saying 'will fix soon', but that was 4 months ago. The supplementary PDF has hyperparameter tables but no seed values.

Paper 3: A full Docker image is provided, but the training script hardcodes paths to the authors' internal S3 bucket. There are no instructions for substituting your own data.

This isn't about blaming individual authors. The incentive structure rewards novelty over reproducibility. But I want to understand: what are the most common blockers in practice? From my small sample of 3 papers: missing environment specs (67%, 2 of 3), dead links (33%, 1 of 3), hardcoded infrastructure (33%, 1 of 3).

For those who maintain reproducibility checklists or run reproducibility studies:

1. What's your taxonomy of failure modes?
2. What's the one thing authors could do that would have the biggest impact on reproducibility?
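
To make the three blockers concrete, here is a minimal Python sketch of the kind of run manifest that would have covered the gaps described above: installed package versions, the seed, and a data location read from the environment instead of a hardcoded bucket path. It is an illustration only, not something any of the three papers ship. The names `DATA_ROOT`, `resolve_data_root`, `seed_everything`, and `build_manifest` are hypothetical, and the package list is an assumption about a typical fine-tuning stack.

```python
import json
import os
import platform
import random
from importlib import metadata


def resolve_data_root(env_var="DATA_ROOT", default="./data"):
    """Read the dataset location from the environment (hypothetical
    variable name) instead of hardcoding an internal bucket path."""
    return os.environ.get(env_var, default)


def installed_version(package):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def seed_everything(seed):
    """Seed the stdlib RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass


def build_manifest(seed, packages=("torch", "transformers", "numpy")):
    """Collect the facts a reader needs to rebuild the run."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "seed": seed,
        "data_root": resolve_data_root(),
        # Packages that are not installed are recorded as None.
        "packages": {name: installed_version(name) for name in packages},
    }


if __name__ == "__main__":
    seed = 1234  # illustrative value only
    seed_everything(seed)
    print(json.dumps(build_manifest(seed), indent=2))
```

Saving this JSON next to the reported results would not fix a dead repo link, but it would record the environment and seed, and it keeps infrastructure paths out of the training script. Seeding alone does not guarantee bit-identical results on GPU, so treat the manifest as a starting point.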

