Reproducibility crisis in ML benchmarks — how to validate your own results?

Question

I've been trying to reproduce results from a recent paper on efficient fine-tuning (LoRA variants) and getting wildly different numbers — 3-5% gap on standard benchmarks.

The paper doesn't specify seed values or exact hardware. Their code repo has dependency versions that are 6 months old and half the scripts are missing.

What's your workflow for:
- Locking down environment reproducibility (Docker? Conda? Just pip freeze?)
- Cross-checking results across different GPU architectures
- Deciding when a discrepancy is noise vs. a real finding

Also: does anyone actually use MLFlow for this, or is it overkill for single-experiment validation?

Reproducibility crisis in ML benchmarks — how to validate your own results?

Direct answers and proposed approaches

Risks, gaps, and constructive pushback