Reproducing paper results: what's your framework for tracking environment drift in ML experiments?

Question

We're hitting the reproducibility problem hard. A paper we implemented last month (transformer-based anomaly detection for time series) gives F1=0.82 in our environment, while the paper reports F1=0.89. That is a 0.07 gap with the same dataset and the same hyperparameters.

Suspected culprits:
- CUDA/cuDNN version differences (they used CUDA 11.8, we're on CUDA 12.1)
- PyTorch's nondeterministic operations (we set seeds, but cuDNN benchmark mode can still pick different convolution algorithms from run to run)
- Dataset preprocessing: the paper says 'standard normalization' but doesn't specify whether the statistics were computed per split or globally
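
For the nondeterminism item, this is roughly the set of switches we think matter beyond seeding. It is a minimal sketch, not a guarantee of bitwise reproducibility: `make_deterministic` is a hypothetical helper name, the PyTorch part only runs if `torch` is importable, and some ops may still have no deterministic implementation (hence `warn_only=True`).

```python
import os
import random


def make_deterministic(seed: int = 0) -> dict:
    """Seed the RNGs and turn off the known nondeterminism switches.

    Returns a dict describing what was applied, so it can be logged with the run.
    """
    applied = {"seed": seed, "numpy": False, "torch": False}

    random.seed(seed)

    # Needed by cuBLAS for deterministic matmuls on CUDA >= 10.2.
    # Set it before the first CUDA call, ideally at process start.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

    try:
        import numpy as np
        np.random.seed(seed)
        applied["numpy"] = True
    except ImportError:
        pass

    try:
        import torch
    except ImportError:
        return applied

    torch.manual_seed(seed)  # seeds the CPU and all CUDA generators
    torch.backends.cudnn.benchmark = False      # no per-run autotuning of algorithms
    torch.backends.cudnn.deterministic = True   # restrict cuDNN to deterministic algorithms
    # Warn (instead of raising) when an op has no deterministic implementation.
    torch.use_deterministic_algorithms(True, warn_only=True)
    applied["torch"] = True
    return applied


if __name__ == "__main__":
    print(make_deterministic(1234))
```

Even with all of this, results can still differ across GPU models and CUDA/cuDNN versions, so it narrows the search rather than closing the gap on its own.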

We need a systematic way to:
1. Pin and record the full compute stack (GPU driver, CUDA, PyTorch, Python, OS)
2. Track preprocessing decisions that papers typically omit
3. Run ablation studies to isolate which factor causes the F1 gap
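
To make points 2 and 3 concrete, here is a sketch of what we mean by treating the normalization choice as an explicit, recorded factor and then changing one factor at a time from a baseline. Everything here is illustrative: the factor names and values in `FACTORS`, and the helpers `normalize` and `one_factor_at_a_time`, are our own assumptions, not anything taken from the paper. It uses plain Python lists and assumes non-constant inputs (nonzero standard deviation).

```python
import itertools
import statistics


def normalize(train, test, mode):
    """Z-score both splits under one of three readings of 'standard normalization'."""
    if mode == "train_only":      # stats from train, applied to both splits
        train_stats = test_stats = (statistics.fmean(train), statistics.pstdev(train))
    elif mode == "per_split":     # each split uses its own stats
        train_stats = (statistics.fmean(train), statistics.pstdev(train))
        test_stats = (statistics.fmean(test), statistics.pstdev(test))
    elif mode == "global":        # stats over train + test together
        both = list(train) + list(test)
        train_stats = test_stats = (statistics.fmean(both), statistics.pstdev(both))
    else:
        raise ValueError(f"unknown normalization mode: {mode}")

    def z(values, stats):
        mean, std = stats
        return [(v - mean) / std for v in values]

    return z(train, train_stats), z(test, test_stats)


# Hypothetical factors; the first value of each is our current baseline.
FACTORS = {
    "norm_mode": ["train_only", "per_split", "global"],
    "cudnn_benchmark": [False, True],
    "cuda": ["12.1", "11.8"],
}


def one_factor_at_a_time(factors):
    """Baseline run plus one run per non-baseline value of each factor."""
    baseline = {name: values[0] for name, values in factors.items()}
    runs = [dict(baseline)]
    for name, values in factors.items():
        for value in values[1:]:
            runs.append({**baseline, name: value})
    return runs


if __name__ == "__main__":
    runs = one_factor_at_a_time(FACTORS)
    full_grid = list(itertools.product(*FACTORS.values()))
    print(f"{len(runs)} one-at-a-time runs vs {len(full_grid)} for the full grid")
    for config in runs:
        print(config)
```

One-factor-at-a-time keeps the run count small but cannot detect interactions between factors; the full grid can, at a higher cost. Each run would ideally be repeated over several seeds so the 0.07 gap can be compared against seed-to-seed variance.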

Are teams using MLflow + DVC for this? Or building custom environment-capture scripts? What's the minimal viable setup that actually works for a small research team without a platform engineering budget?

Direct answers and proposed approaches

Risks, gaps, and constructive pushback