Asked by milo
Question

Reproducing paper results: what's your framework for tracking environment drift in ML experiments?

We're hitting the reproducibility problem hard. A paper we implemented last month (transformer-based anomaly detection for time series) gives F1=0.82 in our environment. The paper reports F1=0.89. Same dataset, same hyperparameters.

Suspected culprits:
- CUDA/cuDNN version differences (they used 11.8, we're on 12.1)
- PyTorch's non-deterministic operations (we set seeds but cuDNN benchmark mode is still nondeterministic)
- Dataset preprocessing: their paper says 'standard normalization' but doesn't specify whether they computed stats per-split or globally

We need a systematic way to:
1. Pin and record the full compute stack (GPU driver, CUDA, PyTorch, Python, OS)
2. Track preprocessing decisions that papers typically omit
3. Run ablation studies to isolate which factor causes the F1 gap

Are teams using MLflow + DVC for this? Or building custom environment-capture scripts? What's the minimal viable setup that actually works for a small research team without a platform engineering budget?
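
For concreteness, here is a rough sketch of the kind of custom environment-capture script we have in mind for point 1. It is only an illustration under some assumptions: the function name `capture_environment` and the dictionary keys are made up for this example, PyTorch is treated as optional so the script still runs where it is not installed, and the GPU driver version is read from `nvidia-smi` only if that tool is on the PATH.

```python
import json
import platform
import shutil
import subprocess
import sys


def capture_environment():
    """Collect the compute-stack versions we would want stored next to every run.

    Hypothetical helper for illustration; the key names are our own choice.
    """
    env = {
        "python": sys.version.split()[0],
        "os": platform.platform(),
        "machine": platform.machine(),
    }

    # PyTorch is optional here so the script still runs on machines without it.
    try:
        import torch
    except ImportError:
        env["torch"] = None
    else:
        env["torch"] = torch.__version__
        env["cuda"] = torch.version.cuda  # None on CPU-only builds
        env["cudnn"] = torch.backends.cudnn.version()  # None if cuDNN is unavailable
        # Record the flags that affect run-to-run determinism.
        env["cudnn_benchmark"] = torch.backends.cudnn.benchmark
        env["cudnn_deterministic"] = torch.backends.cudnn.deterministic
        env["deterministic_algorithms"] = torch.are_deterministic_algorithms_enabled()

    # The GPU driver version comes from nvidia-smi, if it is on PATH.
    env["gpu_driver"] = None
    if shutil.which("nvidia-smi"):
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version,name", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=False,
        )
        if result.returncode == 0:
            env["gpu_driver"] = result.stdout.strip().splitlines()
    return env


if __name__ == "__main__":
    # Print a JSON snapshot that can be saved alongside a run's metrics.
    print(json.dumps(capture_environment(), indent=2, sort_keys=True))
```

The idea would be to write this JSON next to each run's metrics so that two runs with different F1 can be diffed on their stack first. It does not cover preprocessing decisions (point 2), which would presumably need to be recorded as explicit config values rather than detected.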
