Replication crisis in applied ML papers — how do you separate signal from benchmark gaming?

Question

I've been reading the latest wave of papers claiming SOTA on MMLU, GSM8K, and HumanEval. The deltas are getting smaller (improvements of 0.3-0.8%) while the methodological complexity keeps increasing. Several of our internal reproduction attempts failed because the 'minor implementation detail' in the appendix was actually doing the heavy lifting.
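For a sense of scale, here is a rough sketch of the noise check we have in mind when we call these deltas small. It assumes the reported improvements are absolute accuracy points, treats the two runs as independent binomial samples (a paired test on per-example predictions would be sharper but needs per-item outputs), and uses the commonly cited test-set sizes. The function names and the 80% baseline are illustrative assumptions, not taken from any specific paper.

```python
import math

# Commonly cited test-set sizes (assumption: verify against the exact
# split and subset a given paper evaluates on).
BENCHMARK_SIZES = {"MMLU": 14042, "GSM8K": 1319, "HumanEval": 164}


def accuracy_std_error(accuracy: float, n_items: int) -> float:
    """Binomial standard error of an accuracy measured on n_items examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)


def delta_within_noise(baseline_acc: float, delta: float, n_items: int,
                       z: float = 1.96) -> bool:
    """Return True if `delta` is smaller than the ~95% margin for the
    difference of two independent accuracies near `baseline_acc`.

    Simplification: both runs are treated as independent samples of size
    n_items, which ignores the pairing you get from a shared test set.
    """
    se_diff = math.sqrt(2.0) * accuracy_std_error(baseline_acc, n_items)
    return abs(delta) < z * se_diff


if __name__ == "__main__":
    baseline = 0.80  # hypothetical baseline accuracy
    for name, n in BENCHMARK_SIZES.items():
        margin = 1.96 * math.sqrt(2.0) * accuracy_std_error(baseline, n)
        verdict = delta_within_noise(baseline, 0.008, n)
        print(f"{name}: margin={margin:.4f}, 0.8-point delta within noise: {verdict}")
```

Under these assumptions, a 0.8-point gain over an 80% baseline falls inside the margin on all three benchmarks, so single-run deltas of that size say little without multiple seeds or a paired analysis.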

For research engineers who evaluate papers before adopting techniques: what's your reproducibility triage? Do you look at code release quality first, or run a quick ablation to sanity-check the reported numbers? We've started requiring a 2-week 'replication sprint' before any paper-inspired change hits our main branch, but it's eating engineering bandwidth.

Is there a middle ground between blind adoption and full reproduction? Which signals do you trust most: open weights, transparent training data, or independent benchmark results (such as the Open LLM Leaderboard) rather than the paper's own numbers?

Direct answers and proposed approaches

Risks, gaps, and constructive pushback