Replication crisis in applied ML papers: how do you separate signal from benchmark gaming?

Question

Reading through recent applied ML papers, I keep seeing the same pattern: new architectures claim 2-5% improvements on standard benchmarks (MMLU, HELM, BIG-bench), but the gains vanish when the models are tested on domain-specific evaluation sets. The gap between leaderboard performance and real-world utility seems to be widening.

For researchers and practitioners who have tried to replicate published results: what red flags do you look for? Are there specific evaluation methodologies (hold-out domain tests, adversarial evaluation, human-in-the-loop assessment) that consistently separate genuine advances from benchmark overfitting? I'm also interested in whether anyone maintains a living replication log for recent papers.
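To make the hold-out domain test concrete, here is a minimal sketch of the kind of check I have in mind, not an established protocol. It assumes you have per-example correctness (0 or 1) for a baseline and a candidate model on the same examples, and it uses a paired bootstrap to ask whether the accuracy gain is distinguishable from resampling noise. The function names `paired_bootstrap_gain` and `gain_is_supported`, the 95% interval, and the 2000 resamples are my own illustrative choices.

```python
import random


def paired_bootstrap_gain(baseline_correct, candidate_correct,
                          n_resamples=2000, seed=0):
    """Return (observed_gain, ci_low, ci_high) for candidate minus baseline accuracy.

    Both inputs are sequences of 0/1 correctness flags over the SAME examples,
    in the same order. The interval is an approximate 95% percentile interval.
    """
    if len(baseline_correct) != len(candidate_correct):
        raise ValueError("both models must be scored on the same examples")
    n = len(baseline_correct)
    if n == 0:
        raise ValueError("need at least one example")

    rng = random.Random(seed)
    # Per-example difference: +1 candidate-only correct, -1 baseline-only correct.
    diffs = [c - b for b, c in zip(baseline_correct, candidate_correct)]
    observed = sum(diffs) / n

    # Resample examples with replacement and recompute the mean gain each time.
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    ci_low = means[int(0.025 * n_resamples)]
    ci_high = means[min(n_resamples - 1, int(0.975 * n_resamples))]
    return observed, ci_low, ci_high


def gain_is_supported(baseline_correct, candidate_correct, **kwargs):
    """True if the lower end of the bootstrap interval is above zero."""
    _, ci_low, _ = paired_bootstrap_gain(baseline_correct, candidate_correct, **kwargs)
    return ci_low > 0
```

The idea would be to run this twice, once on the public benchmark and once on a held-out domain set. A gain that is supported on the benchmark but not on the domain set is the pattern I'm describing above, though a small domain set can also fail the check simply for lack of statistical power.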

Direct answers and proposed approaches

Risks, gaps, and constructive pushback