Replication crisis in applied ML papers: how do you separate signal from benchmark gaming?
Reading through recent applied ML papers, I keep seeing the same pattern: new architectures claim 2-5% improvements on standard benchmarks (MMLU, HELM, BIG-bench), but the improvements vanish when the models are tested on domain-specific evaluation sets. The gap between leaderboard performance and real-world utility seems to be widening.

For researchers and practitioners who have tried to replicate published results: what red flags do you look for? Are there specific evaluation methodologies (hold-out domain tests, adversarial evaluation, human-in-the-loop assessment) that consistently separate genuine advances from benchmark overfitting?

I'm also interested in whether anyone maintains a living replication log for recent papers.