Benchmark contamination detection — how to spot leaked eval data
We've been running internal evals on 7B-70B models and noticed suspicious score inflation on GSM8K and MMLU subsets compared to the original papers. The gap widened after the 2024 Q2 model releases, likely due to training data contamination.

Current detection approach:

1. Hold-out test sets (manually curated, never published)
2. Perturbation tests (change numbers/names in GSM8K, re-run)
3. Canonicalization analysis (check if model memorized exact answer format)

The perturbation approach works but is labor-intensive. We're exploring automated paraphrase generation to create adversarial variants, but that introduces its own evaluation problem: is the paraphrase still measuring the same reasoning capability?

What contamination detection methods have held up in your evaluation pipelines? Especially interested in approaches that don't require manual curation of hold-out sets.
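
For context on the perturbation tests, here is a minimal sketch of how the number-swapping part might be automated. It assumes each problem has been rewritten once by hand as a template with named numeric slots and a solver function, so the ground-truth answer is recomputed for every variant. The names `TemplatedProblem`, `make_variants`, and `accuracy_gap` are hypothetical, and the example problem is made up rather than taken from GSM8K. It does not cover name changes or free-form paraphrase.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class TemplatedProblem:
    """A GSM8K-style problem with its numbers lifted into named slots (hypothetical structure)."""
    template: str                            # question text with {name} placeholders
    ranges: Dict[str, Tuple[int, int]]       # inclusive sampling range for each slot
    solve: Callable[[Dict[str, int]], int]   # recomputes the answer from slot values


def make_variants(problem: TemplatedProblem, n: int, seed: int = 0) -> List[Tuple[str, int]]:
    """Sample n (question, answer) pairs with fresh numbers; seeded for reproducibility."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        values = {name: rng.randint(lo, hi) for name, (lo, hi) in problem.ranges.items()}
        variants.append((problem.template.format(**values), problem.solve(values)))
    return variants


def accuracy_gap(original_correct: List[bool], perturbed_correct: List[bool]) -> float:
    """Accuracy on original items minus accuracy on perturbed items.

    A large positive gap is a hint of memorization, not proof of it.
    """
    orig = sum(original_correct) / len(original_correct)
    pert = sum(perturbed_correct) / len(perturbed_correct)
    return orig - pert


# Made-up example problem for illustration only.
example = TemplatedProblem(
    template=(
        "A shop sells {a} apples per day at {p} dollars each. "
        "How many dollars does it earn in {d} days?"
    ),
    ranges={"a": (2, 50), "p": (1, 9), "d": (2, 14)},
    solve=lambda v: v["a"] * v["p"] * v["d"],
)

for question, answer in make_variants(example, n=3, seed=42):
    print(question, "->", answer)
```

Because the solver recomputes the answer, the variants stay tied to the same reasoning steps, which sidesteps part of the "is it still the same capability" question for numeric changes. The sampling ranges still need care: numbers that make the arithmetic much harder or easier would confound the gap.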