Benchmark contamination detection — how to spot leaked eval data

Question

We've been running internal evals on 7B-70B models and noticed suspicious score inflation on GSM8K and MMLU subsets relative to the scores reported in the original papers. The gap widened after the 2024 Q2 model releases, which we suspect is due to training data contamination.

Current detection approach:
1. Hold-out test sets (manually curated, never published)
2. Perturbation tests (change numbers/names in GSM8K, re-run)
3. Canonicalization analysis (check whether the model has memorized the exact answer format)
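
To make item 2 concrete, here is a minimal sketch of a perturbation generator for a GSM8K-style item. It assumes each problem has been rewritten by hand as a template with an answer function, so the gold answer can be recomputed whenever the numbers change. The names `ProblemTemplate`, `make_variants`, and `NAMES`, along with the sample problem, are hypothetical illustrations and not part of GSM8K or any library.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class ProblemTemplate:
    """A hand-templated word problem (hypothetical structure)."""
    text: str                             # question text with {placeholders}
    ranges: Dict[str, Tuple[int, int]]    # inclusive value range per numeric slot
    answer: Callable[..., int]            # recomputes the gold answer from the slots


# Illustrative pool of replacement names (assumption, any names would do).
NAMES = ["Ava", "Noor", "Kenji", "Lucia"]


def make_variants(tpl: ProblemTemplate, n: int, seed: int = 0) -> List[Tuple[str, int]]:
    """Return n (question, gold_answer) pairs with resampled numbers and names."""
    rng = random.Random(seed)  # seeded so an eval run is reproducible
    variants = []
    for _ in range(n):
        values = {k: rng.randint(lo, hi) for k, (lo, hi) in tpl.ranges.items()}
        name = rng.choice(NAMES)
        question = tpl.text.format(name=name, **values)
        variants.append((question, tpl.answer(**values)))
    return variants


template = ProblemTemplate(
    text=(
        "{name} buys {boxes} boxes of pencils with {per_box} pencils in each box "
        "and gives away {given} pencils. How many pencils does {name} have left?"
    ),
    # Ranges are chosen so the answer is always positive.
    ranges={"boxes": (2, 9), "per_box": (6, 12), "given": (1, 10)},
    answer=lambda boxes, per_box, given: boxes * per_box - given,
)

if __name__ == "__main__":
    for question, gold in make_variants(template, 3, seed=1):
        print(gold, "|", question)
```

The templating step is still manual, but once a problem is templated, fresh variants cost nothing to generate and the gold answer stays correct by construction.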

The perturbation approach works but is labor-intensive. We're exploring automated paraphrase generation to create adversarial variants, but that introduces its own evaluation problem: is the paraphrase still measuring the same reasoning capability?
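
For reading the results of a perturbation or paraphrase run, one possible sketch is to compare accuracy on the original items against accuracy on the variants and check whether the drop is larger than sampling noise. The function names are hypothetical, and the pooled two-proportion z-statistic is an assumption chosen for simplicity: it treats the two sets as independent samples, which ignores the pairing between an item and its variant. A large gap is also consistent with variants that are simply harder, so it does not settle the validity question above.

```python
import math
from typing import Sequence, Tuple


def accuracy(flags: Sequence[bool]) -> float:
    """Fraction of items answered correctly."""
    return sum(flags) / len(flags)


def accuracy_gap(original: Sequence[bool], perturbed: Sequence[bool]) -> Tuple[float, float]:
    """Return (gap, z), where gap = accuracy(original) - accuracy(perturbed).

    z is a pooled two-proportion z-statistic. It is only a rough signal,
    because it treats the two sets as independent samples.
    """
    n1, n2 = len(original), len(perturbed)
    p1, p2 = accuracy(original), accuracy(perturbed)
    pooled = (sum(original) + sum(perturbed)) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se if se > 0 else 0.0
    return p1 - p2, z


if __name__ == "__main__":
    # Made-up per-item correctness flags, for illustration only.
    original_flags = [True] * 90 + [False] * 10
    perturbed_flags = [True] * 70 + [False] * 30
    gap, z = accuracy_gap(original_flags, perturbed_flags)
    print(f"gap={gap:.3f} z={z:.2f}")
```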

What contamination detection methods have held up in your evaluation pipelines? Especially interested in approaches that don't require manual curation of hold-out sets.


Direct answers and proposed approaches

Risks, gaps, and constructive pushback