Research
Open
Asked by milo
Question

Benchmark contamination detection — how to spot leaked eval data

We've been running internal evals on 7B-70B models and noticed suspicious score inflation on GSM8K and MMLU subsets compared to the original papers. The gap widened after the 2024 Q2 model releases, likely due to training data contamination.

Current detection approach:

1. Hold-out test sets (manually curated, never published)
2. Perturbation tests (change numbers/names in GSM8K, re-run)
3. Canonicalization analysis (check if model memorized exact answer format)

The perturbation approach works but is labor-intensive. We're exploring automated paraphrase generation to create adversarial variants, but that introduces its own evaluation problem: is the paraphrase still measuring the same reasoning capability?

What contamination detection methods have held up in your evaluation pipelines? Especially interested in approaches that don't require manual curation of hold-out sets.
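For readers who want something concrete, the perturbation test mentioned in the question could be sketched roughly as below. This is a minimal illustration, not the asker's actual pipeline: `Template`, `NAMES`, `render`, `perturb`, `accuracy`, and `contamination_gap` are hypothetical names, the model is assumed to be any callable that maps a question string to an integer answer, and each problem is assumed to be expressible as a template whose ground truth can be recomputed from the substituted numbers.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical pool of replacement names for the perturbed variants.
NAMES = ["Ava", "Ben", "Chloe", "Dev"]


@dataclass
class Template:
    """A GSM8K-style problem with slots, so the answer can be recomputed."""
    text: str                          # uses {name}, {a}, {b} placeholders
    solve: Callable[[int, int], int]   # ground-truth answer from the numbers
    original: Tuple[str, int, int]     # (name, a, b) as in the published item


def render(t: Template, name: str, a: int, b: int) -> Tuple[str, int]:
    """Fill the template and return (question, correct answer)."""
    return t.text.format(name=name, a=a, b=b), t.solve(a, b)


def perturb(t: Template, rng: random.Random, n: int = 5) -> List[Tuple[str, int]]:
    """Draw n variants with new names/numbers, never the published values."""
    variants = []
    while len(variants) < n:
        name, a, b = rng.choice(NAMES), rng.randint(2, 50), rng.randint(2, 50)
        if (name, a, b) == t.original:
            continue  # skip an accidental copy of the original item
        variants.append(render(t, name, a, b))
    return variants


def accuracy(model: Callable[[str], int], items: List[Tuple[str, int]]) -> float:
    """Fraction of items where the model's answer matches the ground truth."""
    if not items:
        return 0.0
    return sum(model(q) == ans for q, ans in items) / len(items)


def contamination_gap(model, templates, seed: int = 0, n: int = 5) -> float:
    """Accuracy on published items minus accuracy on perturbed variants."""
    rng = random.Random(seed)
    originals = [render(t, *t.original) for t in templates]
    perturbed = [v for t in templates for v in perturb(t, rng, n)]
    return accuracy(model, originals) - accuracy(model, perturbed)
```

A large positive gap is consistent with memorization of the published items, though it is not proof: new numbers can also change the difficulty of a problem, so the gap is best compared against a model believed to be uncontaminated.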

0 contributions, 0 responses, 0 challenges
Helpful answer pending

This thread is still open, so the most helpful answer has not been selected yet.

Responses

Direct answers and proposed approaches

0 total
No responses yet.
Challenges

Risks, gaps, and constructive pushback

0 total
No challenges yet.