Benchmark contamination in LLM evals — how strict is your data hygiene?

Question

We're building an internal evaluation harness for fine-tuned models. The obvious contamination vectors are well understood (MMLU, GSM8K, or HumanEval items leaking into training data), but I'm also seeing subtler cases:

1. Stack Overflow answers that appear both in our RAG corpus and, via paraphrase, in the HumanEval test set
2. Model-generated synthetic data that accidentally mirrors benchmark prompt templates
3. Cross-contamination between our hold-out set and public datasets used for ablation studies

What's your organization's threshold for 'clean' evaluation data? Do you use exact-match deduplication only, or semantic similarity filters (MinHash, embeddings) as well? How do you handle the trade-off between evaluation rigor and dataset size?

Particularly interested in approaches that don't require manual curation of 10k+ test items.
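
To make the question concrete, here is a minimal sketch of the kind of automated filter I mean: exact-match deduplication after normalization, followed by a MinHash estimate of word 3-gram overlap. Everything in it is an assumption for illustration, not our actual pipeline: the function names (normalize, shingles, flag_contaminated), the 3-gram size, the 64 hash seeds, and the 0.5 threshold are all hypothetical, and seeded SHA-1 stands in for the random permutations a real MinHash library would use.

```python
from __future__ import annotations

import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def shingles(text: str, n: int = 3) -> set[str]:
    """Return the set of word n-grams of the normalized text."""
    words = normalize(text).split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity; defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def minhash_signature(items: set[str], num_perm: int = 64) -> list[int]:
    """One minimum hash per seed; returns an empty list for an empty set."""
    if not items:
        return []
    signature = []
    for seed in range(num_perm):
        prefix = str(seed).encode() + b":"
        signature.append(min(
            int.from_bytes(hashlib.sha1(prefix + s.encode()).digest()[:8], "big")
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    if not sig_a or not sig_b:
        return 0.0
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def flag_contaminated(test_items, corpus_docs, threshold: float = 0.5):
    """Return (index, reason) pairs for test items that overlap the corpus."""
    corpus_norm = {normalize(d) for d in corpus_docs}
    corpus_sigs = [minhash_signature(shingles(d)) for d in corpus_docs]
    flagged = []
    for i, item in enumerate(test_items):
        if normalize(item) in corpus_norm:
            flagged.append((i, "exact"))  # identical after normalization
            continue
        sig = minhash_signature(shingles(item))
        if any(estimated_jaccard(sig, c) >= threshold for c in corpus_sigs):
            flagged.append((i, "near"))  # high estimated n-gram overlap
    return flagged
```

This compares every test item against every corpus document, which is fine for a few thousand items but would need an LSH index or similar at corpus scale. It also only catches lexical overlap; a true paraphrase with little shared wording (case 1 above) would pass this filter, which is where I assume embedding similarity would have to come in.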

Direct answers and proposed approaches

Risks, gaps, and constructive pushback