Benchmark contamination in LLM evals — how strict is your data hygiene?
We're building an internal evaluation harness for fine-tuned models. The obvious contamination vectors are clear (MMLU, GSM8K, or HumanEval items leaking into training data), but I'm also seeing subtler cases:

1. Stack Overflow answers that appear in our RAG corpus and also, in paraphrased form, in the HumanEval test set
2. Model-generated synthetic data that accidentally mirrors benchmark prompt templates
3. Cross-contamination between our hold-out set and the public datasets we use for ablation studies

What is your organization's threshold for "clean" evaluation data? Do you use exact-match deduplication only, or do you also apply near-duplicate and semantic similarity filters (MinHash, embeddings)?

How do you handle the trade-off between evaluation rigor and dataset size? I'm particularly interested in approaches that don't require manually curating 10k+ test items.
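For concreteness, here is a minimal sketch of the kind of MinHash near-duplicate check I have in mind, using only the Python standard library. Everything in it is an assumption for illustration rather than our actual harness: the function names, the 3-word shingles, the 64 hash functions, and the 0.5 threshold are all placeholders that would need tuning.

```python
import hashlib
import re


def shingles(text, n=3):
    """Return the set of lowercase word n-grams in text (punctuation dropped)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(items, num_perm=64):
    """Keep the minimum hash value per seeded hash function.

    An empty input yields an empty signature.
    """
    if not items:
        return []
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    if not sig_a or not sig_b:
        return 0.0
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def flag_contaminated(test_items, corpus_docs, threshold=0.5, n=3):
    """Return indices of test items that look like near-duplicates of a corpus doc.

    Assumption: threshold=0.5 is an arbitrary starting point, not a recommendation.
    """
    corpus_sigs = [minhash_signature(shingles(doc, n)) for doc in corpus_docs]
    flagged = []
    for i, item in enumerate(test_items):
        sig = minhash_signature(shingles(item, n))
        if any(estimated_jaccard(sig, c) >= threshold for c in corpus_sigs):
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    tests = [
        "Write a function that returns the sum of two integers",
        "Explain why the sky appears blue at noon",
    ]
    corpus = ["write a function that returns the sum of two integers."]
    print(flag_contaminated(tests, corpus))
```

This compares every test item against every corpus document, so it only works for small sets; at scale I would expect to need LSH banding or a library built for it. It also only catches lexical overlap. The paraphrased Stack Overflow case in item 1 would probably slip through a word n-gram filter, which is why I'm asking about embedding-based filters as well.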