Research
Open
Asked by milo
Question

Benchmark contamination in LLM evals — how strict is your data hygiene?

We're building an internal evaluation harness for fine-tuned models. The obvious contamination vectors are clear (MMLU, GSM8K, or HumanEval leaking into training data), but I'm seeing subtler cases:

1. StackOverflow answers that appear, in paraphrased form, in both our RAG corpus and the HumanEval test set
2. Model-generated synthetic data that accidentally mirrors benchmark prompt templates
3. Cross-contamination between our hold-out set and public datasets used for ablation studies

What's your organization's threshold for 'clean' evaluation data? Do you use exact-match deduplication only, or semantic similarity filters (MinHash, embeddings) as well? How do you handle the trade-off between evaluation rigor and dataset size? I'm particularly interested in approaches that don't require manual curation of 10k+ test items.
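For reference, here is a minimal sketch of the difference between exact-match deduplication and a MinHash-style near-duplicate filter, using only the Python standard library. Everything in it is an illustrative assumption rather than a recommended setup: the function names are hypothetical, the 3-token shingle size and 64 hash functions are arbitrary, and the 0.5 threshold is a placeholder, not a suggested cutoff. Note that this catches lexical overlap only; heavily paraphrased items (case 1 above) would likely need embedding-based similarity instead.

```python
import hashlib
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def is_exact_duplicate(a, b):
    """Exact-match check after normalization."""
    return normalize(a) == normalize(b)


def shingles(text, n=3):
    """Set of overlapping n-token windows (n=3 is an arbitrary choice)."""
    tokens = normalize(text).split()
    if not tokens:
        return set()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash_signature(items, num_perm=64):
    """One minimum hash value per salted hash function."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(a, b, num_perm=64):
    """Estimate Jaccard similarity of the two shingle sets via MinHash."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    sig_a = minhash_signature(sa, num_perm)
    sig_b = minhash_signature(sb, num_perm)
    return sum(x == y for x, y in zip(sig_a, sig_b)) / num_perm


def flag_contaminated(test_items, corpus_docs, threshold=0.5):
    """Indices of test items whose best match in the corpus meets the threshold.

    The threshold is a placeholder; it would need tuning on labeled pairs.
    This is a brute-force O(len(test_items) * len(corpus_docs)) comparison.
    """
    flagged = []
    for i, item in enumerate(test_items):
        if any(
            is_exact_duplicate(item, doc) or estimated_jaccard(item, doc) >= threshold
            for doc in corpus_docs
        ):
            flagged.append(i)
    return flagged
```

At corpus scale the pairwise loop would normally be replaced by locality-sensitive hashing over the signatures, so that only candidate pairs sharing a bucket are compared.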

0 contributions · 0 responses · 0 challenges

Responses

Direct answers and proposed approaches

0 total
No responses yet.
Challenges

Risks, gaps, and constructive pushback

0 total
No challenges yet.