Benchmark contamination in LLM evals: how do you detect when test data leaked into training corpora?

Question

We're running an internal eval pipeline comparing several open-weight models on our domain-specific QA benchmark. Suspected issue: some models score suspiciously high on our older benchmark sets (written before 2024, so plausibly inside their training data) but drop 15-20 points on our freshly written questions.

Current detection approach:
- Canary strings embedded in our test set
- N-gram overlap analysis against known training corpora (Common Crawl subsets)
- Manual inspection of model outputs for verbatim recall
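
For concreteness, here is a rough sketch of what the canary and n-gram checks look like in code. It is an illustration under stated assumptions, not our actual pipeline: the function names, the 8-token window, and the simple regex tokenization are all choices made for this example, and `corpus_docs` stands in for whatever corpus sample is available.

```python
import re
from typing import Iterable, List, Set, Tuple


def tokenize(text: str) -> List[str]:
    # Lowercase and keep alphanumeric runs only (assumed normalization;
    # a real pipeline might use the model's own tokenizer instead).
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    # All contiguous n-token windows; empty if the text is shorter than n.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(question: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    # Fraction of the question's n-grams that appear in at least one corpus doc.
    q_grams = ngrams(tokenize(question), n)
    if not q_grams:
        return 0.0
    seen: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        seen |= q_grams & ngrams(tokenize(doc), n)
    return len(seen) / len(q_grams)


def contains_canary(text: str, canary: str) -> bool:
    # Exact substring match on the canary string (hypothetical value supplied by caller).
    return canary in text
```

A high `overlap_fraction` only flags a question for manual review; paraphrased or translated leaks would not show up in an exact n-gram match.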

Questions for the community:
1. What contamination detection methods do you actually trust in practice?
2. Do you use held-out 'probe' questions that are deliberately written after the model's training cutoff?
3. How do you benchmark models when your domain data is niche and small (few hundred high-quality QAs)?
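
On question 3, part of the concern is whether a 15-20 point gap is even distinguishable from noise at a few hundred questions. One way to put error bars on it is a percentile bootstrap over per-question correctness. This is a minimal sketch, assuming each set is a list of 0/1 scores; `bootstrap_gap_ci` and its parameters are hypothetical names for this example.

```python
import random
from typing import List, Tuple


def bootstrap_gap_ci(
    old_correct: List[int],
    new_correct: List[int],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    # Percentile bootstrap interval for accuracy(old set) - accuracy(new set).
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_boot):
        # Resample each set with replacement, keeping its original size.
        old_s = rng.choices(old_correct, k=len(old_correct))
        new_s = rng.choices(new_correct, k=len(new_correct))
        gaps.append(sum(old_s) / len(old_s) - sum(new_s) / len(new_s))
    gaps.sort()
    lo = gaps[int((alpha / 2) * n_boot)]
    hi = gaps[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

If the interval comfortably excludes zero, the gap is at least not a sampling artifact, though that alone does not separate contamination from the new questions simply being harder.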

Not looking for paper citations; I'm interested in practical pipeline design.

Direct answers and proposed approaches

Risks, gaps, and constructive pushback