Benchmark contamination in LLM evals: how do you detect when test data leaked into training corpora?
We're running an internal eval pipeline comparing several open-weight models on our domain-specific QA benchmark. Suspected issue: some models show suspiciously high performance on older benchmark sets (pre-2024 training cutoff) but drop 15-20 points on our freshly written questions.

Current detection approach:
- Canary strings embedded in our test set
- N-gram overlap analysis against known training corpora (Common Crawl subsets)
- Manual inspection of model outputs for verbatim recall

Questions for the community:
1. What contamination detection methods do you actually trust in practice?
2. Do you use held-out 'probe' questions that are deliberately written after the model's training cutoff?
3. How do you benchmark models when your domain data is niche and small (few hundred high-quality QAs)?

Not looking for paper citations; interested in practical pipeline design.
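
For concreteness, here is a minimal sketch of the kind of canary and n-gram overlap checks listed above. It is illustrative only, not our actual pipeline code: the function names, the `CANARY` value, the regex word tokenizer, and the window size `n` are all assumptions, and a real run against Common Crawl subsets would need a disk-backed or hashed index rather than an in-memory set.

```python
import re
from typing import Iterable, List, Set, Tuple

# Hypothetical placeholder; a real canary would be a unique random string
# embedded in every published copy of the test set.
CANARY = "EVAL-CANARY-00000000-0000-0000-0000-000000000000"

NGram = Tuple[str, ...]


def tokenize(text: str) -> List[str]:
    """Lowercase and split into word tokens (assumed normalization)."""
    return re.findall(r"\w+", text.lower())


def ngrams(tokens: List[str], n: int) -> Set[NGram]:
    """Return the set of contiguous n-token windows."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_corpus_index(docs: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-gram seen in the reference corpus."""
    index: Set[NGram] = set()
    for doc in docs:
        index |= ngrams(tokenize(doc), n)
    return index


def overlap_ratio(question: str, corpus_index: Set[NGram], n: int) -> float:
    """Fraction of the question's n-grams that also occur in the corpus."""
    q = ngrams(tokenize(question), n)
    if not q:
        return 0.0  # text shorter than n tokens: nothing to compare
    return len(q & corpus_index) / len(q)


def contains_canary(text: str, canary: str = CANARY) -> bool:
    """True if the canary string appears verbatim in the text."""
    return canary in text


# Toy data, invented for illustration.
corpus = ["The mitochondria is the powerhouse of the cell and produces ATP."]
index = build_corpus_index(corpus, n=5)

leaked = "The mitochondria is the powerhouse of the cell"
fresh = "Which organelle handles protein folding in yeast?"

print(overlap_ratio(leaked, index, n=5))
print(overlap_ratio(fresh, index, n=5))
```

A ratio near 1.0 means the question text appears almost verbatim in the reference corpus; paraphrased leaks will score low, which is one reason this check alone is not sufficient. The same `contains_canary` helper can be run over model outputs as well as corpus documents.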