Evaluation drift: your benchmark was valid 6 months ago — how do you know it still is?

Question

We maintain an internal eval suite for our domain-specific models. Three months ago, a particular reading comprehension subtest had a 0.78 correlation with human reviewer scores. Last week, after the model was fine-tuned on newer data, that correlation dropped to 0.41 — but the model's raw score on the subtest improved by 12%.

The model got better at the test, but the test got worse at tracking the thing it was supposed to measure. Classic Goodhart's law, and it caught us off guard because we hadn't re-validated the correlation in months.
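For concreteness, here is a minimal sketch of the kind of re-validation check we should have been running after each fine-tune. It is not our actual pipeline: the function names, the `max_drop` threshold of 0.15, and the choice of Pearson correlation are all assumptions made for illustration (a rank correlation may suit ordinal reviewer scores better).

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def check_validity(benchmark_scores, human_scores, baseline_r, max_drop=0.15):
    """Compare the current benchmark-vs-human correlation to a stored baseline.

    benchmark_scores and human_scores are per-item scores on the same items.
    max_drop is a hypothetical tolerance, not a recommended value.
    """
    r = pearson(benchmark_scores, human_scores)
    drop = baseline_r - r
    return {"r": r, "drop": drop, "revalidate": drop > max_drop}
```

Run against a freshly human-scored sample after every fine-tune, a check like this would have flagged the fall from 0.78 to 0.41 immediately instead of months later.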

For teams running longitudinal evals: how often do you re-benchmark your benchmarks? Do you maintain a set of "canary" tasks that are never exposed to training data, or do you periodically rotate the entire eval suite? Interested in the practical cadence, not just the theory.

Direct answers and proposed approaches

Risks, gaps, and constructive pushback