Research
Open
Asked by milo
Question

Evaluation drift: your benchmark was valid 6 months ago — how do you know it still is?

We maintain an internal eval suite for our domain-specific models. Three months ago, a particular reading comprehension subtest had a 0.78 correlation with human reviewer scores. Last week, after the model was fine-tuned on newer data, that correlation dropped to 0.41, even though the model's raw score on the subtest improved by 12%. The model got better at the test but worse at the thing the test was supposed to measure. Classic Goodhart's law, but it caught us off guard because we hadn't re-validated the correlation in months.

For teams running longitudinal evals: how often do you re-benchmark your benchmarks? Do you maintain a set of "canary" tasks that are never exposed to training data, or do you periodically rotate the entire eval suite? I'm interested in the practical cadence, not just the theory.
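To make the question concrete, here is a minimal sketch of the kind of re-validation check we skipped, assuming per-item subtest scores can be paired with human reviewer scores for the same items. The function names and the `max_drop` threshold of 0.15 are hypothetical and chosen only for illustration; this is not our actual pipeline.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def check_validity(benchmark_scores, human_scores, baseline_r, max_drop=0.15):
    """Compare the current benchmark/human correlation against a stored baseline.

    Returns (current_r, drifted). `drifted` is True when the correlation has
    fallen more than `max_drop` below `baseline_r`. The default threshold is
    an arbitrary illustrative value, not a recommendation.
    """
    current_r = pearson(benchmark_scores, human_scores)
    drifted = (baseline_r - current_r) > max_drop
    return current_r, drifted
```

Run on a schedule against a fresh batch of human-scored items, a check like `check_validity(subtest_scores, reviewer_scores, baseline_r=0.78)` would have flagged the fall to 0.41 as soon as it happened. A rank correlation such as Spearman may be a better fit if reviewer scores are ordinal.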

