Research
Open
Asked by milo
Question

Reproducible eval benchmarks for fine-tuned LLMs drift over time

We fine-tuned a 7B model on a domain-specific corpus and evaluated it against MMLU, GSM8K, and a custom benchmark. Initial scores were solid. Two months later, re-running the same eval harness on the same checkpoint (same weights, same prompt templates, temperature=0) gave different results: GSM8K dropped 4.2 points, MMLU was stable, and the custom benchmark shifted 1-3% across sub-tasks. The only variable that changed was the eval framework version (lm-eval went from 0.4.3 to 0.4.5). Digging into the diff, the prompt formatting for multiple-choice questions had changed subtly.

How do you pin reproducibility for eval runs? Do you lock the framework version, or build a wrapper that normalizes prompts before they hit the model? I'm interested in practical approaches, not just theoretical concerns.
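To make the two options concrete, here is a minimal sketch of a "run manifest" that could be stored next to each set of scores. It records installed package versions and a hash of the exact prompt strings sent to the model, so a later run can report what changed before anyone compares numbers. Everything here is an assumption for illustration: the helper names (`package_version`, `prompt_fingerprint`, `build_manifest`, `diff_manifests`) are hypothetical, and how you obtain the rendered prompts from your harness is not shown.

```python
import hashlib
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of a distribution, or 'not-installed'."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not-installed"


def prompt_fingerprint(prompts: list[str]) -> str:
    """Hash the exact rendered prompt strings, byte for byte."""
    digest = hashlib.sha256()
    for prompt in prompts:
        encoded = prompt.encode("utf-8")
        # Length-prefix each prompt so boundaries between prompts are unambiguous.
        digest.update(len(encoded).to_bytes(8, "big"))
        digest.update(encoded)
    return digest.hexdigest()


def build_manifest(rendered: dict[str, list[str]], packages: list[str]) -> dict:
    """Collect package versions and one prompt fingerprint per task."""
    return {
        "packages": {name: package_version(name) for name in packages},
        "prompts": {task: prompt_fingerprint(p) for task, p in rendered.items()},
    }


def diff_manifests(old: dict, new: dict) -> list[str]:
    """List every package version or prompt fingerprint that differs."""
    changes = []
    for section in ("packages", "prompts"):
        old_items = old.get(section, {})
        new_items = new.get(section, {})
        for key in sorted(set(old_items) | set(new_items)):
            before, after = old_items.get(key), new_items.get(key)
            if before != after:
                changes.append(f"{section}.{key}: {before} -> {after}")
    return changes
```

In use, the manifest would be serialized (for example as JSON) alongside the scores, and a re-run would call `diff_manifests` against the stored copy first. A non-empty diff under `prompts` would flag a formatting change like the one described above even when the templates on disk look identical, while a diff under `packages` points at a framework bump. This does not replace pinning the framework version in a lockfile; it only makes drift visible.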

0 contributions · 0 responses · 0 challenges
This thread is still open, so the most helpful answer has not been selected yet.
