Reproducible eval benchmarks for fine-tuned LLMs drift over time

Question

We fine-tuned a 7B model on a domain-specific corpus and evaluated it against MMLU, GSM8K, and a custom benchmark. Initial scores were solid. Two months later, re-running the same eval harness on the same checkpoint (same weights, same prompt templates, same temperature=0) gave different results: GSM8K dropped 4.2 points, MMLU was stable, and the custom benchmark shifted 1-3% across sub-tasks.

The only thing that changed was the eval framework version (lm-eval went from 0.4.3 to 0.4.5). Digging into the diff between the two versions, the prompt formatting for multiple-choice questions had changed subtly.

How do you pin reproducibility for eval runs? Are you locking the framework version, or building a wrapper that normalizes prompts before they hit the model? Interested in practical approaches, not just theoretical concerns.
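To make the two options concrete, here is a minimal sketch of a run manifest that records the installed framework version and a hash of the exact rendered prompts per task, so a re-run can be checked against the original before any scores are compared. It assumes the rendered prompts can be captured as plain strings before they reach the model; how to get them out of a given harness is not shown. The function names (`prompt_fingerprint`, `build_manifest`, `diff_manifests`) and the distribution name `lm_eval` are illustrative assumptions, not part of any library's API.

```python
import hashlib
from importlib import metadata


def prompt_fingerprint(prompts):
    """Return a SHA-256 hex digest over the rendered prompt strings, in order."""
    h = hashlib.sha256()
    for p in prompts:
        data = p.encode("utf-8")
        # Length-prefix each prompt so ["ab", "c"] and ["a", "bc"] hash differently.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


def package_version(name):
    """Installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def build_manifest(task_prompts, packages=("lm_eval",)):
    """Build a manifest from a dict mapping task name -> list of prompt strings.

    The default package name is an assumption; pass whatever distributions
    your eval stack actually depends on.
    """
    return {
        "packages": {name: package_version(name) for name in packages},
        "prompt_sha256": {
            task: prompt_fingerprint(ps) for task, ps in sorted(task_prompts.items())
        },
    }


def diff_manifests(old, new):
    """Return a sorted list of 'section:key' entries whose recorded value differs."""
    changed = []
    for section in ("packages", "prompt_sha256"):
        old_sec = old.get(section, {})
        new_sec = new.get(section, {})
        for key in set(old_sec) | set(new_sec):
            if old_sec.get(key) != new_sec.get(key):
                changed.append(f"{section}:{key}")
    return sorted(changed)
```

Stored next to the scores (for example as JSON), a manifest like this would have flagged the changed prompt hashes on the affected tasks before the numbers were compared. It only detects drift; preventing it still means pinning the framework version in the environment, or rendering prompts from templates you control.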

Direct answers and proposed approaches

Risks, gaps, and constructive pushback