Reproducibility crisis in agent evaluation — what's your baseline?
We've been running internal evals across 8 LLM providers on a custom reasoning benchmark (math word problems + logic puzzles, ~2000 items). The problem: scores for the same model vary by 4-7 percentage points between runs, depending on temperature settings, batch size, and even the day of the week (API versioning?).

Our current setup and what we've seen so far:
- temperature=0.0, max_tokens=2048, deterministic seed where supported
- Parallel requests (batch size 50) and sequential requests give different results
- Some providers (looking at you, Anthropic) seem to change model weights without version bumps

Question for others running evals: what's your reproducibility baseline? Are you:
1. Pinning to specific model version hashes?
2. Running A/B comparisons on the same day to control for API drift?
3. Using any framework (lm-eval, DSPy eval harness) that normalizes across providers?

We need a defensible methodology for reporting results. Currently the noise floor is too high to claim "Model X is 3% better than Y" with any confidence.
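
One way to put a number on that noise floor is to report two things side by side: the spread of accuracy across repeated runs of the same model, and a paired bootstrap confidence interval for the difference between two models on the same items. Below is a minimal sketch, assuming per-item 0/1 correctness is logged for every run and the items are aligned by index. The function names `run_to_run_spread` and `paired_bootstrap_diff` are hypothetical helpers, not part of lm-eval or DSPy, and the data in the usage section is synthetic.

```python
import random
import statistics


def run_to_run_spread(accuracies):
    """Return (mean, sample standard deviation) of accuracy over repeated runs of one model."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def paired_bootstrap_diff(correct_x, correct_y, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(X) - accuracy(Y) on the same items.

    correct_x and correct_y are equal-length lists of 0/1 per-item scores,
    aligned so that index i refers to the same benchmark item in both.
    """
    if len(correct_x) != len(correct_y):
        raise ValueError("per-item score lists must be aligned and equal length")
    n = len(correct_x)
    rng = random.Random(seed)

    # Work on per-item differences so each resample keeps the X/Y pairing intact.
    per_item_diff = [x - y for x, y in zip(correct_x, correct_y)]
    observed = sum(per_item_diff) / n

    resampled = []
    for _ in range(n_resamples):
        sample = rng.choices(per_item_diff, k=n)  # resample items with replacement
        resampled.append(sum(sample) / n)
    resampled.sort()

    lo = resampled[int((alpha / 2) * n_resamples)]
    hi = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    return observed, lo, hi


if __name__ == "__main__":
    # Synthetic data for illustration only, not real benchmark results.
    rng = random.Random(42)
    n_items = 2000
    scores_x = [1 if rng.random() < 0.75 else 0 for _ in range(n_items)]
    scores_y = [1 if rng.random() < 0.72 else 0 for _ in range(n_items)]

    diff, lo, hi = paired_bootstrap_diff(scores_x, scores_y)
    print(f"X - Y = {diff:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")

    # Made-up accuracies from five repeated runs of one model.
    mean_acc, sd = run_to_run_spread([0.71, 0.75, 0.68, 0.73, 0.74])
    print(f"repeat-run accuracy: mean {mean_acc:.3f}, sd {sd:.3f}")
```

Note that the bootstrap interval only reflects item-sampling noise within a single pair of runs. It says nothing about API drift or nondeterminism between runs, so the repeat-run standard deviation should be reported next to it. A difference between models is hard to defend unless it is clearly larger than both.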