Reproducibility crisis in agent evaluation — what's your baseline?

Question

We've been running internal evals across 8 LLM providers on a custom reasoning benchmark (math word problems + logic puzzles, ~2000 items). The problem: accuracy for the same model varies by 4-7 percentage points between runs, depending on temperature settings, batch size, and even the day of the week (API versioning?).

Our current setup:
- temperature=0.0, max_tokens=2048, deterministic seed where supported
- Parallel requests (batch size 50) and sequential requests give different results
- Some providers (looking at you, Anthropic) seem to change the underlying model without version bumps
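
To put a number on the parallel vs sequential gap, here is a minimal sketch of how two runs over the same items could be compared. It assumes each run has already been graded into a list of per-item booleans in the same item order; `compare_runs` and the key names are made up for illustration, not part of any framework.

```python
from typing import Sequence


def compare_runs(run_a: Sequence[bool], run_b: Sequence[bool]) -> dict:
    """Compare per-item correctness of two runs over the same items.

    Both sequences must be in the same item order (an assumption of this sketch).
    """
    if len(run_a) != len(run_b) or len(run_a) == 0:
        raise ValueError("runs must be non-empty and cover the same items")
    n = len(run_a)
    acc_a = sum(run_a) / n
    acc_b = sum(run_b) / n
    # An item "flips" when it is correct in one run and wrong in the other.
    flips = sum(a != b for a, b in zip(run_a, run_b))
    return {
        "acc_a": acc_a,
        "acc_b": acc_b,
        "acc_delta": acc_b - acc_a,
        "flip_rate": flips / n,
    }
```

The flip rate is worth tracking separately from the accuracy delta: two runs can land on the same headline score while disagreeing on a large share of individual items, which would still point to nondeterminism.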

Question for others running evals: what's your reproducibility baseline? Are you:
1. Pinning to specific model version hashes?
2. Running A/B on the same day to control for API drift?
3. Using any framework (lm-eval, DSPy eval harness) that normalizes across providers?
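
For option 1, one thing we could do regardless of provider is record a manifest per run, so that two scores are only compared when their settings match. A rough sketch, with every field name and value below hypothetical (the model ID is a placeholder, not a real identifier):

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def config_fingerprint(config: dict) -> str:
    """Stable short hash of the eval settings, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def make_manifest(config: dict, reported_model: Optional[str] = None) -> dict:
    """Bundle the settings with whatever model name the API reported back."""
    return {
        "fingerprint": config_fingerprint(config),
        "config": config,
        # Assumption: the provider echoes a model identifier in its response.
        "reported_model": reported_model,
        "started_at_utc": datetime.now(timezone.utc).isoformat(),
    }


# Hypothetical settings mirroring the setup described above.
example_config = {
    "provider": "example-provider",
    "model": "example-model-2024-01-01",  # placeholder for a pinned snapshot ID
    "temperature": 0.0,
    "max_tokens": 2048,
    "seed": 1234,
    "concurrency": 1,
    "benchmark_sha256": "<hash of the benchmark file>",
}
```

Storing the manifest next to the results at least lets us tell "the config changed" apart from "the provider changed something" when two runs disagree.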

We need a defensible methodology for reporting results. Currently the noise floor is too high to claim "Model X is 3 percentage points better than Y" with any confidence.
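
One candidate for the reporting side is a paired bootstrap over items, since both models answer the same ~2000 questions. The sketch below is an assumption about what a defensible comparison could look like, not something we run today; `paired_bootstrap_diff` and its parameters are illustrative names.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_diff(
    correct_x: Sequence[bool],
    correct_y: Sequence[bool],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed accuracy difference X - Y, CI lower, CI upper).

    Items are resampled with replacement, keeping each item's X and Y
    results together so per-item difficulty cancels out.
    """
    if len(correct_x) != len(correct_y) or len(correct_x) == 0:
        raise ValueError("both models must be scored on the same non-empty item set")
    n = len(correct_x)
    # Per-item difference: +1 if only X is right, -1 if only Y is right, 0 otherwise.
    diffs = [int(x) - int(y) for x, y in zip(correct_x, correct_y)]
    observed = sum(diffs) / n

    rng = random.Random(seed)  # fixed seed so the interval itself is reproducible
    resampled = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)
        resampled.append(sum(sample) / n)
    resampled.sort()

    # Percentile interval from the sorted resampled differences.
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lower, upper
```

If the interval includes zero, the gap is not distinguishable from item-sampling noise. This only covers noise from the choice of items; run-to-run variance on the same model would still need repeated runs per model, with the spread reported alongside the mean.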

Direct answers and proposed approaches

Risks, gaps, and constructive pushback