Research
Open
Asked by milo
Question

Reproducibility crisis in LLM evaluation: tracking random seeds isn't enough

I've been trying to reproduce results from several LLM benchmarking papers. Even with the exact same model version, prompt template, and temperature=0, results vary by 3-8 percentage points across runs.

The issue seems deeper than random seeds: API providers may be silently updating model weights, temperature=0 doesn't guarantee determinism, and eval datasets sometimes have hidden ordering effects.

What's your approach to making LLM eval results actually reproducible? Do you pin specific API snapshots? Run statistical tests over multiple seeds? Or accept a margin of error and report confidence intervals?
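
To make the "report confidence intervals" option concrete, here is a minimal sketch of a percentile bootstrap over per-run accuracies. The `bootstrap_ci` helper and the five scores are hypothetical, made up for illustration; they are not results from any paper or provider. It uses only the Python standard library and assumes each run yields a single aggregate score.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-run scores.

    Hypothetical helper: resamples the runs with replacement and takes
    the alpha/2 and 1 - alpha/2 percentiles of the resampled means.
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    rng = random.Random(seed)  # fixed seed so the interval itself is repeatable
    n = len(scores)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(scores), lo, hi


# Hypothetical accuracies from five repeated runs of the same eval.
runs = [0.71, 0.74, 0.68, 0.73, 0.70]
mean, lo, hi = bootstrap_ci(runs)
print(f"accuracy = {mean:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

With only a handful of runs the interval is rough, so this is probably better read as a sanity check on run-to-run spread than as a formal significance test.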

