Evaluating LLM agents: how to separate task completion from verbosity bias?

Question

We're benchmarking agent frameworks on coding tasks and running into a classic evaluation problem: rubric-based judges score longer responses higher, even when the extra content is filler. The judge LLM appears to confuse thoroughness with correctness.

Current setup:
- Task: debug a failing Python test suite
- Scoring: 1-5 on correctness, efficiency, explanation quality
- Judge: GPT-4o-mini with a structured rubric

Observed: agents that output roughly twice as much text score ~0.4 points higher on average, even when the actual fix is identical.
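For concreteness, here is a minimal sketch of how a length effect like this could be quantified while holding the fix constant. It assumes each judged response is logged as a `(fix_id, n_tokens, score)` tuple, where `fix_id` groups responses whose code fix is identical; those names and the logging format are hypothetical, not part of our harness. It fits a single within-group slope of score on log2(token count), so the result reads as "points gained per doubling of length".

```python
import math
from collections import defaultdict


def points_per_doubling(records):
    """Estimate the judge's score gain per 2x increase in response length.

    records: iterable of (fix_id, n_tokens, score) tuples (hypothetical format).
    Responses are compared only against others with the same fix_id, so
    differences in fix quality between groups do not leak into the slope.
    """
    groups = defaultdict(list)
    for fix_id, n_tokens, score in records:
        groups[fix_id].append((math.log2(n_tokens), score))

    sxy = 0.0  # pooled within-group covariance term
    sxx = 0.0  # pooled within-group variance of log2(length)
    for rows in groups.values():
        mean_x = sum(x for x, _ in rows) / len(rows)
        mean_y = sum(y for _, y in rows) / len(rows)
        for x, y in rows:
            sxy += (x - mean_x) * (y - mean_y)
            sxx += (x - mean_x) ** 2

    if sxx == 0:
        raise ValueError("need at least one fix with responses of different lengths")
    return sxy / sxx


# Made-up illustrative data: two fixes, each answered at two lengths.
example = [
    ("fix_a", 100, 3.0), ("fix_a", 200, 3.4),
    ("fix_b", 150, 4.0), ("fix_b", 300, 4.4),
]
slope = points_per_doubling(example)
```

A slope near zero would suggest the judge is not rewarding length once the fix is held constant; the example data above is invented purely to exercise the function.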

What approaches have worked for decoupling signal from verbosity?
- Constrained output formats?
- Separate correctness judges that only see the code diff?
- Penalizing token count in the scoring formula?
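To make the last two options concrete, here is a rough sketch of what we have in mind, not a tested pipeline. `diff_only_payload` builds the only text a correctness judge would see (a unified diff from the standard library's `difflib`), and `length_adjusted` subtracts a per-doubling penalty from a judge score. The function names, the `ref_tokens` baseline, and the `slope` parameter are all hypothetical; the 0.4 used in the example is just the gap we observed, not a recommended constant.

```python
import difflib
import math


def diff_only_payload(before: str, after: str, path: str = "module.py") -> str:
    """Return a unified diff of the agent's change.

    The correctness judge receives only this string, so the agent's
    explanation text cannot influence the correctness score.
    """
    lines = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
        lineterm="",
    )
    return "\n".join(lines)


def length_adjusted(score: float, n_tokens: int, ref_tokens: int, slope: float) -> float:
    """Subtract `slope` points per doubling of length relative to ref_tokens.

    Responses shorter than ref_tokens get a small bonus by the same rule.
    The result is clamped to the 1-5 rubric range.
    """
    adjusted = score - slope * math.log2(n_tokens / ref_tokens)
    return min(5.0, max(1.0, adjusted))


# Illustrative inputs only.
payload = diff_only_payload(
    "def total(xs):\n    return sum(xs) + 1\n",
    "def total(xs):\n    return sum(xs)\n",
)
adjusted = length_adjusted(score=3.9, n_tokens=400, ref_tokens=200, slope=0.4)
```

One worry with the penalty version is that it also docks responses that are long for good reasons, which is part of why we are asking what has worked for others.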

Looking for methods that hold up across different model families.


Direct answers and proposed approaches

Risks, gaps, and constructive pushback