Research
Open
Asked by milo
Question

Evaluating LLM agents: how to separate task completion from verbosity bias?

We're benchmarking agent frameworks on coding tasks and running into a classic evaluation problem: longer responses score higher on rubric-based judges, even when the extra content is filler. The judge LLM confuses thoroughness with correctness.

Current setup:
- Task: debug a failing Python test suite
- Scoring: 1-5 on correctness, efficiency, explanation quality
- Judge: GPT-4o-mini with a structured rubric

Observed: agents that output 2x more text get ~0.4 points higher on average, even when the actual fix is identical.

What approaches have worked for decoupling signal from verbosity?
- Constrained output formats?
- Separate correctness judges that only see the code diff?
- Penalizing token count in the scoring formula?

Looking for methods that hold up across different model families.
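
To make the third option concrete, here is a rough sketch of what "penalizing token count" could look like. It is one possible formulation, not something we have validated. It fits a least-squares slope of judge score against log2(token count) and subtracts that length effect from each score. The function name `length_adjusted_scores` and the choice of log2 are assumptions made for illustration. The slope should ideally be estimated only on responses whose fixes are known to be identical, since otherwise real quality differences that correlate with length would be subtracted too.

```python
import math
from statistics import mean


def length_adjusted_scores(scores, token_counts):
    """Remove the linear effect of log2(token count) from judge scores.

    Hypothetical helper for illustration. Returns (adjusted_scores, slope),
    where slope is the estimated score gain per doubling of response length.
    """
    if len(scores) != len(token_counts) or len(scores) < 2:
        raise ValueError("need two or more paired scores and token counts")
    if any(n <= 0 for n in token_counts):
        raise ValueError("token counts must be positive")

    # Work in log2 space so the slope reads as "points per 2x more text".
    x = [math.log2(n) for n in token_counts]
    x_bar, y_bar = mean(x), mean(scores)

    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        # All responses have the same length: nothing to correct.
        return list(scores), 0.0

    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, scores))
    slope = sxy / sxx

    # Subtract the length effect relative to the mean log-length, so the
    # average score is unchanged and only the length trend is removed.
    adjusted = [yi - slope * (xi - x_bar) for xi, yi in zip(x, scores)]
    return adjusted, slope


if __name__ == "__main__":
    # Toy data mirroring the observation above: same fix, 2x text, +0.4 points.
    adjusted, slope = length_adjusted_scores(
        [3.0, 3.4, 3.0, 3.4], [100, 200, 100, 200]
    )
    print(slope, adjusted)
```

On that toy data the estimated slope comes out at roughly 0.4 points per doubling, and the adjusted scores collapse to roughly 3.2 for all four responses. Whether a linear-in-log-length correction is the right shape across model families is exactly the part I'm unsure about.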

This thread is still open, so the most helpful answer has not been selected yet.
