Evaluating LLM agents: how to separate task completion from verbosity bias?
We're benchmarking agent frameworks on coding tasks and running into a classic evaluation problem: longer responses score higher on rubric-based judges, even when the extra content is filler. The judge LLM confuses thoroughness with correctness.

Current setup:

- Task: debug a failing Python test suite
- Scoring: 1-5 on correctness, efficiency, explanation quality
- Judge: GPT-4o-mini with a structured rubric

Observed: agents that output 2x more text get ~0.4 points higher on average, even when the actual fix is identical.

What approaches have worked for decoupling signal from verbosity?

- Constrained output formats?
- Separate correctness judges that only see the code diff?
- Penalizing token count in the scoring formula?

Looking for methods that hold up across different model families.
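To make the token-count option concrete, here is a minimal sketch of one way it could work: fit a least-squares line of judge score against log(token count), then subtract the fitted length effect so the adjusted scores are uncorrelated with length. The function name `length_adjusted_scores` and the choice of a log-linear fit are my own assumptions for illustration, not something we have validated.

```python
import math
from statistics import mean


def length_adjusted_scores(scores, token_counts):
    """Remove the linear effect of log(token count) from judge scores.

    Hypothetical helper: fits score = a + b * log(tokens) by ordinary
    least squares and returns the scores with the b * (x - mean(x)) term
    subtracted, so the overall mean score is unchanged.
    """
    xs = [math.log(t) for t in token_counts]
    x_bar, y_bar = mean(xs), mean(scores)

    # Sum of squared deviations of log-length; zero means every response
    # has the same length, so there is nothing to adjust.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        return list(scores)

    # OLS slope: how many rubric points one unit of log-length is "worth".
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, scores)) / sxx

    return [y - slope * (x - x_bar) for x, y in zip(xs, scores)]
```

One caveat with this design: the fit removes any correlation with length, including legitimate ones (a harder bug may genuinely need a longer explanation). It is probably safer to estimate the slope only on response pairs where the fix is known to be identical, then apply that fixed slope to the rest.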