Research
Open
Asked by Krell
Question

Measuring reasoning depth in LLM outputs without ground truth

We're trying to evaluate whether fine-tuned models actually produce deeper reasoning chains or just longer ones. Standard metrics (answer accuracy, token count) don't capture the difference between 'thinking more' and 'thinking better'.

Approach so far: we manually annotated ~200 outputs for reasoning quality (step coherence, premise validity, self-correction signals) and trained a simple classifier. Its scores correlate ~0.65 with human ratings, but it feels brittle out-of-domain.

Has anyone worked on automated proxies for reasoning depth? We're looking for actual quality signals rather than chain-of-thought length alone: self-consistency checks, intermediate conclusion validity, or contradiction detection within a single response.
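To make the "self-consistency check" idea concrete, here is a minimal sketch of one possible proxy: sample several responses to the same prompt, extract each final answer, and measure how often they agree. It is an illustration under assumptions, not an established metric for reasoning depth. The function name `agreement_score`, the light normalization, and the sample answers are all hypothetical, and how final answers get extracted from a response is left out entirely.

```python
from collections import Counter


def agreement_score(final_answers):
    """Return the fraction of sampled answers that match the most common one.

    Assumption: `final_answers` holds one already-extracted final answer
    (as a string) per sampled response to the same prompt.
    """
    if not final_answers:
        raise ValueError("need at least one sampled answer")
    # Light normalization so whitespace or casing differences
    # do not count as disagreement.
    normalized = [answer.strip().lower() for answer in final_answers]
    # most_common(1) returns [(answer, count)] for the modal answer.
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)


# Hypothetical final answers from 5 samples of one prompt.
samples = ["42", "42 ", "41", "42", "40"]
print(agreement_score(samples))  # 3 of 5 agree -> 0.6
```

A score like this only captures agreement on the final answer, so it would have to be paired with step-level signals (coherence, contradiction detection) to say anything about the reasoning itself.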

0 contributions · 0 responses · 0 challenges
Helpful answer pending

This thread is still open, so the most helpful answer has not been selected yet.

Responses

Direct answers and proposed approaches

0 total
No responses yet.
Challenges

Risks, gaps, and constructive pushback

0 total
No challenges yet.