Measuring reasoning depth in LLM outputs without ground truth

Question

We're trying to evaluate whether fine-tuned models actually produce deeper reasoning chains or just longer ones. Standard metrics (answer accuracy, token count) don't capture the difference between 'thinking more' and 'thinking better'.

Approach so far: we manually annotated ~200 outputs for reasoning quality (step coherence, premise validity, self-correction signals) and trained a simple classifier on those labels. Its scores correlate at ~0.65 with human ratings, but it feels brittle out of domain.
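One way to turn the "brittle out of domain" impression into a number is to compute the correlation per domain instead of only in aggregate. The following is a minimal sketch, not our actual pipeline: it assumes a rank (Spearman) correlation, which may differ from whatever produced the ~0.65 figure, and the record layout `(domain, classifier_score, human_rating)` and all function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


def spearman_by_domain(records):
    """Group (domain, classifier_score, human_rating) records and correlate
    within each domain. Domains with fewer than two records are skipped."""
    grouped = defaultdict(lambda: ([], []))
    for domain, score, rating in records:
        grouped[domain][0].append(score)
        grouped[domain][1].append(rating)
    return {d: spearman(s, r) for d, (s, r) in grouped.items() if len(s) >= 2}
```

A large gap between the per-domain values and the pooled value would suggest the classifier is picking up domain-specific surface features. With only ~200 annotations, per-domain estimates will be noisy, so treat small differences with caution.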

Has anyone worked on automated proxies for reasoning depth? We're not after chain-of-thought length alone, but actual quality signals such as self-consistency checks, intermediate conclusion validity, or contradiction detection within the same response.
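To make the self-consistency idea concrete, here is a rough sketch of the kind of signal we have in mind. It assumes you have already sampled several responses to the same prompt and extracted a final answer string from each; the extraction step, the `normalize` rule, and the function names are all hypothetical. It needs no ground truth, because it only measures how much the samples agree with each other.

```python
import math
from collections import Counter


def normalize(answer: str) -> str:
    """Crude normalization (an assumption): lowercase and collapse whitespace."""
    return " ".join(answer.lower().split())


def agreement_stats(answers):
    """Summarize agreement among final answers sampled for one prompt.

    Returns the most common answer, its share of the samples, and the
    entropy of the answer distribution scaled to [0, 1] by log(n), where
    0 means all samples agree and 1 means every sample differs.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize(a) for a in answers)
    n = len(answers)
    top_answer, top_count = counts.most_common(1)[0]
    # Shannon entropy written as sum p * log(1/p) to avoid a negative zero.
    entropy = sum((c / n) * math.log(n / c) for c in counts.values())
    normalized_entropy = entropy / math.log(n) if n > 1 else 0.0
    return {
        "majority_answer": top_answer,
        "majority_share": top_count / n,
        "normalized_entropy": normalized_entropy,
    }
```

For example, `agreement_stats(["42", " 42", "41", "42 "])` reports a majority share of 0.75. High agreement indicates stable conclusions, not valid reasoning steps, so this would be one feature alongside the others rather than a depth measure on its own.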

Direct answers and proposed approaches

Risks, gaps, and constructive pushback