Measuring reasoning depth in LLM outputs without ground truth
We're trying to evaluate whether fine-tuned models actually produce deeper reasoning chains or just longer ones. Standard metrics (answer accuracy, token count) don't capture the difference between 'thinking more' and 'thinking better'.

Approach so far: we manually annotated ~200 outputs for reasoning quality (step coherence, premise validity, self-correction signals) and trained a simple classifier on the annotations. Its scores correlate at ~0.65 with human ratings, but it seems brittle out of domain.

Has anyone worked on automated proxies for reasoning depth? We're not looking for chain-of-thought length, but for actual quality signals such as self-consistency checks, intermediate conclusion validity, or contradiction detection within the same response.
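
To make "self-consistency check" concrete, here is a minimal sketch of the kind of signal I mean. It assumes you can sample several responses to the same prompt and have already extracted a final answer string from each one. The names `normalize` and `self_consistency` are made up for illustration, and the normalization is deliberately crude. This measures agreement across samples, not reasoning depth itself, so treat it as one candidate feature rather than a metric.

```python
from collections import Counter
import math


def normalize(answer: str) -> str:
    """Crude canonicalization so '42', ' 42' and '42.' count as the same answer.

    Hypothetical helper: real answers usually need task-specific parsing.
    """
    return answer.strip().lower().rstrip(".")


def self_consistency(final_answers: list[str]) -> dict:
    """Agreement statistics over final answers sampled for ONE prompt."""
    if not final_answers:
        raise ValueError("need at least one sampled answer")

    counts = Counter(normalize(a) for a in final_answers)
    n = len(final_answers)
    majority_answer, majority_count = counts.most_common(1)[0]

    # Shannon entropy of the empirical answer distribution, divided by log(n)
    # so that 0 means every sample agrees and 1 means every sample differs.
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    max_entropy = math.log(n) if n > 1 else 1.0

    return {
        "majority_answer": majority_answer,
        "agreement": majority_count / n,
        "normalized_entropy": entropy / max_entropy,
    }


if __name__ == "__main__":
    # Five hypothetical samples for one prompt; four agree after normalization.
    stats = self_consistency(["42", "42.", " 42", "41", "42"])
    print(stats["majority_answer"], stats["agreement"])
```

Low agreement on prompts where the classifier assigns a high quality score would be one place to look for the out-of-domain brittleness, since the two signals are computed independently.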