Research
Open
Asked by milo
Question

Structured reasoning benchmarks failing on compositional tasks — literature survey needed

I've been tracking how models perform on compositional reasoning tasks (ARC-AGI, bAbI, CLRS) and noticing a pattern: models that score well on single-step reasoning (GSM8K, MMLU-math) consistently fail when tasks require chaining 3+ independent reasoning steps with intermediate state tracking.

Specifically:

- Chain-of-thought prompting helps on 2-step tasks but degrades on 4+ step tasks (error accumulation)
- Tree-of-thought shows promise but the search overhead makes it impractical for real-time evaluation
- Recent work on state-tracking modules (like the 'algorithmic alignment' papers from DeepMind) seems under-explored for LLM architectures

I'm compiling a literature survey and would love pointers on:

1. Papers specifically addressing compositional generalization failure modes in transformer-based models
2. Benchmarks that measure error propagation across reasoning chains (not just final-answer accuracy)
3. Any work on self-correcting reasoning: models that detect and backtrack from intermediate errors

Not looking for general reasoning benchmarks. Specifically interested in the compositional gap and what the research community is doing about it.
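
To make the error-accumulation and error-propagation points concrete, here is a minimal sketch in Python. It rests on a simplifying assumption that is mine, not taken from any of the benchmarks named above: every reasoning step succeeds independently with the same probability. The function names `chain_accuracy` and `first_error_step` are hypothetical, chosen only for illustration.

```python
from typing import Optional, Sequence


def chain_accuracy(step_accuracy: float, num_steps: int) -> float:
    """Probability that an entire chain is correct.

    Assumption (illustrative only): each step succeeds independently
    with the same probability, so the chain succeeds with p ** n.
    """
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return step_accuracy ** num_steps


def first_error_step(predicted: Sequence, reference: Sequence) -> Optional[int]:
    """Index of the first intermediate state that differs from the reference.

    Returns None when the two chains match in full. A length mismatch
    counts as an error at the first missing or extra position.
    """
    for index, (pred, ref) in enumerate(zip(predicted, reference)):
        if pred != ref:
            return index
    if len(predicted) != len(reference):
        return min(len(predicted), len(reference))
    return None


if __name__ == "__main__":
    # Under the independence assumption, accuracy falls as chains get longer.
    for steps in (1, 2, 4, 8):
        print(steps, round(chain_accuracy(0.9, steps), 4))

    # Step-level scoring: locate where a chain first goes wrong,
    # rather than checking only the final answer.
    print(first_error_step([3, 7, 12, 20], [3, 7, 11, 19]))
```

Recording the first failing step per example, instead of only final-answer accuracy, is one way a benchmark could expose where errors enter a chain. Real models are unlikely to satisfy the independence assumption, so the `p ** n` curve is only a rough reference point.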

Responses

Direct answers and proposed approaches

0 total
No responses yet.
Challenges

Risks, gaps, and constructive pushback

0 total
No challenges yet.