Models failing compositional tasks on structured reasoning benchmarks: literature survey needed
I've been tracking how models perform on compositional reasoning tasks (ARC-AGI, bAbI, CLRS) and noticing a pattern: models that score well on single-step reasoning (GSM8K, MMLU-math) consistently fail when tasks require chaining 3+ independent reasoning steps with intermediate state tracking.

Specifically:

- Chain-of-thought prompting helps on 2-step tasks but degrades on 4+ step tasks (error accumulation)
- Tree-of-thought shows promise, but the search overhead makes it impractical for real-time evaluation
- Recent work on state-tracking modules (like the 'algorithmic alignment' papers from DeepMind) seems under-explored for LLM architectures

I'm compiling a literature survey and would love pointers on:

1. Papers specifically addressing compositional generalization failure modes in transformer-based models
2. Benchmarks that measure error propagation across reasoning chains (not just final-answer accuracy)
3. Any work on self-correcting reasoning: models that detect and backtrack from intermediate errors

Not looking for general reasoning benchmarks. Specifically interested in the compositional gap and what the research community is doing about it.
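
To make "error accumulation" and "error propagation, not just final-answer accuracy" concrete, here is a rough Python sketch of what I have in mind. It rests on two simplifying assumptions: that each step fails independently with the same probability (real chains almost certainly violate this), and that intermediate states can be compared to a reference trace by exact match. The function names `chain_success_probability` and `first_error_step` are hypothetical, and the 0.9 per-step accuracy is purely illustrative, not a measured number.

```python
from typing import Optional, Sequence


def chain_success_probability(step_accuracy: float, num_steps: int) -> float:
    """Chance that every step in a chain is correct.

    Assumes independent, identically reliable steps, so the result is
    simply step_accuracy ** num_steps. This is an idealization.
    """
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return step_accuracy ** num_steps


def first_error_step(predicted: Sequence[str], gold: Sequence[str]) -> Optional[int]:
    """Index of the first intermediate state that diverges from the reference.

    Returns None when the two traces match exactly. If one trace is a
    strict prefix of the other, the error index is the shorter length.
    """
    for i, (p, g) in enumerate(zip(predicted, gold)):
        if p != g:
            return i
    if len(predicted) != len(gold):
        return min(len(predicted), len(gold))
    return None


if __name__ == "__main__":
    # Illustrative only: how a fixed per-step accuracy compounds with depth.
    for n in (1, 2, 4, 8):
        print(n, round(chain_success_probability(0.9, n), 3))

    # Step-level scoring: where does the trace first go wrong?
    gold_trace = ["x=2", "y=x+3=5", "z=y*2=10"]
    model_trace = ["x=2", "y=x+3=6", "z=y*2=12"]
    print(first_error_step(model_trace, gold_trace))
```

Under these assumptions, a benchmark could report the distribution of first-error positions alongside final-answer accuracy, which would separate "fails early and propagates" from "fails only at the last step".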