Structured reasoning benchmarks failing on compositional tasks — literature survey needed

Question

I've been tracking how models perform on compositional reasoning tasks (ARC-AGI, bAbI, CLRS) and keep noticing a pattern: models that score well on single-step reasoning (GSM8K, MMLU-math) consistently fail when tasks require chaining 3+ distinct reasoning steps with intermediate state tracking.

Specifically:
- Chain-of-thought prompting helps on 2-step tasks but degrades on tasks with 4+ steps, apparently because errors accumulate from step to step
- Tree-of-thought shows promise but the search overhead makes it impractical for real-time evaluation
- Recent work on state-tracking modules (like the 'algorithmic alignment' papers from DeepMind) seems under-explored for LLM architectures
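
To make the error-accumulation point concrete, here is a back-of-the-envelope sketch. It assumes every step succeeds independently with the same probability, which is a simplification for illustration and not a measured result. The function name `chain_accuracy` and the 0.9 per-step figure are hypothetical.

```python
def chain_accuracy(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that every step in a chain is correct.

    Assumes steps succeed independently with equal probability
    (an illustrative simplification, not an empirical claim).
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_accuracy ** num_steps


if __name__ == "__main__":
    # Hypothetical per-step accuracy of 0.9 across chains of growing length.
    for n in (1, 2, 4, 8):
        print(f"{n} steps: {chain_accuracy(0.9, n):.3f}")
```

Under that assumption, whole-chain accuracy shrinks geometrically with chain length, so a model that looks strong per step can still look poor on 4+ step tasks. Real chains likely violate the independence assumption, which is part of why step-level measurement matters.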

I'm compiling a literature survey and would love pointers on:
1. Papers specifically addressing compositional generalization failure modes in transformer-based models
2. Benchmarks that measure error propagation across reasoning chains (not just final-answer accuracy)
3. Any work on self-correcting reasoning — models that detect and backtrack from intermediate errors
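
For item 2, the kind of step-level scoring I have in mind could look roughly like the sketch below. It is a hypothetical scorer, not taken from any existing benchmark. It assumes each example ships with gold intermediate states, that predicted states line up with gold states by position, and that exact string match is an acceptable comparison. The names `first_error_step` and `score_chain` are made up for illustration.

```python
from typing import Optional, Sequence


def first_error_step(predicted: Sequence[str], gold: Sequence[str]) -> Optional[int]:
    """Return the 0-based index of the first gold step that is wrong or missing.

    Returns None if every gold step is matched exactly.
    """
    for i, gold_state in enumerate(gold):
        if i >= len(predicted) or predicted[i] != gold_state:
            return i
    return None


def score_chain(predicted: Sequence[str], gold: Sequence[str]) -> dict:
    """Score one reasoning chain at the step level, not just the final answer."""
    if not gold:
        raise ValueError("gold chain must contain at least one step")
    first_err = first_error_step(predicted, gold)
    matched = sum(1 for p, g in zip(predicted, gold) if p == g)
    final_correct = bool(predicted) and predicted[-1] == gold[-1]
    return {
        "final_correct": final_correct,
        "step_accuracy": matched / len(gold),
        "first_error_step": first_err,
        # Final answer right despite an earlier wrong step (relevant to item 3).
        "recovered": final_correct and first_err is not None,
    }


if __name__ == "__main__":
    gold = ["x=2", "y=5", "z=7", "answer=14"]
    propagated = ["x=2", "y=6", "z=8", "answer=16"]  # error at step 1 carries forward
    corrected = ["x=2", "y=6", "z=7", "answer=14"]   # error at step 1, later fixed
    print(score_chain(propagated, gold))
    print(score_chain(corrected, gold))
```

The point of separating `first_error_step` from `final_correct` is that two chains with the same final-answer accuracy can behave very differently: one propagates an early mistake, the other recovers from it.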

Not looking for general reasoning benchmarks. Specifically interested in the compositional gap and what the research community is doing about it.

Direct answers and proposed approaches

Risks, gaps, and constructive pushback