Measuring context window utilization vs. actual reasoning depth

Question

We ran a benchmark in which we fed models 10K-token prompts with varying signal-to-noise ratios. Counterintuitively, models with 128K context windows did not outperform 8K-context models when the task required multi-hop reasoning over the same content. The extra tokens seemed to dilute attention rather than help.

Has anyone measured this independently? We're wondering if the 'needle in a haystack' benchmarks miss the point because real tasks aren't about retrieval but about structured reasoning over dense information.
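
For anyone trying to reproduce this independently, here is a minimal sketch of how a probe of this kind could be built. It is not our benchmark harness: `Probe`, `build_probe`, the "points to" fact template, and the `item-NNNN` names are all hypothetical, and the signal-to-noise ratio is counted in facts rather than tokens, which is an assumption. The idea is to pair a single-hop lookup (needle-style retrieval) with a multi-hop question over the same shuffled context, so the two can be scored side by side.

```python
import random
from dataclasses import dataclass


@dataclass
class Probe:
    context: str       # shuffled facts, signal and noise interleaved
    retrieval_q: str   # single-hop question (needle-style lookup)
    retrieval_a: str
    multihop_q: str    # question that requires following the whole chain
    multihop_a: str


def build_probe(num_hops: int, signal_ratio: float, seed: int = 0) -> Probe:
    """Build one synthetic probe (hypothetical format, for illustration only)."""
    if num_hops < 1 or not 0.0 < signal_ratio <= 1.0:
        raise ValueError("need num_hops >= 1 and 0 < signal_ratio <= 1")
    rng = random.Random(seed)

    # Distractor count chosen so that signal / (signal + noise) ~= signal_ratio.
    n_noise = round(num_hops * (1.0 - signal_ratio) / signal_ratio)

    # Unique names from one shared pool, so signal and noise entities look
    # alike on the surface. The pool holds 9000 ids; larger probes need a
    # bigger range.
    ids = rng.sample(range(1000, 10000), num_hops + 1 + n_noise + 1)
    names = [f"item-{i}" for i in ids]
    chain, noise = names[: num_hops + 1], names[num_hops + 1:]

    # Signal: chain[0] -> chain[1] -> ... -> chain[num_hops].
    facts = [f"{a} points to {b}." for a, b in zip(chain, chain[1:])]

    # Noise: each distractor source appears once and only points at other
    # distractors, so there is no alternative path into or out of the chain.
    for src in noise[:n_noise]:
        dst = rng.choice(noise)
        while dst == src:
            dst = rng.choice(noise)
        facts.append(f"{src} points to {dst}.")

    rng.shuffle(facts)
    return Probe(
        context="\n".join(facts),
        retrieval_q=f"What does {chain[0]} point to?",
        retrieval_a=chain[1],
        multihop_q=(
            f"Start at {chain[0]} and follow 'points to' "
            f"{num_hops} times. Where do you end?"
        ),
        multihop_a=chain[-1],
    )
```

Holding `num_hops` fixed and sweeping `signal_ratio` downward would, under these assumptions, let you plot accuracy on `retrieval_q` against accuracy on `multihop_q` for each model. If the retrieval score stays flat while the multi-hop score drops, that is the gap a needle-in-a-haystack test would not show.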

Direct answers and proposed approaches

Risks, gaps, and constructive pushback