Chain-of-thought extraction attacks: is your eval pipeline leaking reasoning traces?
Recent papers show that even without explicit CoT prompts, models can leak reasoning traces through output token distributions or structured responses. I'm reviewing our eval pipeline and realizing that every benchmark run captures the full generation output, which might include internal reasoning if the model is prompted for explanations.

Questions for teams running evals at scale:

1. Do you strip reasoning traces before storing eval outputs, or treat them as sensitive?
2. Has anyone measured the delta in eval scores when reasoning traces are suppressed?
3. What's your threat model for eval data? Is leaked CoT a concern for model IP?

I'm especially interested in whether suppression changes the ranking of models on reasoning-heavy benchmarks (GSM8K, MATH, GPQA). If the score gap between the two conditions is >5%, that suggests we're partially measuring reasoning extraction capability rather than answer quality.
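For anyone who wants to try this on their own runs, here is a rough Python sketch of the two pieces involved: stripping traces before storage, and comparing per-model scores between a run with reasoning and a run with it suppressed. Everything specific in it is an assumption on my part rather than a description of any real pipeline: the `<think>...</think>` delimiter, the function names, reading the 5% threshold as an absolute difference in accuracy, and the placeholder scores in the usage section.

```python
import re
from typing import Dict, List

# Assumption: reasoning is wrapped in <think>...</think> tags.
# Swap in whatever delimiter your models actually emit.
REASONING_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)


def strip_reasoning(output: str) -> str:
    """Drop delimited reasoning blocks and return the remaining text, trimmed."""
    return REASONING_BLOCK.sub("", output).strip()


def ranking(scores: Dict[str, float]) -> List[str]:
    """Model names ordered best to worst; ties broken by name for determinism."""
    return sorted(scores, key=lambda name: (-scores[name], name))


def compare_conditions(
    with_cot: Dict[str, float],
    suppressed: Dict[str, float],
    threshold: float = 0.05,  # assumption: 5% read as an absolute accuracy gap
) -> Dict[str, object]:
    """Compare per-model scores from a run with reasoning and a run without it."""
    if with_cot.keys() != suppressed.keys():
        raise ValueError("both conditions must score the same set of models")
    deltas = {name: with_cot[name] - suppressed[name] for name in with_cot}
    return {
        "deltas": deltas,
        # Models whose score moves by more than the threshold in either direction.
        "flagged": sorted(n for n, d in deltas.items() if abs(d) > threshold),
        "ranking_changed": ranking(with_cot) != ranking(suppressed),
    }


if __name__ == "__main__":
    # Placeholder scores for illustration only, not measurements.
    report = compare_conditions(
        with_cot={"model_a": 0.80, "model_b": 0.75},
        suppressed={"model_a": 0.60, "model_b": 0.72},
    )
    print(report["flagged"], report["ranking_changed"])
    print(strip_reasoning("<think>2 + 2 = 4</think>\nAnswer: 4"))
```

A regex strip like this only catches reasoning that is explicitly delimited. Traces embedded in free-form explanations, or leaked through token distributions, would need a different approach, so treat it as a storage-hygiene baseline rather than a defense.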