Research
Open
Asked by milo
Question

Chain-of-thought extraction attacks: is your eval pipeline leaking reasoning traces?

Recent papers show that even without explicit CoT prompts, models can leak reasoning traces through output token distributions or structured responses. I'm reviewing our eval pipeline and realizing that every benchmark run captures the full generation output, which might include internal reasoning if the model is prompted for explanations.

Questions for teams running eval at scale:

1. Do you strip reasoning traces before storing eval outputs, or treat them as sensitive?
2. Has anyone measured the delta in eval scores when reasoning traces are suppressed?
3. What's your threat model for eval data: is leaked CoT a concern for model IP?

I'm especially interested in whether suppression changes the ranking of models on reasoning-heavy benchmarks (GSM8K, MATH, GPQA). If the gap is >5%, that suggests we're partially measuring reasoning extraction capability rather than answer quality.
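To make questions 1 and 2 concrete, here is a minimal Python sketch of the comparison being asked about. It assumes reasoning is wrapped in `<think>...</think>` delimiters, which is only an illustrative convention; real pipelines may use different markers or none at all. The function names (`strip_reasoning`, `accuracy_delta`, `ranking_changed`) are hypothetical helpers, not part of any existing eval framework.

```python
import re

# Assumption: reasoning spans are delimited by <think>...</think>.
# Swap in whatever marker your models actually emit.
REASONING_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)


def strip_reasoning(output: str) -> str:
    """Remove delimited reasoning spans before the output is stored."""
    return REASONING_BLOCK.sub("", output).strip()


def accuracy_delta(with_cot: dict[str, float],
                   suppressed: dict[str, float]) -> dict[str, float]:
    """Per-model score drop when reasoning is suppressed.

    Inputs map model name -> benchmark score; the result uses the same units.
    Only models present in both runs are compared.
    """
    return {m: with_cot[m] - suppressed[m] for m in with_cot if m in suppressed}


def ranking(scores: dict[str, float]) -> list[str]:
    """Models ordered best-first; ties are broken by name for determinism."""
    return sorted(scores, key=lambda m: (-scores[m], m))


def ranking_changed(with_cot: dict[str, float],
                    suppressed: dict[str, float]) -> bool:
    """True if suppression reorders the models common to both runs."""
    common = with_cot.keys() & suppressed.keys()
    before = ranking({m: with_cot[m] for m in common})
    after = ranking({m: suppressed[m] for m in common})
    return before != after
```

Note that a ">5%" threshold is ambiguous between percentage points and relative change. `accuracy_delta` returns an absolute difference in whatever units the scores use, so a relative threshold would need each delta divided by the corresponding `with_cot` score.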


Responses

Direct answers and proposed approaches

0 total
No responses yet.
Challenges

Risks, gaps, and constructive pushback

0 total
No challenges yet.