Speculative decoding gains collapse past 10B parameters?

Question

Running speculative decoding (draft=1.3B, target=7B) gives a 2.1x speedup on 500-token prompts. But scaling to target=13B drops that to 1.3x, and at 30B it's barely 1.1x. Over the same range, the draft acceptance rate falls from 78% to 41%. Is this a known ceiling, i.e. does the distributional gap between draft and target widen non-linearly with scale? Or is there a draft architecture trick I'm missing (e.g. matching head dimensions)?

Direct answers and proposed approaches
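
Before looking for an architecture fix, it may help to check whether the reported speedups are even consistent with the reported acceptance rates. The sketch below uses the usual idealized estimate for speculative sampling: if each drafted token is accepted independently with probability alpha, and the draft proposes gamma tokens per target forward pass, the expected number of tokens produced per pass is (1 - alpha^(gamma+1)) / (1 - alpha), and each pass costs roughly gamma * c + 1 target-forward equivalents, where c is the draft-to-target cost ratio. Everything beyond the 78% and 41% figures is an assumption here: the question does not state gamma, the function name `expected_speedup` is made up for illustration, the parameter-count ratio is only a rough proxy for c, and the reported "acceptance rate" may not be the per-token alpha this formula expects.

```python
def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Idealized speculative-decoding speedup over plain autoregressive decoding.

    alpha      -- per-token acceptance probability (assumed i.i.d. across tokens)
    gamma      -- number of tokens the draft proposes per target forward pass
    cost_ratio -- cost of one draft forward pass relative to one target pass
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    if gamma < 1:
        raise ValueError("gamma must be at least 1")
    if cost_ratio < 0.0:
        raise ValueError("cost_ratio must be non-negative")

    # Expected tokens emitted per target pass (accepted drafts plus one
    # token that the target itself contributes).
    expected_tokens = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

    # Cost of one iteration in units of a single target forward pass.
    cost_per_iteration = gamma * cost_ratio + 1.0

    return expected_tokens / cost_per_iteration


if __name__ == "__main__":
    GAMMA = 4  # assumed draft length; not given in the question

    # Parameter-count ratio used as a crude stand-in for the cost ratio.
    scenarios = {
        "7B target, alpha=0.78": (0.78, 1.3 / 7.0),
        "30B target, alpha=0.41": (0.41, 1.3 / 30.0),
    }
    for label, (alpha, c) in scenarios.items():
        print(f"{label}: {expected_speedup(alpha, GAMMA, c):.2f}x")
```

If the measured speedup sits well below what this estimate gives for the measured acceptance rate, the shortfall probably comes from overheads the model ignores (kernel launch, KV-cache handling, batching, draft latency that does not scale with parameter count) rather than from the acceptance rate alone. If the two roughly agree, the acceptance rate is the thing to fix, and sweeping gamma with this function shows how much a shorter draft length would recover.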

Risks, gaps, and constructive pushback