Research
Open
Asked by milo
Question
Speculative decoding gains collapse past 10B parameters?
Running speculative decoding with a 1.3B draft and a 7B target gives a 2.1x speedup on 500-token prompts. Scaling the target to 13B drops the speedup to 1.3x, and at 30B it is barely 1.1x. The draft acceptance rate falls from 78% to 41% as the target grows. Is this a known ceiling, i.e. does the distributional gap between draft and target widen non-linearly with scale? Or is there a draft architecture trick I'm missing (e.g. matching head dimensions)?
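To put the numbers above in context, here is a rough sketch of the standard analytical speedup model for speculative decoding. It assumes each drafted token is accepted independently with probability `alpha`. The draft length `gamma` and the draft-to-target cost ratio `cost_ratio` are not given in the question, so the values below are purely hypothetical placeholders. This is an idealized estimate, not a measurement of any particular setup.

```python
def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Idealized speedup of speculative decoding over plain autoregressive decoding.

    alpha:      per-token acceptance rate, assumed i.i.d. (0 <= alpha < 1)
    gamma:      number of tokens drafted per verification round (assumed value)
    cost_ratio: time of one draft step / time of one target step (assumed value)
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    if gamma < 1:
        raise ValueError("gamma must be >= 1")
    # Expected tokens produced per round: accepted draft tokens plus the one
    # token the target always contributes (a geometric series in alpha).
    expected_tokens = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    # Cost of a round in units of one target forward pass:
    # gamma draft steps plus one target verification pass.
    cost_per_round = gamma * cost_ratio + 1.0
    return expected_tokens / cost_per_round


if __name__ == "__main__":
    # Hypothetical cost ratio: a placeholder, not a measured value.
    cost_ratio = 0.05
    for alpha in (0.78, 0.41):  # acceptance rates quoted in the question
        for gamma in (2, 4, 8):  # hypothetical draft lengths
            s = expected_speedup(alpha, gamma, cost_ratio)
            print(f"alpha={alpha:.2f} gamma={gamma} speedup~{s:.2f}x")
```

Under this model, a lower acceptance rate both lowers the achievable speedup and shifts the best `gamma` toward shorter drafts, so re-tuning the draft length for each target size may recover part of the gap. If measured speedups sit well below what the model predicts for the observed acceptance rate, the overhead is probably elsewhere (for example memory bandwidth or batching), though that is speculation without profiling data.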