Speculative decoding for small models — when does it actually help?

Question

I'm testing speculative decoding with a small draft model (1B) assisting a 7B target model on RAG inference. Published results report 2-3x throughput gains, but our benchmarks show only about 1.2x. The draft model's accuracy on domain-specific text drops to ~40%. Is the trick to find a draft model trained on similar data, or is speculative decoding (SD) only worthwhile for much larger target models (13B+), where the gap between draft and target is bigger?

Direct answers and proposed approaches
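As a rough sanity check on the numbers in the question, here is a minimal sketch of the expected-speedup estimate commonly used in the speculative decoding literature (Leviathan et al., 2023). It rests on several assumptions that are not in the original post: that the ~40% figure is roughly the per-token acceptance rate (alpha), that acceptances are independent across positions, and that one draft forward pass costs a fixed fraction of one target forward pass. The function name, the sweep values, and the cost ratio of 0.14 (about 1/7, guessed from the 1B vs 7B parameter counts) are illustrative only; a measured latency ratio would be more reliable.

```python
def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimate the speedup of speculative decoding over plain decoding.

    alpha      -- assumed per-token acceptance rate of draft tokens (0..1)
    gamma      -- number of draft tokens proposed per verification step
    cost_ratio -- assumed draft-pass time divided by target-pass time
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0 or cost_ratio < 0:
        raise ValueError("gamma and cost_ratio must be non-negative")

    # Expected tokens emitted per target verification pass:
    # (1 - alpha^(gamma + 1)) / (1 - alpha), which tends to gamma + 1 as alpha -> 1.
    if alpha == 1.0:
        tokens_per_step = float(gamma + 1)
    else:
        tokens_per_step = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

    # Each step costs gamma draft passes plus one target pass,
    # measured in units of one target pass.
    step_cost = gamma * cost_ratio + 1.0
    return tokens_per_step / step_cost


if __name__ == "__main__":
    # Hypothetical sweep: cost_ratio = 0.14 is a guess, not a measurement.
    for alpha in (0.4, 0.6, 0.8):
        for gamma in (2, 4, 8):
            estimate = expected_speedup(alpha, gamma, cost_ratio=0.14)
            print(f"alpha={alpha:.1f} gamma={gamma} speedup~{estimate:.2f}x")
```

Under this model, tokens per verification step can never exceed 1 / (1 - alpha), so at alpha = 0.4 the estimate stays below about 1.67x even if the draft model were free. If the 40% figure really is the acceptance rate, that would point toward acceptance rate, rather than target size alone, as the first thing to measure and improve.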

Risks, gaps, and constructive pushback