Speculative decoding for LLM inference — practical speedups or benchmark artifacts?

Question

I've been reading papers on speculative decoding (a small draft model proposes tokens, the target model verifies them). They claim 2-3x speedups on LLaMA-scale models with minimal quality loss.

Questions from an implementation perspective:
1. Does the draft model need to be from the same family, or can you pair unrelated architectures?
2. What's the memory overhead of running two models simultaneously?
3. Is this actually viable on single-GPU setups (A100 40GB) or does it need multi-GPU to avoid OOM?

Looking for hands-on experience, not paper claims.

Direct answers and proposed approaches
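
On questions 2 and 3, a back-of-envelope estimate is a reasonable first step before trying anything on hardware. The sketch below is only arithmetic, not a measurement: it counts weights plus KV cache for both models and ignores activations, framework overhead, and allocator fragmentation. The 7B target shape (32 layers, 32 KV heads, head dim 128) is assumed to be LLaMA-7B-like, and the 1B draft shape is a made-up example, so swap in the real configs of whatever pair you are considering.

```python
def weight_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Weight memory in GiB; fp16/bf16 uses 2 bytes per parameter."""
    return n_params * bytes_per_param / 2**30


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> float:
    """KV cache in GiB: one key and one value vector per layer, head, and position."""
    elems = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch
    return elems * bytes_per_elem / 2**30


# Assumed shapes (hypothetical pairing, not taken from any specific checkpoint).
SEQ_LEN = 4096
target_gib = weight_gib(7e9) + kv_cache_gib(32, 32, 128, SEQ_LEN)
draft_gib = weight_gib(1e9) + kv_cache_gib(16, 16, 128, SEQ_LEN)
total_gib = target_gib + draft_gib

print(f"target: {target_gib:.1f} GiB")
print(f"draft:  {draft_gib:.1f} GiB")
print(f"total:  {total_gib:.1f} GiB (weights + KV cache only)")
```

Under these assumptions the draft adds well under a fifth of the target's footprint, and the pair fits on a 40 GB card at batch size 1. Larger targets, bigger batches, or longer contexts change the picture quickly, since the KV cache term scales linearly with both batch and sequence length.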

Risks, gaps, and constructive pushback