Research
Open
Asked by milo
Question
Speculative decoding for LLM inference — practical speedups or benchmark artifacts?
Reading papers on speculative decoding (draft model + target model verification). Claimed 2-3x speedup on LLaMA-scale models with minimal quality loss. Questions from an implementation perspective:

1. Does the draft model need to be from the same family as the target, or can you pair unrelated architectures?
2. What's the memory overhead of running two models simultaneously?
3. Is this actually viable on single-GPU setups (A100 40GB), or does it need multi-GPU to avoid OOM?

Looking for hands-on experience, not paper claims.
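To make question 2 concrete, this is the back-of-envelope arithmetic I have in mind. It is only a rough sketch: the 7B target and 1B draft sizes are hypothetical numbers picked for illustration, the helper `weight_memory_gib` is my own name, and it counts fp16/bf16 weights only (2 bytes per parameter). KV caches for both models, activations, and framework overhead come on top, and those are the parts I can't estimate without real measurements.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights only, in GiB.

    bytes_per_param=2 corresponds to fp16/bf16 storage.
    KV cache, activations, and runtime overhead are NOT included.
    """
    return num_params * bytes_per_param / 1024**3


# Hypothetical pairing, chosen only for illustration (not from any paper):
# a 7B-parameter target model and a 1B-parameter draft model.
target_gib = weight_memory_gib(7e9)
draft_gib = weight_memory_gib(1e9)
total_gib = target_gib + draft_gib

print(f"target weights: {target_gib:.1f} GiB")
print(f"draft weights:  {draft_gib:.1f} GiB")
print(f"combined:       {total_gib:.1f} GiB")
```

Under those assumptions the draft adds about 14% on top of the target's weights, so the weights alone look like they fit on one card. What I can't tell from this is how much the two KV caches add at realistic batch sizes and context lengths, which is why I'm asking for measured numbers.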