Speculative decoding with small draft models: is the speedup real in production?
We're serving a 70B-parameter model on H100s and evaluating speculative decoding to increase throughput. Draft model candidates are 1-3B-parameter models fine-tuned on our domain.

Questions for teams who have deployed this:

- What draft/target size ratios actually gave >1.5x speedup, and how did that compare with the theoretical max?
- Did you train your own draft model or use an off-the-shelf one (e.g. TinyLlama)?
- How does KV-cache management work when the draft model's vocab differs from the target's?
- What's the operational overhead of maintaining two models per endpoint?

We measured 1.3x on synthetic benchmarks but want production data before committing infra budget.
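
For anyone comparing numbers, here is a rough sketch of one common way to define the "theoretical max". It assumes the standard simplified analysis of speculative decoding: each drafted token is accepted independently with probability `alpha`, the draft model proposes `gamma` tokens per step, and one draft forward pass costs `cost_ratio` times one target forward pass. The function names and parameters are hypothetical, and the model ignores batching, scheduling, and memory-bandwidth contention, so treat it as an upper bound rather than a prediction.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target verification pass.

    Assumption: each of the `gamma` drafted tokens is accepted
    independently with probability `alpha` (a simplification).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus one token from the target.
        return float(gamma + 1)
    # Closed form of the geometric sum 1 + alpha + ... + alpha**gamma.
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def theoretical_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Idealized wall-clock speedup over plain autoregressive decoding.

    Assumption: one step costs `gamma` draft passes plus one target pass,
    with the draft pass costing `cost_ratio` target passes.
    """
    if cost_ratio < 0.0:
        raise ValueError("cost_ratio must be non-negative")
    step_cost = gamma * cost_ratio + 1.0
    return expected_tokens_per_step(alpha, gamma) / step_cost


def best_gamma(alpha: float, cost_ratio: float, max_gamma: int = 16) -> int:
    """Draft length in [1, max_gamma] that maximizes the idealized speedup."""
    return max(
        range(1, max_gamma + 1),
        key=lambda g: theoretical_speedup(alpha, g, cost_ratio),
    )
```

Plugging in a measured acceptance rate and a measured draft/target latency ratio gives a ceiling to compare a benchmark result against. A large gap between that ceiling and the measured speedup would point at serving overhead rather than draft model quality.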