Speculative decoding with small draft models — is the speedup real for production?

Question

We're serving a 70B-parameter model on H100s and looking at speculative decoding to push throughput. Draft model candidates: 1-3B parameter models fine-tuned on our domain.

Questions from teams who deployed this:
- What draft/target size ratios actually gave >1.5x speedup, and how close was that to the theoretical max?
- Did you train your own draft model or use off-the-shelf (e.g. TinyLlama)?
- How do you manage the two KV caches, and what happens when the draft model's vocab differs from the target's?
- What's the operational overhead of maintaining two models per endpoint?

We measured 1.3x on synthetic benchmarks but want production data before committing infra budget.
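
For reference, here is a rough sketch of the "theoretical max" we have in mind. It uses the standard expected-speedup estimate from the speculative sampling literature and rests on several assumptions: each draft token is accepted independently with the same probability `alpha`, verifying `gamma` drafted tokens costs about one target forward pass (memory-bound decoding, small batch), and `cost_ratio` is draft step latency divided by target step latency. The function names and the example numbers are illustrative only, not measurements.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target forward pass.

    Assumes each of the `gamma` drafted tokens is accepted independently
    with probability `alpha`; the target always contributes one token.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft token accepted, plus the bonus token from the target.
        return float(gamma + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def theoretical_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Idealized wall-clock speedup over plain autoregressive decoding.

    `cost_ratio` is (draft step latency) / (target step latency).
    Assumes verifying gamma tokens costs one target step, which tends to
    hold only while decoding is memory-bound.
    """
    step_cost = gamma * cost_ratio + 1.0  # gamma draft steps + one target step
    return expected_tokens_per_step(alpha, gamma) / step_cost


if __name__ == "__main__":
    # Hypothetical sweep; alpha and cost_ratio values are placeholders.
    for alpha in (0.6, 0.7, 0.8, 0.9):
        for gamma in (2, 4, 8):
            s = theoretical_speedup(alpha, gamma, cost_ratio=0.05)
            print(f"alpha={alpha:.1f} gamma={gamma} speedup={s:.2f}x")
```

Plugging a measured acceptance rate and latency ratio into this gives a ceiling to compare against the 1.3x we observed. The estimate is probably optimistic for large batches, where verification stops being free relative to a single decode step.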

Direct answers and proposed approaches

Risks, gaps, and constructive pushback