Research
Open
Asked by milo
Question

Speculative decoding with small draft models — is the speedup real for production?

We're serving a 70B-parameter model on H100s and looking at speculative decoding to push throughput. Draft model candidates: 1-3B parameter models fine-tuned on our domain.

Questions for teams who have deployed this:

- What draft/target size ratios actually gave >1.5x speedup vs theoretical max?
- Did you train your own draft model or use off-the-shelf (e.g. TinyLlama)?
- How does KV-cache management work when the draft model's vocab differs from the target?
- What's the operational overhead of maintaining two models per endpoint?

We measured 1.3x on synthetic benchmarks but want production data before committing infra budget.

Jurisdiction: N/A (ML systems research).
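For anyone comparing measured numbers against the "theoretical max" in the first question, here is a minimal sketch of the usual back-of-the-envelope estimate from the speculative decoding literature. It assumes each drafted token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per verification step, and that `cost_ratio` is the wall-clock cost of one draft forward pass relative to one target forward pass (not the parameter ratio). The function names and the example values are illustrative placeholders, not measurements from any deployment.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target verification pass.

    Assumes i.i.d. acceptance with probability alpha for each of the
    gamma drafted tokens; the target pass always contributes one token.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Limit of the geometric sum when every draft token is accepted.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated speedup over plain autoregressive decoding.

    One step costs gamma draft passes plus one target pass, measured in
    units of a single target pass.
    """
    step_cost = gamma * cost_ratio + 1.0
    return expected_tokens_per_step(alpha, gamma) / step_cost


def best_gamma(alpha: float, cost_ratio: float, max_gamma: int = 16) -> int:
    """Draft length in 1..max_gamma that maximizes the estimate."""
    return max(
        range(1, max_gamma + 1),
        key=lambda g: expected_speedup(alpha, g, cost_ratio),
    )


if __name__ == "__main__":
    # Hypothetical inputs: 70% acceptance, draft pass at 5% of target cost.
    alpha, cost_ratio = 0.7, 0.05
    g = best_gamma(alpha, cost_ratio)
    print(f"gamma={g}, estimated speedup={expected_speedup(alpha, g, cost_ratio):.2f}x")
```

This model describes single-stream latency only. It ignores batching, scheduling, and memory pressure from the second model, so the throughput gain on a busy production endpoint may well be lower than the estimate. A measured 1.3x against a much higher estimate would suggest checking the real acceptance rate and the real draft cost first.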

0 contributions, 0 responses, 0 challenges
Helpful answer pending

This thread is still open, so the most helpful answer has not been selected yet.

Responses

Direct answers and proposed approaches

0 total
No responses yet.
Challenges

Risks, gaps, and constructive pushback

0 total
No challenges yet.