LLM response streaming vs batch — latency tradeoffs in production routers

Question

We're building a multi-model router that dispatches between 3-5 providers. The current design streams responses from the fastest model and cancels the rest once a threshold is met.
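
For anyone who wants something concrete to react to, here is a minimal sketch of the race-and-cancel pattern described above. It is not our production code: `fake_stream`, `_read_until`, `race`, and the "first stream to buffer `threshold` tokens wins" rule are all assumptions made for illustration, and the providers are simulated with plain async generators rather than real SSE clients.

```python
import asyncio
from typing import AsyncGenerator, Dict, List, Tuple


async def fake_stream(name: str, delay: float, n_tokens: int) -> AsyncGenerator[str, None]:
    """Hypothetical stand-in for a provider stream: one token every `delay` seconds."""
    for i in range(n_tokens):
        await asyncio.sleep(delay)
        yield f"{name}-{i}"


async def _read_until(stream: AsyncGenerator[str, None], threshold: int) -> List[str]:
    """Buffer tokens until `threshold` have arrived or the stream ends."""
    buf: List[str] = []
    async for tok in stream:
        buf.append(tok)
        if len(buf) >= threshold:
            break
    return buf


async def race(
    streams: Dict[str, AsyncGenerator[str, None]], threshold: int
) -> Tuple[str, List[str]]:
    """Return (winner, buffered tokens) for the first stream to reach `threshold`."""
    tasks = {
        asyncio.create_task(_read_until(s, threshold)): name
        for name, s in streams.items()
    }
    done, pending = await asyncio.wait(list(tasks), return_when=asyncio.FIRST_COMPLETED)

    # Cancel the losers locally. This only stops *our* reads; whether the
    # provider stops generating (and billing) is provider-dependent.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)

    winner_task = next(iter(done))
    winner = tasks[winner_task]

    # Close every non-winning stream; the winner stays open so the caller
    # can keep forwarding its remaining tokens.
    for name, stream in streams.items():
        if name != winner:
            await stream.aclose()
    return winner, winner_task.result()
```

The winner's stream is deliberately left open, so the caller can emit the buffered tokens and then continue iterating `streams[winner]` for the rest of the response.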

The problem: cancellation at the HTTP level is messy. Some providers (OpenAI, Anthropic) handle it cleanly when the client aborts the SSE stream, but others keep the connection alive and keep billing for tokens. We're seeing 15-30% token waste on cancelled requests during high-load periods.

Two approaches we're considering:
1. Pre-fetch with aggressive timeouts (500 ms): wastes tokens but keeps latency predictable
2. Fan-out with speculative decode: read only the first N tokens and drop the rest
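
To make the two options easier to compare, here is a rough sketch of each as a small helper. The interpretation is an assumption: option 1 is read as a hard deadline on the first token, option 2 as "take N tokens, then close the stream". The names `fake_stream`, `first_token_within`, and `take_first_n` are hypothetical, and the provider is simulated with an async generator.

```python
import asyncio
from typing import AsyncGenerator, List, Optional


async def fake_stream(delay: float, n_tokens: int) -> AsyncGenerator[str, None]:
    """Hypothetical stand-in for a provider stream."""
    for i in range(n_tokens):
        await asyncio.sleep(delay)
        yield f"tok{i}"


async def _next_token(stream: AsyncGenerator[str, None]) -> str:
    """Wrap __anext__ in a coroutine so it can be passed to asyncio.wait_for."""
    return await stream.__anext__()


async def first_token_within(
    stream: AsyncGenerator[str, None], timeout_s: float = 0.5
) -> Optional[str]:
    """Option 1 (assumed reading): give up if no token arrives within the deadline."""
    try:
        return await asyncio.wait_for(_next_token(stream), timeout=timeout_s)
    except asyncio.TimeoutError:
        await stream.aclose()  # drop the stream on our side
        return None
    except StopAsyncIteration:
        return None  # stream ended without producing a token


async def take_first_n(stream: AsyncGenerator[str, None], n: int) -> List[str]:
    """Option 2 (assumed reading): read at most n tokens, then close the stream."""
    out: List[str] = []
    if n <= 0:
        await stream.aclose()
        return out
    try:
        async for tok in stream:
            out.append(tok)
            if len(out) >= n:
                break
    finally:
        # Closing locally does not guarantee the provider stops generating;
        # that is exactly the billing problem described above.
        await stream.aclose()
    return out
```

In both helpers the close is client-side only, so neither one reduces waste by itself on providers that keep generating after a disconnect.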

Has anyone operationalized this at scale? What's your actual token waste % after optimization?
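
So that numbers in replies are comparable, here is one possible definition of waste %: tokens billed but never forwarded to the client, divided by all tokens billed. This definition, the `RequestRecord` shape, and the function name are assumptions for illustration, not an established metric.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RequestRecord:
    """Hypothetical per-request accounting record."""
    billed_tokens: int      # output tokens the provider charged for
    delivered_tokens: int   # tokens actually forwarded to the client


def token_waste_pct(records: Iterable[RequestRecord]) -> float:
    """Percentage of billed tokens that were never delivered to the client."""
    rows = list(records)
    billed = sum(r.billed_tokens for r in rows)
    if billed == 0:
        return 0.0
    wasted = sum(r.billed_tokens - r.delivered_tokens for r in rows)
    return 100.0 * wasted / billed
```

For example, one winning request billed and delivered at 100 tokens plus two cancelled requests billed at 20 and 5 tokens gives 25 wasted out of 125 billed.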

Direct answers and proposed approaches

Risks, gaps, and constructive pushback