LLM response streaming vs batch — latency tradeoffs in production routers
We're building a multi-model router that dispatches between 3-5 providers. The current design streams responses from the fastest model and cancels the rest once a threshold is met.

The problem: cancellation at the HTTP level is messy. Some providers (OpenAI, Anthropic) handle it cleanly with SSE abort, but others keep the connection alive and bill for tokens. We're seeing 15-30% waste on cancelled requests during high-load periods.

Two approaches we're considering:

1. Pre-fetch with aggressive timeouts (500ms): waste tokens but keep latency predictable
2. Fan-out with speculative decode: only read first N tokens, drop the rest

Has anyone operationalized this at scale? What's your actual token waste % after optimization?
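
For anyone who wants something concrete to poke at, here is a rough sketch of the "stream from the fastest, cancel the rest" design using Python's asyncio. Everything in it is a stand-in: `fake_stream`, `race`, and `waste_pct` are hypothetical names, the providers are simulated with sleeps rather than real SSE connections, and the simulation assumes a cancelled provider stops generating immediately, which is exactly the behavior that not every real provider gives you. The `timeout` argument plays the role of the 500ms cutoff from approach 1.

```python
import asyncio
from typing import AsyncIterator, Dict, List, Tuple


async def fake_stream(name: str, delay: float, n_tokens: int,
                      billed: Dict[str, int]) -> AsyncIterator[str]:
    """Hypothetical provider stream: waits `delay` seconds, then emits tokens."""
    await asyncio.sleep(delay)
    for i in range(n_tokens):
        billed[name] = billed.get(name, 0) + 1  # count each generated token as billed
        yield f"{name}:{i}"
        await asyncio.sleep(0)  # let other tasks run between tokens


async def _first_token(stream: AsyncIterator[str]) -> str:
    # Wrapped in a coroutine so it can be scheduled as a task.
    return await stream.__anext__()


async def race(streams: Dict[str, AsyncIterator[str]],
               timeout: float) -> Tuple[str, List[str]]:
    """Return (winner name, winner's tokens) and shut down every other stream."""
    tasks = {asyncio.ensure_future(_first_token(s)): name
             for name, s in streams.items()}
    done, pending = await asyncio.wait(list(tasks), timeout=timeout,
                                       return_when=asyncio.FIRST_COMPLETED)
    # Cancel the losers and wait for cancellation to land before closing streams.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    if not done:
        for s in streams.values():
            await s.aclose()
        raise TimeoutError("no provider produced a token before the timeout")
    winner_task = next(iter(done))
    winner = tasks[winner_task]
    for name, s in streams.items():
        if name != winner:
            await s.aclose()
    tokens = [winner_task.result()]
    async for tok in streams[winner]:
        tokens.append(tok)
    return winner, tokens


def waste_pct(billed: Dict[str, int], winner: str) -> float:
    """Share of billed tokens (in percent) that came from non-winning providers."""
    total = sum(billed.values())
    wasted = total - billed.get(winner, 0)
    return 100.0 * wasted / total if total else 0.0


if __name__ == "__main__":
    billed: Dict[str, int] = {}
    streams = {
        "fast": fake_stream("fast", 0.01, 3, billed),
        "slow": fake_stream("slow", 0.5, 3, billed),
    }
    winner, tokens = asyncio.run(race(streams, timeout=1.0))
    print(winner, tokens, f"{waste_pct(billed, winner):.1f}% waste")
```

In this toy setup the losing stream is cancelled before it emits anything, so measured waste is zero. With a real provider that keeps generating after the client disconnects, the client-side cancel looks identical but `billed` would have to come from the provider's usage reporting, not from a local counter.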