Quantizing LLMs for edge deployment: what accuracy loss is acceptable for your use case?

Question

We're deploying a 7B-parameter model on edge devices (Jetson Orin, 32GB RAM) for real-time document classification. The unquantized FP16 baseline is too slow (~8 tok/s). We've tried the GGUF Q4_K_M and Q5_K_M quantizations. Q4_K_M drops accuracy on our test set from 87% to 81%, which may be acceptable given the 3x speedup.

The tricky part: the accuracy loss isn't uniform across classes. Our 'legal-risk' category drops from 92% to 78% under Q4_K_M, while 'general-classification' barely moves (89% → 87%). That makes the business decision non-trivial, because we can't afford false negatives on legal-risk.
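For anyone who wants to reproduce this kind of per-class comparison, here is a minimal sketch. It assumes the per-class figures are recall (correct predictions divided by true examples of that class), which is the metric that tracks false negatives. The function names and the toy labels are hypothetical and not taken from our pipeline.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Return {label: recall}, where recall = correct / true examples of that label."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

def recall_drop(y_true, pred_baseline, pred_quantized):
    """Per-class recall lost when moving from the baseline to the quantized model."""
    base = per_class_recall(y_true, pred_baseline)
    quant = per_class_recall(y_true, pred_quantized)
    return {label: base[label] - quant[label] for label in base}

# Toy data: the quantized model misses one of the two legal-risk documents.
y_true = ["legal-risk", "legal-risk", "general", "general"]
pred_fp16 = ["legal-risk", "legal-risk", "general", "general"]
pred_q4 = ["legal-risk", "general", "general", "general"]
drops = recall_drop(y_true, pred_fp16, pred_q4)
```

Sorting the result by drop size makes it easy to see which classes a given quantization hurts most before looking at the aggregate number.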

Questions:
- How do you decide the accuracy/speed tradeoff for production models?
- Have you tried class-aware quantization (keeping higher precision for the layers that matter most to critical classes)?
- Any experience with AWQ or QuIP# quantization methods vs. the standard GGUF k-quants?
- Do you use a hybrid approach (quantized model for initial filter, full model for confidence-boundary cases)?
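
To make the hybrid idea in the last question concrete, here is a rough sketch of the routing we have in mind. It assumes both models can be wrapped as callables that return a label-to-probability dict. The names `cascade_classify`, `fast_model`, `slow_model`, and both thresholds are hypothetical placeholders, not tuned values.

```python
from typing import Callable, Dict, Tuple

Scores = Dict[str, float]  # label -> probability

def cascade_classify(
    doc: str,
    fast_model: Callable[[str], Scores],   # quantized model (assumed wrapper)
    slow_model: Callable[[str], Scores],   # FP16 model (assumed wrapper)
    critical_label: str = "legal-risk",
    min_confidence: float = 0.9,           # hypothetical threshold
    critical_floor: float = 0.1,           # hypothetical threshold
) -> Tuple[str, str]:
    """Return (label, route), where route is 'fast' or 'slow'."""
    scores = fast_model(doc)
    label = max(scores, key=scores.get)
    confident = scores[label] >= min_confidence
    # Treat any non-trivial probability on the critical class as a reason
    # to escalate, since a false negative there is the costly error.
    critical_possible = scores.get(critical_label, 0.0) >= critical_floor
    if confident and (label == critical_label or not critical_possible):
        return label, "fast"
    slow_scores = slow_model(doc)
    return max(slow_scores, key=slow_scores.get), "slow"

# Stub models standing in for the real ones.
def sure_general(doc): return {"legal-risk": 0.05, "general-classification": 0.95}
def unsure(doc): return {"legal-risk": 0.30, "general-classification": 0.70}
def fp16_stub(doc): return {"legal-risk": 0.80, "general-classification": 0.20}

easy = cascade_classify("doc A", sure_general, fp16_stub)
hard = cascade_classify("doc B", unsure, fp16_stub)
```

The asymmetric check is deliberate: a confident legal-risk prediction is accepted from the fast model, but a non-legal prediction is only accepted when legal-risk is close to ruled out. Whether this pays off depends on how often documents get escalated, which we have not measured.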


Direct answers and proposed approaches

Risks, gaps, and constructive pushback