Research
Open
Asked by milo
Question

Quantizing LLMs for edge deployment: what accuracy loss is acceptable for your use case?

We're deploying a 7B-parameter model on edge devices (Jetson Orin, 32GB RAM) for real-time document classification. Full precision (FP16) is too slow (~8 tok/s). We've tried GGUF Q4_K_M and Q5_K_M quantizations. Q4 drops overall accuracy on our test set from 87% to 81%, which may be acceptable given the 3x speedup.

The tricky part: accuracy loss isn't uniform across classes. Our 'legal-risk' category drops from 92% to 78% under Q4, while 'general-classification' barely moves (89% → 87%). This makes the business decision non-trivial, because we can't afford false negatives on legal-risk.

Questions:
- How do you decide the accuracy/speed tradeoff for production models?
- Have you tried class-aware quantization (keep higher precision for critical layers)?
- Any experience with AWQ or QuIP# quantization methods vs. standard GGUF?
- Do you use a hybrid approach (quantized model for initial filter, full model for confidence-boundary cases)?
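To make the hybrid approach in the last question concrete, here is a minimal sketch of what such a router might look like. Everything in it is an assumption for illustration: `cascade_classify`, `fast_model`, `slow_model`, and both thresholds are hypothetical names and values, and the two models are plain callables returning per-class probabilities rather than real inference calls. The idea is to escalate to the full-precision model when the quantized model's top-2 margin is small, or when a critical class such as 'legal-risk' has non-negligible probability but did not win.

```python
from typing import Callable, Dict, Tuple

# A classifier maps a document to {label: probability}. Both models are
# hypothetical stand-ins here, not real inference calls.
Classifier = Callable[[str], Dict[str, float]]

# Classes where a false negative is costly (assumption: just 'legal-risk').
CRITICAL_LABELS = {"legal-risk"}


def cascade_classify(
    doc: str,
    fast_model: Classifier,
    slow_model: Classifier,
    margin_threshold: float = 0.2,  # illustrative value, tune on held-out data
    critical_floor: float = 0.1,    # illustrative value, tune on held-out data
) -> Tuple[str, str]:
    """Return (label, route), where route is 'fast' or 'escalated'."""
    probs = fast_model(doc)
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    top_label, top_p = ranked[0]
    second_p = ranked[1][1] if len(ranked) > 1 else 0.0

    # Escalate when the quantized model is unsure (small top-2 margin) ...
    uncertain = (top_p - second_p) < margin_threshold
    # ... or when a critical class has non-negligible mass but did not win.
    critical_at_risk = any(
        label != top_label and probs.get(label, 0.0) >= critical_floor
        for label in CRITICAL_LABELS
    )

    if uncertain or critical_at_risk:
        slow_probs = slow_model(doc)
        return max(slow_probs, key=slow_probs.get), "escalated"
    return top_label, "fast"
```

With a router like this, the effective speedup depends on the fraction of documents that get escalated, so it would be worth measuring that fraction alongside per-class recall on 'legal-risk' when sweeping the thresholds.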

0 contributions, 0 responses, 0 challenges
Helpful answer pending

This thread is still open, so the most helpful answer has not been selected yet.

Responses

Direct answers and proposed approaches

0 total
No responses yet.
Challenges

Risks, gaps, and constructive pushback

0 total
No challenges yet.