Quantization-aware training vs post-training quantization for 7B models — accuracy delta on reasoning benchmarks?
Looking at deploying a 7B model (Mistral-class) for a reasoning-heavy workload (code review + technical documentation). Edge deployment target is a single L4 GPU with 24GB VRAM, so we need INT4 or INT8.

Two approaches under consideration:

1. QAT (quantization-aware training): Fine-tune the model with fake-quantization nodes, then export INT4. Requires ~2-3 epochs on our domain data (~50k examples).
2. PTQ (post-training quantization): GPTQ or AWQ on the base model, no additional training. Calibration on ~512 representative samples.

Our internal eval set:

- 200 code review scenarios (find bugs, suggest improvements)
- 150 technical doc generation tasks (API docs from code)
- 100 cross-reference tasks (trace dependency across modules)

Preliminary PTQ results (AWQ INT4): 12% drop on code review accuracy vs FP16, 8% drop on doc generation. The reasoning tasks suffer most: the model produces plausible but incomplete analyses.

Questions:

- Has anyone measured the QAT vs PTQ accuracy gap specifically on reasoning/code tasks (not just perplexity)?
- For QAT, is 2 epochs sufficient or do you need the full fine-tuning schedule?
- Is there a hybrid approach: PTQ for the base model, then LoRA fine-tuning on quantized weights?

Hardware constraint is fixed: cannot scale to A100s for inference. Need the best accuracy possible at INT4 on a single L4.
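
For anyone unsure what "fake-quantization nodes" in option 1 means in practice, here is a minimal sketch of the quantize-dequantize round trip that QAT inserts into the forward pass. It is pure Python for illustration only, not our training code. The function names, the toy weights, the symmetric scaling, and the group size of 4 are all assumptions made for the example (real INT4 setups typically use much larger groups, often 128). In real QAT the rounding step is usually paired with a straight-through estimator so gradients can flow through it, which this sketch does not show.

```python
# Sketch of the fake-quantization step used in QAT: weights are rounded onto a
# low-bit integer grid and immediately mapped back to floats, so the forward
# pass sees the quantization error while the stored weights stay in float.
# Assumptions for illustration: symmetric scaling, per-group scales, toy data.

def fake_quantize(weights, bits=4, group_size=4):
    """Symmetric per-group quantize-dequantize of a flat list of floats."""
    qmax = 2 ** (bits - 1) - 1       # largest integer level, 7 for INT4
    qmin = -(2 ** (bits - 1))        # smallest integer level, -8 for INT4
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # One scale per group; fall back to 1.0 for an all-zero group.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        for w in group:
            q = round(w / scale)             # snap to the integer grid
            q = max(qmin, min(qmax, q))      # clamp to the representable range
            out.append(q * scale)            # dequantize back to float
    return out


def mean_abs_error(original, quantized):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(original, quantized)) / len(original)


# Hypothetical weights, chosen only to make the rounding error visible.
weights = [0.12, -0.53, 0.97, -0.08, 0.004, 0.31, -0.76, 0.45]
int4 = fake_quantize(weights, bits=4)
int8 = fake_quantize(weights, bits=8)

print("INT4 mean abs error:", mean_abs_error(weights, int4))
print("INT8 mean abs error:", mean_abs_error(weights, int8))
```

The difference between the two approaches is when this error is seen: PTQ applies the rounding once to a trained model, while QAT runs it during fine-tuning so the weights can adapt to the grid.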