Quantization-aware training vs post-training quantization for 7B models — accuracy delta on reasoning benchmarks?

Question

We're looking to deploy a 7B model (Mistral-class) for a reasoning-heavy workload (code review and technical documentation). The edge deployment target is a single L4 GPU with 24 GB of VRAM, so we need INT4 or INT8.
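
For context on the memory budget, here is a rough sketch of the weight footprint at each precision. It assumes exactly 7e9 parameters (real Mistral-class checkpoints are slightly larger) and counts weights only, so KV cache, activations, and the per-group scales that INT4 schemes store are not included.

```python
# Back-of-the-envelope weight memory for a 7B-parameter model.
# Assumption: exactly 7e9 parameters; weights only, no KV cache or activations.
PARAMS = 7_000_000_000
BITS_PER_WEIGHT = {"FP16": 16, "INT8": 8, "INT4": 4}

def weight_gib(params: int, bits: int) -> float:
    """Memory for the weights alone, in GiB (2**30 bytes)."""
    return params * bits / 8 / 2**30

for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_gib(PARAMS, bits):.1f} GiB")
```

Whatever remains of the 24 GB after the weights has to cover the KV cache, activations, and runtime overhead, which is where long code-review contexts get expensive.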

Two approaches under consideration:
1. QAT (quantization-aware training): Fine-tune the model with fake-quantization nodes, then export INT4. Requires ~2-3 epochs on our domain data (~50k examples).
2. PTQ (post-training quantization): GPTQ or AWQ on the base model, no additional training. Calibration on ~512 representative samples.
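
To make "fake-quantization nodes" concrete, here is a minimal sketch of the quantize-dequantize round trip such a node applies in the forward pass. It assumes symmetric, per-tensor scaling for simplicity; GPTQ and AWQ use group-wise schemes, and the function name is ours, not any library's API.

```python
def fake_quantize(weights, bits=4):
    """Quantize-dequantize round trip (symmetric, per-tensor; a simplifying assumption).

    Returns floats that sit on the integer grid. In QAT the forward pass sees
    these values, while the backward pass usually treats the rounding as the
    identity (straight-through estimator) so gradients still reach the weights.
    """
    qmax = 2 ** (bits - 1) - 1       # 7 for INT4
    qmin = -(2 ** (bits - 1))        # -8 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    out = []
    for w in weights:
        q = round(w / scale)         # Python rounds exact halves to even
        q = max(qmin, min(qmax, q))  # clamp to the representable range
        out.append(q * scale)        # back to float, now on the grid
    return out

# Hypothetical weights, chosen only to show the rounding error.
original = [0.7, -0.33, 0.12, 0.02]
for w, fq in zip(original, fake_quantize(original)):
    print(f"{w:+.2f} -> {fq:+.2f}")
```

PTQ applies the same rounding once, after training, and uses the calibration samples to pick scales; QAT lets the weights adapt to the rounding error during fine-tuning.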

Our internal eval set:
- 200 code review scenarios (find bugs, suggest improvements)
- 150 technical doc generation tasks (API docs from code)
- 100 cross-reference tasks (trace dependency across modules)

Preliminary PTQ results (AWQ INT4): 12% drop in code review accuracy vs FP16 and 8% drop in doc generation. The reasoning tasks suffer most: the model produces plausible but incomplete analyses.
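
As a sanity check on how much noise eval subsets of this size carry, here is a rough sketch of the worst-case 95% margin on a single accuracy measurement. It assumes independent items and a normal approximation, and it uses only the subset sizes above; no accuracy figures are assumed.

```python
import math

# Eval subset sizes taken from the post; no accuracy figures are assumed.
SUBSETS = {"code review": 200, "doc generation": 150, "cross-reference": 100}

def margin_95(n: int, p: float = 0.5) -> float:
    """Half-width of a 95% normal-approximation interval for accuracy p on n items.

    p = 0.5 is the worst case (widest interval).
    """
    return 1.96 * math.sqrt(p * (1.0 - p) / n)

for name, n in SUBSETS.items():
    print(f"{name} (n={n}): +/- {100 * margin_95(n):.1f} points")
```

At these sizes the worst-case margin is roughly 7 to 10 percentage points per measurement. A paired comparison on the same items (FP16 vs INT4, item by item) is more sensitive than comparing two separate accuracy numbers.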

Questions:
- Has anyone measured the QAT vs PTQ accuracy gap specifically on reasoning/code tasks (not just perplexity)?
- For QAT, is 2 epochs sufficient or do you need the full fine-tuning schedule?
- Is there a hybrid approach: PTQ for the base model, then LoRA fine-tuning on quantized weights?
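
To pin down what the hybrid option would compute, here is a minimal sketch of a LoRA forward pass over a frozen quantized weight matrix (the QLoRA-style idea). The per-tensor scale, the function names, and the toy numbers are all assumptions for illustration; this is not the AWQ or GPTQ kernel path, and whether adapters can be trained on top of a given PTQ format depends on the tooling.

```python
def matvec(m, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(w_q, scale, lora_a, lora_b, x, alpha):
    """y = dequant(W_q) x + (alpha / r) * B (A x)

    w_q:    frozen integer weights (out x in), never updated
    scale:  per-tensor dequantization scale (simplifying assumption)
    lora_a: trainable r x in matrix
    lora_b: trainable out x r matrix, typically initialized to zeros
    """
    r = len(lora_a)
    base = [scale * s for s in matvec(w_q, x)]   # dequantized on the fly
    delta = matvec(lora_b, matvec(lora_a, x))    # low-rank correction
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

# Hypothetical toy values: 2x2 INT4 weights, rank-1 adapter.
w_q = [[7, -3], [1, 0]]
x = [1.0, 2.0]
a = [[0.5, 0.5]]
b_zero = [[0.0], [0.0]]
print(lora_forward(w_q, 0.1, a, b_zero, x, alpha=2.0))  # equals the quantized base output
```

With B at zero the adapter is a no-op, so training starts from the PTQ model's behavior and only the small A and B matrices receive gradients.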

The hardware constraint is fixed: we cannot scale to A100s for inference. We need the best possible accuracy at INT4 on a single L4.

Direct answers and proposed approaches

Risks, gaps, and constructive pushback