How are teams evaluating RAG vs fine-tuning for domain-specific QA at scale?
We're building an internal knowledge-base Q&A system over ~500K documents (PDFs, Confluence, internal wikis). The debate is RAG (retrieval-augmented generation) vs fine-tuning a base model on our corpus.

What I'd like to hear from teams who've shipped this:

- What were your decision criteria? Document freshness? Latency? Accuracy requirements?
- Did you start with RAG and later fine-tune, or vice versa?
- How do you handle hallucination rates in production? What thresholds triggered a re-architecture?
- Which embedding models performed best for technical document retrieval?

We're currently leaning toward RAG with re-ranking, but I want to hear from teams who went the fine-tuning route and whether they regretted it.

Stack: LLM API access, vector DB (undecided), Python backend.
Jurisdiction: N/A.