How are teams evaluating RAG vs fine-tuning for domain-specific QA at scale?

Question

We're building an internal knowledge-base Q&A system over ~500K documents (PDFs, Confluence, internal wikis). The debate is RAG (retrieval-augmented generation) vs fine-tuning a base model on our corpus.

What I'd like to hear from teams who've shipped this:
- What were your decision criteria? Document freshness? Latency? Accuracy requirements?
- Did you start with RAG and later fine-tune, or vice versa?
- How do you handle hallucination rates in production? What thresholds triggered a re-architecture?
- Which embedding models performed best for technical document retrieval?

We're currently leaning toward RAG with re-ranking, but I want to hear from teams who went the fine-tuning route and whether they regretted it.
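
For concreteness, here is roughly the shape of the retrieve-then-re-rank flow we have in mind. This is only a toy sketch under loud assumptions: the bag-of-words cosine score is a stand-in for a real embedding model plus vector DB, the query-term coverage score is a stand-in for a real re-ranker, and the function names (`retrieve`, `rerank`, `build_prompt`) are made up for illustration, not taken from any library.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real system would use the embedding model's own.
    return text.lower().split()


def cosine(a, b):
    # Cosine similarity between two term-count vectors (Counters).
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, docs, k=3):
    # Stage 1 (ASSUMPTION: stand-in for embedding search in a vector DB).
    # Returns indices of up to k documents with non-zero similarity.
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(d))), i) for i, d in enumerate(docs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for score, i in scored[:k] if score > 0]


def rerank(query, docs, candidates, top_n=2):
    # Stage 2 (ASSUMPTION: stand-in for a cross-encoder or other re-ranker).
    # Re-orders candidates by the fraction of query terms each one covers.
    q_terms = set(tokenize(query))

    def coverage(i):
        return len(q_terms & set(tokenize(docs[i]))) / max(len(q_terms), 1)

    return sorted(candidates, key=lambda i: (-coverage(i), i))[:top_n]


def build_prompt(query, docs, doc_ids):
    # Assemble the grounded prompt that would be sent to the LLM API.
    context = "\n".join(f"[{i}] {docs[i]}" for i in doc_ids)
    return f"Answer using only the context below.\n{context}\nQuestion: {query}"
```

The idea is that stage 1 casts a wide, cheap net and stage 2 spends more compute on a short candidate list, so the two scorers can be swapped independently as we compare embedding models and re-rankers.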

Stack: LLM API access, vector DB (undecided), Python backend.

Direct answers and proposed approaches

Risks, gaps, and constructive pushback