Evaluating hallucination rates across open-weight models on domain-specific QA

Question

We built a benchmark of ~500 Q&A pairs from our internal technical docs (mostly infrastructure runbooks and API specifications). We are testing Llama-3.1-70B, Mistral Large 2, and Qwen-2.5-72B on it with identical prompts.

Results so far:
- Llama-3.1-70B: ~12% hallucination rate (fabricated endpoint names, confident wrong answers)
- Mistral Large 2: ~8% (more "I don't know" responses, fewer confident fabrications)
- Qwen-2.5-72B: ~15% (surprisingly high, but strongest at code-related questions)
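
For a sense of how much weight these point estimates can bear, here is a minimal sketch that puts a 95% Wilson score interval around each rate. It assumes exactly 500 graded answers per model (the benchmark is only "~500") and treats each answer as an independent trial; the function name and the dictionary are illustrative, not part of our pipeline.

```python
import math

def wilson_interval(rate, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    denom = 1 + z**2 / n
    center = (rate + z**2 / (2 * n)) / denom
    half = z * math.sqrt(rate * (1 - rate) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

N = 500  # assumption: the post only says "~500" pairs
observed = {
    "Llama-3.1-70B": 0.12,
    "Mistral Large 2": 0.08,
    "Qwen-2.5-72B": 0.15,
}

for name, rate in observed.items():
    lo, hi = wilson_interval(rate, N)
    print(f"{name}: {rate:.0%} (95% CI {lo:.1%} to {hi:.1%})")
```

At this sample size each interval is roughly three percentage points wide on either side, so the 8% vs. 12% gap is less clear-cut than the point estimates suggest. Since all three models answer the same questions, a paired comparison would likely be tighter than comparing independent intervals.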

Answers are graded by a separate judge model (GPT-4o-mini acting as a rubric grader), which introduces its own noise: we estimate a ~3% false positive rate.
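
As a rough illustration of what that judge noise does to the headline numbers, the sketch below applies the standard Rogan-Gladen correction for an imperfect classifier. It rests on two assumptions that we have not verified: "false positive" means the judge flags a correct answer as a hallucination, and the judge's sensitivity is 1.0 (it never misses a real hallucination). The second is almost certainly optimistic, so `sensitivity` is left as a parameter.

```python
def corrected_rate(observed, false_positive_rate, sensitivity=1.0):
    """Rogan-Gladen estimate of the true rate given a noisy judge.

    observed            -- fraction of answers the judge flagged
    false_positive_rate -- P(judge flags | answer is actually correct)
    sensitivity         -- P(judge flags | answer is actually hallucinated);
                           1.0 is an assumption, not a measured value
    """
    denom = sensitivity - false_positive_rate
    if denom <= 0:
        raise ValueError("judge is no better than chance; cannot correct")
    estimate = (observed - false_positive_rate) / denom
    # Clamp, since sampling noise can push the estimate outside [0, 1].
    return min(1.0, max(0.0, estimate))

JUDGE_FPR = 0.03  # our estimate from the post
for name, rate in [("Llama-3.1-70B", 0.12),
                   ("Mistral Large 2", 0.08),
                   ("Qwen-2.5-72B", 0.15)]:
    print(f"{name}: observed {rate:.1%}, corrected {corrected_rate(rate, JUDGE_FPR):.1%}")
```

Under these assumptions the correction lowers every rate by about three points but leaves the ranking unchanged. If the judge also misses some hallucinations (sensitivity below 1.0), the corrected rates move back up, so measuring sensitivity on a hand-labeled sample would matter before trusting the adjusted figures.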

How are others approaching this? Are you using deterministic checking against a knowledge base, or accepting that LLM-as-judge is the pragmatic path despite the noise? We are also interested in approaches that reduce hallucination at inference time without full fine-tuning.
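
To make the "deterministic checking" option concrete, here is a minimal sketch of the kind of check we have in mind for fabricated endpoint names: pull anything that looks like an API path out of a model answer and compare it to the set of endpoints in the specs. The endpoint paths, the regex, and the function name are all hypothetical placeholders, not taken from our docs.

```python
import re

# Matches path-like tokens such as /v1/deployments/{id}/rollback.
# Assumption: endpoints are slash-separated segments of letters, digits,
# underscores, hyphens, and {placeholders}.
ENDPOINT_RE = re.compile(r"/[A-Za-z0-9_\-{}]+(?:/[A-Za-z0-9_\-{}]+)*")

def unknown_endpoints(answer, known_endpoints):
    """Return endpoint-like strings in `answer` that are not in the known set."""
    mentioned = set(ENDPOINT_RE.findall(answer))
    return sorted(mentioned - set(known_endpoints))

# Hypothetical knowledge base extracted from the API specifications.
KNOWN = {"/v1/deployments", "/v1/deployments/{id}/rollback"}

answer = "Call POST /v1/deployments/{id}/rollback, then poll /v1/rollouts/status."
print(unknown_endpoints(answer, KNOWN))
```

A check like this only catches one class of hallucination (invented identifiers) and will misfire on any text that merely looks like a path, so it would complement a judge model rather than replace it.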

Direct answers and proposed approaches

Risks, gaps, and constructive pushback