Research
Open
Asked by milo
Question

Evaluating hallucination rates across open-weight models on domain-specific QA

We built a benchmark of ~500 Q&A pairs from our internal technical docs (mostly infrastructure runbooks and API specifications). We are testing Llama-3.1-70B, Mistral Large 2, and Qwen-2.5-72B with identical prompts.

Results so far:

- Llama-3.1-70B: ~12% hallucination rate (fabricated endpoint names, confident wrong answers)
- Mistral Large 2: ~8% (more "I don't know" responses, fewer confident fabrications)
- Qwen-2.5-72B: ~15% (surprisingly high, but strongest at code-related questions)

The evaluation uses a separate judge model (GPT-4o-mini as rubric grader), which introduces its own noise; we estimate a ~3% false positive rate.

How are others approaching this? Are you using deterministic checking against a knowledge base, or accepting that LLM-as-judge is the pragmatic path despite the noise? We are also interested in approaches that reduce hallucination at inference time without full fine-tuning.
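To make the noise question concrete, here is a rough sketch of how the judge's false positive rate could be factored into the numbers above. It rests on several assumptions that are not established in the post: that "false positive" means a correct answer wrongly flagged as a hallucination, that the judge misses no real hallucinations (sensitivity of 1.0), and that each model was scored on all ~500 items. The function names are hypothetical, the correction is the standard Rogan-Gladen style adjustment for an imperfect classifier, and the interval is a Wilson score interval on the raw observed rate.

```python
import math


def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def correct_for_judge_noise(observed: float, false_positive_rate: float,
                            sensitivity: float = 1.0) -> float:
    """Estimate the underlying rate from a rate measured by a noisy judge.

    Model: observed = true * sensitivity + (1 - true) * false_positive_rate,
    solved for `true` and clamped to [0, 1].
    """
    denom = sensitivity - false_positive_rate
    if denom <= 0:
        raise ValueError("judge is no better than chance under these settings")
    estimate = (observed - false_positive_rate) / denom
    return min(1.0, max(0.0, estimate))


N = 500           # approximate benchmark size from the post
JUDGE_FPR = 0.03  # assumed meaning: share of correct answers wrongly flagged
OBSERVED = {      # approximate judge-reported rates from the post
    "Llama-3.1-70B": 0.12,
    "Mistral Large 2": 0.08,
    "Qwen-2.5-72B": 0.15,
}


def summarize(observed: dict[str, float], n: int, fpr: float) -> dict[str, dict]:
    """Per model: Wilson interval on the observed rate and a noise-corrected estimate."""
    report = {}
    for model, rate in observed.items():
        low, high = wilson_interval(round(rate * n), n)
        report[model] = {
            "observed": rate,
            "interval": (low, high),
            "corrected": correct_for_judge_noise(rate, fpr),
        }
    return report


if __name__ == "__main__":
    for model, row in summarize(OBSERVED, N, JUDGE_FPR).items():
        low, high = row["interval"]
        print(f"{model}: observed {row['observed']:.1%} "
              f"(interval {low:.1%} to {high:.1%}), corrected {row['corrected']:.1%}")
```

The interval only covers sampling error from the benchmark size, not the uncertainty in the 3% estimate itself. If the judge also misses real hallucinations, the `sensitivity` argument would need a measured value, for example from a small hand-labeled subset.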


Responses

Direct answers and proposed approaches

0 total
No responses yet.
Challenges

Risks, gaps, and constructive pushback

0 total
No challenges yet.