Research
Open
Asked by milo
Question

Evaluating retrieval quality in RAG pipelines without ground truth

We have a RAG system that indexes ~50K internal docs, and we have no labeled Q&A pairs to evaluate retrieval quality against. We're experimenting with synthetic query generation: an LLM writes a question from each chunk, and we then measure whether that chunk ranks in the top-k for its own question. The problem is that this evaluation is circular, because the same model that generates the queries also retrieves them.

Has anyone adapted external benchmark datasets to their own domain, or built a human-in-the-loop evaluation where engineers rate retrieval results on a sample set? We're looking for practical approaches that don't require weeks of annotation work.
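To make the synthetic check concrete, here is a minimal sketch of the metric it produces: hit rate and mean reciprocal rank at k, computed over (chunk, generated question) pairs. Everything named here is an assumption for illustration. `self_retrieval_scores`, `toy_retrieve`, and the three sample chunks are hypothetical, and the word-overlap retriever only stands in for whatever the real pipeline calls.

```python
from typing import Callable, Dict, List, Tuple


def self_retrieval_scores(
    synthetic_queries: Dict[str, str],
    retrieve: Callable[[str, int], List[str]],
    k: int = 5,
) -> Tuple[float, float]:
    """Return (hit_rate_at_k, mrr_at_k) for chunk_id -> generated query pairs.

    `retrieve(query, k)` is assumed to return chunk ids, best match first.
    """
    if not synthetic_queries:
        raise ValueError("need at least one (chunk_id, query) pair")
    hits = 0
    reciprocal_rank_sum = 0.0
    for chunk_id, query in synthetic_queries.items():
        ranked_ids = retrieve(query, k)[:k]
        if chunk_id in ranked_ids:
            hits += 1
            reciprocal_rank_sum += 1.0 / (ranked_ids.index(chunk_id) + 1)
    n = len(synthetic_queries)
    return hits / n, reciprocal_rank_sum / n


# Hypothetical toy corpus and retriever (word overlap), standing in for the
# real index and retriever.
CHUNKS = {
    "c1": "rotate api keys every ninety days",
    "c2": "deploy the service with the blue green strategy",
    "c3": "request vpn access through the it portal",
}


def toy_retrieve(query: str, k: int) -> List[str]:
    query_words = set(query.lower().split())
    ranked = sorted(
        CHUNKS,
        # Most shared words first; ties broken by chunk id for determinism.
        key=lambda cid: (-len(query_words & set(CHUNKS[cid].split())), cid),
    )
    return ranked[:k]


# The first two queries reuse wording from their source chunk; the third is a
# paraphrase that shares almost no vocabulary with it.
queries = {
    "c1": "how often should api keys be rotated",
    "c2": "which strategy do we use to deploy the service",
    "c3": "how do i get onto the corporate network remotely",
}

hit_rate_at_1, mrr_at_1 = self_retrieval_scores(queries, toy_retrieve, k=1)
hit_rate_at_3, mrr_at_3 = self_retrieval_scores(queries, toy_retrieve, k=3)
print(f"hit@1={hit_rate_at_1:.2f} mrr@1={mrr_at_1:.2f}")
print(f"hit@3={hit_rate_at_3:.2f} mrr@3={mrr_at_3:.2f}")
```

In this toy setup the two queries that echo their chunk's wording rank first, while the paraphrased one does not. That is one way the circularity concern can show up in the numbers: queries written from a chunk tend to share its vocabulary, so the score may say more about that overlap than about how real user questions would fare.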

