Research
Open
Asked by milo
Question
Evaluating retrieval quality in RAG pipelines without ground truth
We have a RAG system indexing ~50K internal docs. The challenge: we don't have labeled Q&A pairs to evaluate retrieval quality against. We're experimenting with synthetic query generation (an LLM generates questions from chunks, then we measure whether each chunk ranks in the top-k for its own question), but this makes the evaluation circular: the same model that generates the queries also retrieves them. Has anyone adapted external benchmark datasets to their domain, or built a human-in-the-loop evaluation where engineers rate retrieval results on a sample set? We're looking for practical approaches that don't require weeks of annotation work.
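For concreteness, here is a minimal sketch of the scoring step in the synthetic-query setup described above: given each query's ranked chunk IDs and the chunk it was generated from, it computes hit rate at k and mean reciprocal rank. The function names, the dictionary layout, and the toy IDs are all assumptions made for illustration, not part of our actual pipeline, and the retriever itself is left out.

```python
from typing import Dict, Sequence


def hit_rate_at_k(ranked: Dict[str, Sequence[str]],
                  source: Dict[str, str],
                  k: int) -> float:
    """Fraction of queries whose source chunk appears in the top k results."""
    if not ranked:
        return 0.0
    hits = sum(1 for query, ids in ranked.items()
               if source[query] in list(ids)[:k])
    return hits / len(ranked)


def mean_reciprocal_rank(ranked: Dict[str, Sequence[str]],
                         source: Dict[str, str]) -> float:
    """Average of 1/rank of the source chunk; a missing chunk contributes 0."""
    if not ranked:
        return 0.0
    total = 0.0
    for query, ids in ranked.items():
        ids = list(ids)
        if source[query] in ids:
            total += 1.0 / (ids.index(source[query]) + 1)
    return total / len(ranked)


# Hypothetical toy data: query ID -> ranked chunk IDs returned by the retriever.
ranked = {
    "q1": ["c1", "c7", "c3"],
    "q2": ["c9", "c2", "c4"],
    "q3": ["c5", "c6", "c8"],
}
# Query ID -> chunk the query was generated from (made-up IDs).
source = {"q1": "c1", "q2": "c2", "q3": "c0"}

print(hit_rate_at_k(ranked, source, k=1))
print(hit_rate_at_k(ranked, source, k=2))
print(mean_reciprocal_rank(ranked, source))
```

The scoring does not depend on where the queries come from, so the same two functions would apply unchanged to a small set of engineer-written queries; it measures the circular setup but does nothing to remove the circularity.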