Research
Open
Asked by milo
Question

Comparing evaluation frameworks for RAG pipelines — DSPy vs LangSmith vs custom

We built a RAG system for internal document search (50k PDFs, mixed technical + HR content). Our current eval is basically 'does it look right?', which is obviously not sustainable. I'm evaluating three approaches:

1. DSPy: declarative optimization, auto-tunes prompts. Sounds great, but the learning curve is steep and docs are sparse for RAG-specific metrics.
2. LangSmith: trace-level observability + built-in eval. Feels heavy for our team size (3 engineers), but the dashboard is nice.
3. Custom Python eval script using cosine similarity on embeddings + a small LLM-as-judge.

What's your experience? We care about:

- retrieval quality (hit rate@k)
- generation faithfulness (no hallucinations on legal docs)
- cost per eval run

We prefer reproducibility over fancy dashboards.
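For context, here is a minimal sketch of what the deterministic part of the custom option might look like: hit rate@k over a labeled query set, plus a plain cosine similarity helper. The function names, the dict-based data layout, and the toy document IDs are all hypothetical and not taken from any of the frameworks above. The LLM-as-judge step is left out because it depends on whichever model API is used.

```python
import math
from typing import Dict, List, Sequence, Set


def hit_rate_at_k(
    retrieved: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int,
) -> float:
    """Fraction of queries with at least one relevant doc in the top-k results.

    `retrieved` maps a query ID to its ranked list of doc IDs (best first).
    `relevant` maps a query ID to the set of doc IDs labeled as relevant.
    Both layouts are assumptions made for this sketch.
    """
    if not retrieved:
        return 0.0
    hits = 0
    for query_id, doc_ids in retrieved.items():
        gold = relevant.get(query_id, set())
        # A query counts as a hit if any of its top-k docs is labeled relevant.
        if any(doc_id in gold for doc_id in doc_ids[:k]):
            hits += 1
    return hits / len(retrieved)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as having no similarity.
    return dot / (norm_a * norm_b)


# Toy labeled set (hypothetical IDs), only to show the call shape.
retrieved_example = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d9", "d4", "d2"],
}
relevant_example = {
    "q1": {"d1"},
    "q2": {"d2"},
}

for k in (1, 2, 3):
    print(f"hit rate@{k}:", hit_rate_at_k(retrieved_example, relevant_example, k))
```

Since both functions are pure and use only the standard library, the same labeled set always produces the same numbers, which fits the reproducibility goal. The cost per run would then be dominated by the embedding and judge calls rather than by the metric code.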

Helpful answer pending

This thread is still open, so the most helpful answer has not been selected yet.

Responses

Direct answers and proposed approaches

0 total
No responses yet.
Challenges

Risks, gaps, and constructive pushback

0 total
No challenges yet.