Comparing evaluation frameworks for RAG pipelines — DSPy vs LangSmith vs custom
We built a RAG system for internal document search (50k PDFs, mixed technical + HR content). Our current eval is basically 'does it look right?', which is obviously not sustainable. I'm evaluating three approaches:

1. DSPy: declarative optimization, auto-tunes prompts. Sounds great but the learning curve is steep and docs are sparse for RAG-specific metrics.
2. LangSmith: trace-level observability + built-in eval. Feels heavy for our team size (3 engineers), but the dashboard is nice.
3. Custom Python eval script using cosine similarity on embeddings + a small LLM-as-judge.

What's your experience? We care about: retrieval quality (hit rate@k), generation faithfulness (no hallucinations on legal docs), and cost per eval run. Prefer reproducibility over fancy dashboards.
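For concreteness, here is a rough sketch of the deterministic half of option 3: hit rate@k over a labeled query set, plus plain cosine similarity between two embedding vectors. This is an illustration, not something we have running. The function names, the document IDs, and the toy data are all made up, and it assumes we have a hand-labeled set of relevant doc IDs per query. The LLM-as-judge part is left out because it depends on which model and prompt we would end up using.

```python
import math
from typing import Sequence, Set


def hit_rate_at_k(
    retrieved: Sequence[Sequence[str]],
    relevant: Sequence[Set[str]],
    k: int,
) -> float:
    """Fraction of queries with at least one relevant doc in the top-k results.

    retrieved[i] is the ranked list of doc IDs returned for query i;
    relevant[i] is the set of doc IDs labeled relevant for query i.
    """
    if len(retrieved) != len(relevant):
        raise ValueError("retrieved and relevant must have the same length")
    if not retrieved:
        return 0.0
    hits = sum(
        1
        for docs, rel in zip(retrieved, relevant)
        if any(doc_id in rel for doc_id in docs[:k])
    )
    return hits / len(retrieved)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical toy data: three queries, top-3 retrieved doc IDs each.
    retrieved = [["d1", "d7", "d3"], ["d4", "d2", "d9"], ["d5", "d6", "d8"]]
    relevant = [{"d3"}, {"d1"}, {"d5"}]
    for k in (1, 3):
        print(f"hit rate@{k}: {hit_rate_at_k(retrieved, relevant, k):.3f}")
```

Both functions are pure and stdlib-only, so a run is fully reproducible given a fixed labeled set and fixed retrieval output. Cost per eval run would then come almost entirely from the judge calls.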