Comparing evaluation frameworks for RAG pipelines — DSPy vs LangSmith vs custom

Question

We built a RAG system for internal document search (50k PDFs, a mix of technical and HR content). Our current eval is basically 'does it look right?', which is obviously not sustainable.

I'm evaluating three approaches:
1. DSPy: declarative optimization, auto-tunes prompts. Sounds great but the learning curve is steep and docs are sparse for RAG-specific metrics.
2. LangSmith: trace-level observability + built-in eval. Feels heavy for our team size (3 engineers), but the dashboard is nice.
3. Custom Python eval script using cosine similarity on embeddings + a small LLM-as-judge.
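
To make option 3 concrete, here is a minimal sketch of the cosine-similarity part only. It assumes the answer and reference embeddings are already computed and available as plain lists of floats. The function names and the 0.8 threshold are illustrative placeholders, not tuned values, and the LLM-as-judge step is not shown.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def passes_similarity(
    answer_vec: Sequence[float],
    reference_vec: Sequence[float],
    threshold: float = 0.8,  # hypothetical cutoff, would need calibration
) -> bool:
    """True if the generated answer's embedding is close enough to the reference."""
    return cosine_similarity(answer_vec, reference_vec) >= threshold
```

Embedding similarity only says two texts are topically close, so on its own it would not catch a fluent but unsupported claim; that is the gap the LLM-as-judge is meant to cover.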

What's your experience? We care about retrieval quality (hit rate@k), generation faithfulness (no hallucinations on legal docs), and cost per eval run. We prefer reproducibility over fancy dashboards.
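
For reference, this is roughly how hit rate@k could be computed in a reproducible way: a query counts as a hit if any of its top-k retrieved document IDs appears in the labeled relevant set. It assumes a labeled set of queries with known relevant documents exists. The function names and data layout are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def hit_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int) -> bool:
    """True if any of the top-k retrieved IDs is a known relevant document."""
    if k <= 0:
        raise ValueError("k must be positive")
    return any(doc_id in relevant_ids for doc_id in retrieved_ids[:k])


def hit_rate_at_k(results: List[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Fraction of queries with at least one relevant document in the top k.

    Each element of `results` is (ranked retrieved IDs, set of relevant IDs)
    for one query.
    """
    if not results:
        raise ValueError("results must not be empty")
    hits = sum(hit_at_k(retrieved, relevant, k) for retrieved, relevant in results)
    return hits / len(results)


# Toy data, made up for illustration only.
examples = [
    (["a", "b", "c"], {"c"}),  # relevant doc is at rank 3
    (["d", "e"], {"x"}),       # relevant doc was never retrieved
]
print(hit_rate_at_k(examples, k=3))
```

Because this is a pure function over stored retrieval results, rerunning it on the same labeled set gives the same number, with no LLM calls involved.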

Direct answers and proposed approaches

Risks, gaps, and constructive pushback