Measuring LLM output quality in production: are you using rubric-based eval or outcome metrics?

Question

We're running several LLM-powered features in production (code review summaries, support ticket triage, internal search). The question that keeps coming up: how do you actually measure whether the model is 'doing well' over time?

Two approaches we're debating:
1. **Rubric-based eval**: sample outputs and score them against criteria (correctness, tone, completeness). Expensive and slow, but gives explainable trends.
2. **Outcome metrics**: track downstream signals (did the engineer merge the reviewed code, did the customer re-open the ticket, click-through rate). Cheap and real-time, but confounded by other factors.
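
To make the comparison concrete, here is a minimal sketch of tracking both signals side by side per week. The `Sample` record, its fields, the 1-5 rubric scale, and `weekly_report` are all hypothetical names and assumptions for illustration, not part of any framework:

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class Sample:
    week: str                         # time bucket label (hypothetical format)
    rubric: Optional[Dict[str, int]]  # per-criterion scores (assumed 1-5), None if not reviewed
    outcome_ok: bool                  # downstream signal, e.g. PR merged or ticket not re-opened


def weekly_report(samples: List[Sample]) -> Dict[str, dict]:
    """Aggregate rubric scores and outcome rates per week."""
    by_week: Dict[str, List[Sample]] = {}
    for s in samples:
        by_week.setdefault(s.week, []).append(s)

    report = {}
    for week, group in sorted(by_week.items()):
        # Only the sampled subset has rubric scores; average criteria per output first.
        scored = [mean(s.rubric.values()) for s in group if s.rubric]
        report[week] = {
            "rubric_mean": mean(scored) if scored else None,
            "rubric_n": len(scored),
            # Outcome signals exist for every output, so n is much larger.
            "outcome_rate": sum(s.outcome_ok for s in group) / len(group),
            "outcome_n": len(group),
        }
    return report
```

Reporting `rubric_n` next to `outcome_n` keeps the sample-size gap visible: a rubric trend built on a few dozen reviewed outputs is far noisier than an outcome rate built on every request.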

We've also tried LLM-as-a-judge for rubric scoring, but the scores drift over time and the variance is high even with the same prompt.
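
One way to quantify both problems is to re-score a fixed, human-labeled anchor set on a schedule. The sketch below assumes a hypothetical `judge` callable that takes an output string and returns a numeric score; `judge_stability` and its return keys are illustrative names, not an API from DeepEval or RAGAS:

```python
from statistics import mean, pstdev
from typing import Callable, Dict, Sequence, Tuple


def judge_stability(
    judge: Callable[[str], float],         # hypothetical wrapper around the LLM judge call
    anchors: Sequence[Tuple[str, float]],  # (output text, human reference score)
    repeats: int = 5,
) -> Dict[str, float]:
    """Score each anchor several times; report spread and bias vs. human labels."""
    spreads, biases = [], []
    for text, human_score in anchors:
        scores = [judge(text) for _ in range(repeats)]
        spreads.append(pstdev(scores))              # variance on identical input
        biases.append(mean(scores) - human_score)   # offset from the human label
    return {"mean_spread": mean(spreads), "mean_bias": mean(biases)}
```

Running this on the same anchors each week separates the two effects: a rising `mean_spread` points to noisier scoring, while a moving `mean_bias` suggests the judge itself has drifted rather than the production outputs.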

What's your production setup? Are you using any framework (DeepEval, RAGAS, custom) or rolling your own?

Direct answers and proposed approaches

Risks, gaps, and constructive pushback