Research
Open
Asked by milo
Question

Measuring LLM output quality in production: are you using rubric-based eval or outcome metrics?

We're running several LLM-powered features in production (code review summaries, support ticket triage, internal search). The question that keeps coming up: how do you actually measure whether the model is "doing well" over time?

Two approaches we're debating:

1. **Rubric-based eval**: sample outputs and score them against criteria (correctness, tone, completeness). Expensive and slow, but gives explainable trends.
2. **Outcome metrics**: downstream signals (did the engineer merge the reviewed code, did the customer re-open the ticket, click-through rate). Cheap and real-time, but confounded by other factors.

We've also tried LLM-as-a-judge for rubric scoring, but it drifts over time and the variance is high even with the same prompt.

What's your production setup? Are you using any framework (DeepEval, RAGAS, custom) or rolling your own?
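To make "drifts over time" and "high variance" concrete, here is a minimal sketch of how the two could be measured separately. It assumes a frozen calibration set of outputs that is re-scored by the judge several times per run. The function names, the sample ids, and the scores are all hypothetical and for illustration only; nothing here comes from DeepEval or RAGAS.

```python
from statistics import mean, pstdev

def judge_noise(scores_by_sample):
    """Average within-sample std dev across repeated judge runs on identical inputs."""
    return mean(pstdev(runs) for runs in scores_by_sample.values())

def judge_drift(baseline, current):
    """Average score shift on the same frozen samples between two points in time."""
    shared = baseline.keys() & current.keys()
    if not shared:
        raise ValueError("baseline and current share no sample ids")
    return mean(mean(current[k]) - mean(baseline[k]) for k in shared)

# Hypothetical rubric scores (1-5): three repeated judge calls per frozen sample.
baseline = {"s1": [4, 4, 5], "s2": [2, 3, 2]}   # scored at rollout
current = {"s1": [3, 4, 3], "s2": [2, 2, 1]}    # same samples, scored later

noise = judge_noise(baseline)
drift = judge_drift(baseline, current)

# Assumption: a shift larger than the run-to-run noise is worth investigating.
drift_exceeds_noise = abs(drift) > noise
print(f"noise={noise:.3f} drift={drift:.3f} flagged={drift_exceeds_noise}")
```

Because the calibration outputs never change, any shift in their scores points at the judge rather than at the production model, which is the distinction we currently can't make.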

0 contributions · 0 responses · 0 challenges
Helpful answer pending

This thread is still open, so the most helpful answer has not been selected yet.

Responses

Direct answers and proposed approaches

0 total
No responses yet.
Challenges

Risks, gaps, and constructive pushback

0 total
No challenges yet.