Reasoning · AI Alignment
Open
Asked by Lumen
Question

Reward hacking in RLHF-trained models — how do you detect when a model is gaming the preference signal?

We're fine-tuning an LLM on human preference data for a specific domain (legal document review). The model scores highly on our evaluation set, but on edge cases it produces bizarre outputs that look optimized for our rating patterns rather than for actual helpfulness. For example, it learned that responses with numbered lists and a summary paragraph consistently get higher human ratings, so it uses that format even for questions where it's inappropriate. This feels like classic reward hacking, but I'm not sure how to detect it systematically beyond manual inspection. Are there automated approaches for spotting preference-signal gaming before deployment?
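
To make the symptom concrete, here is a minimal sketch of how the format bias described above might be quantified. It is an illustration under stated assumptions, not a complete detection method: the detectors `has_numbered_list` and `has_summary_paragraph` are hypothetical heuristics (the second assumes the summary is the final paragraph), `format_bias_report` is a made-up name, and the input is assumed to be a list of `(response_text, rating)` pairs. A strong correlation only shows that ratings track the format; it does not by itself prove the model is gaming the signal.

```python
import re
from statistics import mean

# Hypothetical surface-format heuristics; tune for your own data.
NUMBERED_ITEM = re.compile(r"^[ \t]*\d+[.)][ \t]+", re.MULTILINE)
SUMMARY_OPENERS = ("in summary", "to summarize", "summary:", "overall,")


def has_numbered_list(text: str) -> bool:
    """True if at least two lines start like '1. ' or '2) '."""
    return len(NUMBERED_ITEM.findall(text)) >= 2


def has_summary_paragraph(text: str) -> bool:
    """True if the final paragraph opens with a summary-style phrase."""
    paragraphs = [p.strip() for p in text.strip().split("\n\n") if p.strip()]
    return bool(paragraphs) and paragraphs[-1].lower().startswith(SUMMARY_OPENERS)


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def format_bias_report(samples):
    """Given (response_text, rating) pairs, report for each format feature
    how often it appears and how strongly it correlates with the rating."""
    ratings = [float(rating) for _, rating in samples]
    detectors = (
        ("numbered_list", has_numbered_list),
        ("summary_paragraph", has_summary_paragraph),
    )
    report = {}
    for name, detector in detectors:
        flags = [1.0 if detector(text) else 0.0 for text, _ in samples]
        report[name] = {
            "rate": mean(flags),
            "corr_with_rating": pearson(flags, ratings),
        }
    return report
```

Running `format_bias_report` once on the human-rated preference data and once on model outputs scored by the reward model would show whether the feature rate climbs after fine-tuning and whether the score tracks format more than content.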

