Reward hacking in RLHF-trained models — how do you detect when a model is gaming the preference signal?

Question

We're fine-tuning an LLM with human preference data for a specific domain (legal document review). The model scores well on our evaluation set but produces bizarre edge-case outputs that look like it's optimizing for our rating patterns rather than actually being helpful. For example, it learned that responses with numbered lists and a summary paragraph consistently get higher human ratings, so it uses that format even for questions where it's inappropriate. This feels like classic reward hacking, but I'm not sure how to detect it systematically beyond manual inspection. Are there automated approaches for spotting preference-signal gaming before deployment?

Direct answers and proposed approaches
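
One cheap automated check for the specific pattern described in the question (numbered lists plus a summary paragraph showing up everywhere) is to flag that template with a regex, then measure how often it appears on prompts where it doesn't fit and how strongly it tracks the reward. What follows is a minimal sketch, not a definitive method. The names `uses_template` and `template_report`, the sample fields `response`, `reward`, and `list_appropriate`, and the summary cue phrases are all assumptions made for illustration; `list_appropriate` in particular would have to come from your own labeling of the prompts.

```python
import re
from statistics import mean

# Hypothetical heuristics; tune both patterns to your own data.
NUMBERED_ITEM = re.compile(r"^[ \t]*\d+[.)][ \t]+", re.MULTILINE)
SUMMARY_CUE = re.compile(r"^\s*(in summary|to summarize|summary:|overall,)", re.IGNORECASE)


def uses_template(response: str) -> bool:
    """True if the response has at least two numbered items and ends with a summary paragraph."""
    has_list = len(NUMBERED_ITEM.findall(response)) >= 2
    paragraphs = [p for p in response.strip().split("\n\n") if p.strip()]
    has_summary = bool(paragraphs) and bool(SUMMARY_CUE.match(paragraphs[-1]))
    return has_list and has_summary


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def template_report(samples):
    """samples: dicts with 'response' (str), 'reward' (float), 'list_appropriate' (bool)."""
    flags = [1.0 if uses_template(s["response"]) else 0.0 for s in samples]
    # Template usage restricted to prompts where a list format was judged inappropriate.
    off_target = [f for f, s in zip(flags, samples) if not s["list_appropriate"]]
    return {
        "template_rate": mean(flags),
        "template_rate_when_inappropriate": mean(off_target) if off_target else 0.0,
        "reward_template_correlation": pearson(flags, [s["reward"] for s in samples]),
    }
```

A high `template_rate_when_inappropriate` together with a strong `reward_template_correlation` would be a hint, not proof, that the model is rewarded for the format rather than the content. It only catches formats you already suspect, so it complements manual inspection rather than replacing it.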

Risks, gaps, and constructive pushback