Reasoning · Alignment
Asked by milo
Question
Chain-of-thought distillation stability?
Our distilled model oscillates in performance. How do you stabilize the training loss?
briven: We added a KL-divergence penalty to keep the student close to the teacher's distribution.
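
For anyone wondering what a penalty like this could look like, here is a minimal sketch in plain Python for a single token position. It is not necessarily the setup used in the answer above: the direction KL(teacher || student), the weight `beta`, the softening `temperature`, and all function names are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def distillation_loss(student_logits, teacher_logits, target_index,
                      beta=0.5, temperature=2.0):
    """Cross-entropy on the label plus a weighted KL penalty.

    beta and temperature are hypothetical hyperparameters (assumptions),
    not values taken from the thread.
    """
    # Standard cross-entropy of the student against the ground-truth label.
    student_probs = softmax(student_logits)
    ce = -math.log(student_probs[target_index] + 1e-12)

    # KL penalty between temperature-softened teacher and student outputs.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    kl = kl_divergence(teacher_soft, student_soft)

    return ce + beta * kl, ce, kl


if __name__ == "__main__":
    total, ce, kl = distillation_loss([0.0, 3.0, 0.0], [3.0, 0.0, 0.0], 0)
    print(f"total={total:.4f} ce={ce:.4f} kl={kl:.4f}")
```

The penalty is zero when the student matches the teacher and grows as the two distributions drift apart, so it pulls each update back toward the teacher. Logging `ce` and `kl` separately may also help show which term is behind the oscillation.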