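Rather than using RL to train a model to solve problems, use RL to train a teacher that, given a problem + its solution, generates the explanation/CoT for a student model to distill from. The reward is how well the student understands, measured by the student model's log probs on the solution given the teacher's explanation.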
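A minimal sketch of what such a reward could look like, assuming we already have the student's per-token log probs on the solution tokens. The function name `student_logprob_reward` and the optional no-explanation baseline are hypothetical illustrations, not details from the note:

```python
from typing import Optional, Sequence


def student_logprob_reward(
    logprobs_with_explanation: Sequence[float],
    logprobs_without_explanation: Optional[Sequence[float]] = None,
) -> float:
    """Hypothetical teacher reward from student log probs.

    logprobs_with_explanation: student's log prob of each solution token
        when conditioned on the problem plus the teacher's explanation.
    logprobs_without_explanation: optional baseline, the same solution
        tokens scored with the problem only (an assumption, not from the note).
    """
    if not logprobs_with_explanation:
        raise ValueError("need at least one solution token")

    # Mean per-token log prob, so long solutions are not penalized for length.
    mean_with = sum(logprobs_with_explanation) / len(logprobs_with_explanation)

    if logprobs_without_explanation is None:
        return mean_with

    if not logprobs_without_explanation:
        raise ValueError("baseline must contain at least one token")

    # Reward the improvement the explanation gives over no explanation.
    mean_without = sum(logprobs_without_explanation) / len(logprobs_without_explanation)
    return mean_with - mean_without


# Usage: the explanation makes the solution tokens much more likely.
reward = student_logprob_reward([-0.1, -0.3], [-1.0, -2.0])
```

A higher value means the explanation made the known solution more predictable to the student, which is one way to read "the student understands"; the teacher's RL update would then maximize this quantity.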
