Privileged Information Distillation for Language Models (π-Distill)

In our work, we show that variational EM is inefficient and less effective. A simple solution is to let the teacher and student share parameters and train them jointly, which greatly simplifies the training process.

For π-Distill, we take a simple approach and train the teacher and student jointly with a single objective:

In our work, we use the same objective for the student that we use for the teacher, with an added importance sampling term to improve the student:

This is efficient because we only need to sample once, from the teacher, and can improve the teacher and the student simultaneously.
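
To make the "sample once from the teacher" idea concrete, here is a minimal toy sketch in plain Python. It is not the π-Distill objective itself. It assumes a single categorical policy whose logits are shared by teacher and student, a hypothetical `privileged_bias` that stands in for the extra information only the teacher sees, and a plain REINFORCE-style surrogate. The names `joint_surrogate`, `privileged_bias`, and `reward_fn` are illustrative only.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint_surrogate(shared_logits, privileged_bias, reward_fn,
                    n_samples=1000, seed=0):
    """Toy joint teacher/student surrogate (an assumption, not the paper's loss).

    Returns (teacher_term, student_term), both averaged over samples
    drawn from the teacher only.
    """
    rng = random.Random(seed)
    # Shared parameters: the student sees only the shared logits; the
    # teacher sees the same logits plus a privileged-information bias.
    student_probs = softmax(shared_logits)
    teacher_probs = softmax(
        [l + b for l, b in zip(shared_logits, privileged_bias)]
    )
    actions = list(range(len(shared_logits)))

    teacher_term = 0.0
    student_term = 0.0
    for _ in range(n_samples):
        # Sample once, from the teacher.
        y = rng.choices(actions, weights=teacher_probs)[0]
        r = reward_fn(y)
        # Teacher: on-policy surrogate r * log pi_T(y).
        teacher_term += r * math.log(teacher_probs[y])
        # Student: the same sample, reweighted by pi_S(y) / pi_T(y).
        ratio = student_probs[y] / teacher_probs[y]
        student_term += ratio * r
    return teacher_term / n_samples, student_term / n_samples


if __name__ == "__main__":
    t, s = joint_surrogate(
        shared_logits=[0.0, 0.0, 0.0],
        privileged_bias=[0.0, 0.0, 2.0],  # teacher is nudged toward action 2
        reward_fn=lambda y: 1.0 if y == 2 else 0.0,
        n_samples=20000,
    )
    print("teacher term:", t)
    print("student term:", s)
```

In this toy setting the importance weight makes the student term an unbiased estimate of the student's own expected reward, even though every sample came from the teacher. With uniform shared logits and a reward on action 2 only, the student term should land near 1/3. In a real implementation both terms would be differentiated with respect to the shared parameters, and choices such as clipping or detaching the ratio are design decisions this sketch does not cover.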

We even find that training only the student on samples from the teacher, or training only the teacher, both yield improvements. But joint training is by far the most consistent.

We note that a lot of recent work can be categorized as teacher training under our framework:

iGRPO: Self-Feedback-Driven LLM Reasoning (https://arxiv.org/abs/2602.09000)
Expanding the Capabilities of Reinforcement Learning via Text Feedback (https://arxiv.org/pdf/2602.02482)
Experiential Reinforcement Learning (https://arxiv.org/abs/2602.13949)