https://arxiv.org/pdf/2602.02482
Standard RL post-training gives the model a single scalar reward (0/1) per rollout — almost no information about what went wrong. The paper's insight: during training, you can also collect natural-language critiques (from humans, LLM judges, compilers, etc.) that say why the answer is wrong. But at test time, no critique is available — the model must get it right on the first try.
So the question is: how do you use training-time text feedback to improve single-turn (no-feedback) test performance?
The paper proposes two methods, both built on a two-turn rollout:
x₀ (prompt) → π → y₀ (first attempt) → r₀, c₀ (reward + text critique)
x₁ = [x₀, y₀, c₀] → π → y₁ (revised attempt) → r₁
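The rollout above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `policy`, `critic`, and `reward` are stand-in callables, and `x₁` is formed by simple string concatenation.

```python
# Hypothetical sketch of the two-turn rollout; policy/critic/reward are
# stand-ins for the actual model, critique source, and reward function.
def two_turn_rollout(policy, critic, reward, x0):
    y0 = policy(x0)       # first attempt: y0 ~ pi(.|x0)
    r0 = reward(x0, y0)   # scalar reward (e.g. 0/1)
    c0 = critic(x0, y0)   # natural-language critique of y0
    x1 = x0 + y0 + c0     # feedback-conditioned prompt [x0, y0, c0]
    y1 = policy(x1)       # revised attempt: y1 ~ pi(.|x1)
    r1 = reward(x0, y1)   # y1 is scored as an answer to the ORIGINAL x0
    return y0, r0, c0, y1, r1
```

Note that `r1` scores `y1` against the original prompt `x0`, since the goal is first-turn performance.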
Idea: The model conditioned on feedback (π(·|x₁)) is a better policy than the model without it (π(·|x₀)). Use the feedback-conditioned model as a teacher to distill into the unconditioned model.
Concrete procedure:
1. Sample y₀ ~ π(·|x₀); get critique c₀; form x₁ = concat(x₀, y₀, c₀).
2. Sample y₁ ~ π(·|x₁).
3. Train π(·|x₀) to produce y₁, i.e., compute the loss on (x₀, y₁), not (x₁, y₁).

The objective (after their bias-variance analysis) is an advantage-weighted regression:
$$\nabla \ell_{\text{distill}} = \mathbb{E}_{y_1 \sim \pi(\cdot|x_1)}\big[A^{(0)}(y_1) \cdot \nabla \log \pi(y_1 | x_0)\big]$$
with the first-turn baseline:
$$A^{(0)}_i = R(x_0, y_1^i) - \frac{1}{N}\sum_j R(x_0, y_0^j)$$
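A numpy sketch of this objective, in my own notation rather than the authors' code: each second-turn sample y₁ⁱ is baselined against the mean first-turn reward, and the resulting advantage weights the log-probability of y₁ under π(·|x₀).

```python
import numpy as np

def awr_advantages(r1, r0):
    """A_i^(0) = R(x0, y1_i) - (1/N) * sum_j R(x0, y0_j)."""
    return np.asarray(r1, dtype=float) - np.mean(r0)

def distill_loss(logp_y1_given_x0, r1, r0):
    # Loss whose gradient matches E[A^(0)(y1) * grad log pi(y1|x0)];
    # advantages are treated as constants, no importance weighting.
    A = awr_advantages(r1, r0)
    return -np.mean(A * np.asarray(logp_y1_given_x0))
```

In a real training loop `logp_y1_given_x0` would come from re-scoring the sampled y₁ tokens under the unconditioned model; here it is just an array.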
Two key design choices (the paper's main technical contribution for SD):
- Baseline on first-turn rewards. Baselining against the mean of {r₁} collapses: when feedback makes second-turn answers reliably correct (p₁ → 1), all advantages go to 0 and there is no gradient, even though the first-turn policy is still bad. Using the mean of {r₀} as the baseline keeps the gradient alive as long as the student is imperfect.
- No importance weighting. The importance ratio π(y₁|x₀)/π(y₁|x₁) is unbiased but has explosive variance over long token sequences. They find AWR (no importance weighting, accepting bias) beats both full IS and clipped IS (CISPO).
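The baseline-collapse argument is easy to see with toy numbers (mine, not the paper's): once critiques make every second attempt correct, r₁ is all ones, so a second-turn baseline zeroes every advantage, while the first-turn baseline still reflects how weak π(·|x₀) is.

```python
import numpy as np

# Toy rewards: the first-turn policy succeeds 1/4 of the time,
# but with feedback the second turn always succeeds (p1 -> 1).
r0 = np.array([0.0, 1.0, 0.0, 0.0])  # first-turn rewards
r1 = np.array([1.0, 1.0, 1.0, 1.0])  # second-turn rewards

adv_vs_r1 = r1 - r1.mean()  # all zeros: no learning signal left
adv_vs_r0 = r1 - r0.mean()  # all 0.75: gradient stays alive
```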