https://arxiv.org/pdf/2602.02482

The Problem Setup

Standard RL post-training gives the model a single scalar reward (0/1) per rollout — almost no information about what went wrong. The paper's insight: during training, you can also collect natural-language critiques (from humans, LLM judges, compilers, etc.) that say why the answer is wrong. But at test time, no critique is available — the model must get it right on the first try.

So the question is: how do you use training-time text feedback to improve single-turn (no-feedback) test performance?

The paper proposes two methods, both built on a two-turn rollout:

x₀ (prompt) → π → y₀ (first attempt) → r₀, c₀ (reward + text critique)
x₁ = [x₀, y₀, c₀] → π → y₁ (revised attempt) → r₁
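The two-turn rollout above can be sketched as a small function. Note this is an illustrative sketch, not the paper's code: `policy`, `get_critique`, and `reward` are hypothetical stand-ins for the sampling policy, the critique source (human, LLM judge, compiler), and the scalar reward oracle.

```python
def two_turn_rollout(policy, x0, get_critique, reward):
    """One two-turn rollout: attempt, critique, revise."""
    y0 = policy.sample(x0)          # first attempt y0 ~ pi(.|x0)
    r0 = reward(x0, y0)             # scalar reward for the first attempt
    c0 = get_critique(x0, y0)      # natural-language critique of y0
    x1 = x0 + y0 + c0               # feedback-conditioned context [x0, y0, c0]
    y1 = policy.sample(x1)          # revised attempt y1 ~ pi(.|x1)
    r1 = reward(x0, y1)             # revised attempt is still scored against x0
    return x0, y0, r0, c0, x1, y1, r1
```

At test time only the first call, `policy.sample(x0)`, is available; everything after it exists purely to generate training signal.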

Method 1: Self Distillation (RLTF-SD)

Idea: The model conditioned on feedback (π(·|x₁)) is a better policy than the model without it (π(·|x₀)). Use the feedback-conditioned model as a teacher to distill into the unconditioned model.

Concrete procedure:

  1. Sample y₀ ~ π(·|x₀), get critique c₀, form x₁ = concat(x₀, y₀, c₀)
  2. Sample revised answer y₁ ~ π(·|x₁)
  3. Train π(·|x₀) to produce y₁ — i.e., compute loss on (x₀, y₁), not (x₁, y₁)
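The three steps above can be condensed into a single training-step sketch. This is a hedged illustration, not the paper's implementation: `policy.sample`, `policy.log_prob`, and `advantage` are hypothetical interfaces, and the key line is that the log-probability of the revised answer `y1` is evaluated under the unconditioned context `x0`, not the feedback context `x1`.

```python
def sd_step(policy, x0, get_critique, advantage):
    """One self-distillation step: distill pi(.|x1) into pi(.|x0)."""
    y0 = policy.sample(x0)                    # step 1: first attempt
    c0 = get_critique(x0, y0)                 # step 1: collect critique
    x1 = x0 + y0 + c0                         # step 1: build feedback context
    y1 = policy.sample(x1)                    # step 2: teacher's revised answer
    # Step 3: loss is computed on (x0, y1), NOT (x1, y1), so the
    # single-turn policy learns to produce the revision directly.
    loss = -advantage(y1) * policy.log_prob(y1, x0)
    return loss
```

Minimizing this loss is the advantage-weighted regression given next: its gradient is `-A * grad log pi(y1|x0)`.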

The objective (after their bias-variance analysis) is an advantage-weighted regression:

$$\nabla \ell_{\text{distill}} = \mathbb{E}_{y_1 \sim \pi(\cdot|x_1)}\big[A^{(0)}(y_1) \cdot \nabla \log \pi(y_1 | x_0)\big]$$

with the first-turn baseline:

$$A^{(0)}(y_1^i) = R(x_0, y_1^i) - \frac{1}{N}\sum_{j=1}^{N} R(x_0, y_0^j)$$
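Concretely, the baseline is just the mean reward of the $N$ first attempts, subtracted from each revised attempt's reward. A minimal sketch (variable names are illustrative):

```python
def first_turn_advantages(rewards_y1, rewards_y0):
    """Advantage for each revised answer y1_i, baselined by the
    mean reward of the N first attempts y0_j."""
    baseline = sum(rewards_y0) / len(rewards_y0)
    return [r - baseline for r in rewards_y1]

# e.g. two revised answers scored 1 and 0, against first attempts
# that succeeded half the time (baseline 0.5):
first_turn_advantages([1.0, 0.0], [0.0, 1.0, 0.0, 1.0])  # → [0.5, -0.5]
```

Using the *first-turn* rewards as the baseline means a revision only gets positive advantage if it beats what the policy already achieved without feedback.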

Two key design choices (the paper's main technical contribution for SD):