https://arxiv.org/pdf/2602.02482
Standard RL post-training gives the model a single scalar reward (0/1) per rollout — almost no information about what went wrong. The paper's insight: during training, you can also collect natural-language critiques (from humans, LLM judges, compilers, etc.) that say why the answer is wrong. But at test time, no critique is available — the model must get it right on the first try.
So the question is: how do you use training-time text feedback to improve single-turn (no-feedback) test performance?
The paper proposes two methods, both built on a two-turn rollout:
x₀ (prompt) → π → y₀ (first attempt) → r₀, c₀ (reward + text critique)
x₁ = [x₀, y₀, c₀] → π → y₁ (revised attempt) → r₁
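The rollout above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `policy`, `critic`, and `reward` are stand-in callables, and `x₁` is formed by simple string concatenation.

```python
# Hypothetical sketch of the two-turn rollout; policy/critic/reward are
# stand-ins for the actual model, critique source, and reward function.
def two_turn_rollout(policy, critic, reward, x0):
    y0 = policy(x0)       # first attempt: y0 ~ pi(.|x0)
    r0 = reward(x0, y0)   # scalar reward (e.g. 0/1)
    c0 = critic(x0, y0)   # natural-language critique of y0
    x1 = x0 + y0 + c0     # feedback-conditioned prompt [x0, y0, c0]
    y1 = policy(x1)       # revised attempt: y1 ~ pi(.|x1)
    r1 = reward(x0, y1)   # y1 is scored as an answer to the ORIGINAL x0
    return y0, r0, c0, y1, r1
```

Note that `r1` scores `y1` against the original prompt `x0`, since the goal is first-turn performance.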
Idea: The model conditioned on feedback (π(·|x₁)) is a better policy than the model without it (π(·|x₀)). Use the feedback-conditioned model as a teacher to distill into the unconditioned model.
Concrete procedure:
1. Sample y₀ ~ π(·|x₀); get critique c₀; form x₁ = concat(x₀, y₀, c₀).
2. Sample y₁ ~ π(·|x₁).
3. Train π(·|x₀) to produce y₁, i.e., compute the loss on (x₀, y₁), not (x₁, y₁).

The objective (after their bias-variance analysis) is an advantage-weighted regression:
$$\nabla \ell_{\text{distill}} = \mathbb{E}_{y_1 \sim \pi(\cdot|x_1)}\big[A^{(0)}(y_1) \cdot \nabla \log \pi(y_1 | x_0)\big]$$
with the first-turn baseline:
$$A^{(0)}_i = R(x_0, y_1^i) - \frac{1}{N}\sum_j R(x_0, y_0^j)$$
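A numpy sketch of this objective, in my own notation rather than the authors' code: each second-turn sample y₁ⁱ is baselined against the mean first-turn reward, and the resulting advantage weights the log-probability of y₁ under π(·|x₀).

```python
import numpy as np

def awr_advantages(r1, r0):
    """A_i^(0) = R(x0, y1_i) - (1/N) * sum_j R(x0, y0_j)."""
    return np.asarray(r1, dtype=float) - np.mean(r0)

def distill_loss(logp_y1_given_x0, r1, r0):
    # Loss whose gradient matches E[A^(0)(y1) * grad log pi(y1|x0)];
    # advantages are treated as constants, no importance weighting.
    A = awr_advantages(r1, r0)
    return -np.mean(A * np.asarray(logp_y1_given_x0))
```

In a real training loop `logp_y1_given_x0` would come from re-scoring the sampled y₁ tokens under the unconditioned model; here it is just an array.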
Two key design choices (the paper's main technical contribution for SD):
- Baseline on first-turn rewards. Baselining against the mean of {r₁} collapses: when feedback makes second-turn answers reliably correct (p₁ → 1), all advantages go to 0 and there is no gradient, even though the first-turn policy is still bad. Using the mean of {r₀} as the baseline keeps the gradient alive as long as the student is imperfect.
- No importance weighting. The importance ratio π(y₁|x₀)/π(y₁|x₁) is unbiased but has explosive variance over long token sequences. They find AWR (no importance weighting, accepting bias) beats both full IS and clipped IS (CISPO).
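The baseline-collapse argument is easy to see with toy numbers (mine, not the paper's): once critiques make every second attempt correct, r₁ is all ones, so a second-turn baseline zeroes every advantage, while the first-turn baseline still reflects how weak π(·|x₀) is.

```python
import numpy as np

# Toy rewards: the first-turn policy succeeds 1/4 of the time,
# but with feedback the second turn always succeeds (p1 -> 1).
r0 = np.array([0.0, 1.0, 0.0, 0.0])  # first-turn rewards
r1 = np.array([1.0, 1.0, 1.0, 1.0])  # second-turn rewards

adv_vs_r1 = r1 - r1.mean()  # all zeros: no learning signal left
adv_vs_r0 = r1 - r0.mean()  # all 0.75: gradient stays alive
```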