https://arxiv.org/pdf/2503.14286

Claude

Why Existing Methods Fail (Concretely)

Naive REINFORCE: For a wrong answer, the gradient pushes log π(τ) → −∞. On-policy, this self-corrects because once a sequence becomes unlikely you stop sampling it. Off-policy, the wrong answer stays in your dataset forever, and the gradient keeps hammering its probability toward zero. The logits blow up and the model starts outputting garbage (the paper shows ~100% of generations become malformed).
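A toy sketch of that failure mode (my own illustration, not from the paper): a 2-action softmax policy repeatedly updated off-policy on one fixed wrong sample. The probability of the bad action goes to zero, but the gradient never does, so the logit gap grows without bound.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy 2-action policy trained off-policy on one fixed wrong sample (action 0, R = -1).
logits = np.zeros(2)
for _ in range(200):
    p = softmax(logits)
    grad_log_pi = np.eye(2)[0] - p        # d log pi(0) / d logits
    logits += 1.0 * (-1.0) * grad_log_pi  # REINFORCE step: lr 1.0, reward -1

# pi(0) is already ~0, yet the gradient never vanishes: the logit gap keeps growing.
print(logits[1] - logits[0])  # hundreds after only 200 steps
print(softmax(logits)[0])     # numerically zero
```

On-policy, action 0 would stop being sampled once it became unlikely; off-policy, it is resampled from the dataset every step, which is exactly the divergence described above.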

SFT (positives only): Stable, but you throw away all the negative data. On hard problems where most samples are wrong, you're discarding 90% of your compute.

Full importance sampling: Reweight by π(τ)/μ(τ). Unbiased, but these ratios are products of hundreds of per-token ratios — the variance explodes.
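To see the variance explosion numerically (an illustrative simulation, with a hypothetical per-token mismatch, not numbers from the paper): even if every per-token ratio stays within ~10% of 1, the product over a few hundred tokens is wildly dispersed.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 500, 10000
# Hypothetical mild mismatch: each per-token log-ratio is N(0, 0.1),
# so each individual ratio pi_t/mu_t stays close to 1.
log_r = rng.normal(0.0, 0.1, size=(n, T))
seq_ratio = np.exp(log_r.sum(axis=1))  # product of T per-token ratios

print(np.exp(log_r).std())  # per-token ratios: std ~ 0.1
print(seq_ratio.std())      # sequence-level ratios: orders of magnitude larger
```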

PPO: Clips the objective, which zeros the gradient once π/μ drifts outside [1−ε, 1+ε] on the side the update would push it further. After a few updates, most of your dataset has zero gradient.
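A quick check of how much of a drifted dataset goes gradient-dead under the PPO-clip rule (the drifted-ratio distribution here is hypothetical, chosen only to illustrate the effect):

```python
import numpy as np

def ppo_grad_is_zero(ratio, adv, eps=0.2):
    # PPO-clip loss: -min(ratio * adv, clip(ratio, 1-eps, 1+eps) * adv).
    # The policy gradient is zero exactly when the clipped branch is the min:
    # adv > 0 with ratio > 1+eps, or adv < 0 with ratio < 1-eps.
    return (adv > 0 and ratio > 1 + eps) or (adv < 0 and ratio < 1 - eps)

# Hypothetical ratios after several off-policy updates have pushed pi away from mu.
rng = np.random.default_rng(0)
ratios = np.exp(rng.normal(0.0, 1.0, size=1000))
advs = rng.choice([-1.0, 1.0], size=1000)
frac_dead = np.mean([ppo_grad_is_zero(r, a) for r, a in zip(ratios, advs)])
print(frac_dead)  # a large share of examples contribute no gradient at all
```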

The TOPR Update Rule

Treat positive and negative examples asymmetrically:

$\nabla J_{\text{topr}} = \underbrace{\sum_{R(\tau) \geq 0} \mu(\tau)\, R(\tau)\, \nabla \log \pi(\tau)}_{\text{plain SFT on positives}} \;+\; \underbrace{\sum_{R(\tau) < 0} \mu(\tau)\, \mathrm{clip}\!\left(\tfrac{\pi(\tau)}{\mu(\tau)}, 0, 1\right) R(\tau)\, \nabla \log \pi(\tau)}_{\text{truncated IS on negatives}}$

In code (per example):

ratio = (pi(y|x) / mu(y|x)).clamp(0, 1) if R < 0 else 1.0  # positives keep weight 1 (plain SFT)
loss = -stop_grad(ratio) * R * log_pi(y|x)                 # ratio enters as a constant weight, no grad through it
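Expanding that one-liner into a self-contained sketch (the function name is mine; `log_pi`/`log_mu` are the sequence log-probabilities under π and μ):

```python
import numpy as np

def topr_weight(log_pi, log_mu, reward):
    """Constant weight multiplying reward * grad log pi for one sequence.

    Positives get weight 1 (plain SFT); negatives get the truncated
    importance ratio clip(pi/mu, 0, 1), so an already-unlikely wrong
    answer (pi << mu) contributes almost nothing to the update.
    """
    if reward >= 0:
        return 1.0
    return float(np.clip(np.exp(log_pi - log_mu), 0.0, 1.0))

print(topr_weight(-2.0, -1.0, -1.0))  # pi/mu = e^-1: negative is down-weighted
print(topr_weight(-1.0, -2.0, -1.0))  # pi/mu = e > 1 is clipped to 1.0
print(topr_weight(-5.0, -1.0, 2.0))   # positive: always 1.0, ratio ignored
```

This is exactly the asymmetry in the equation above: no importance correction on positives, and a one-sided (tapered) correction on negatives.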


What this achieves:

They prove the objective is bounded above (Prop 3.2), which is the formal statement of "won't collapse": as π(τ) → 0 on a negative example, the clipped ratio → 0, so that example's gradient fades out instead of hammering the logits forever.

The Surprising Finding About Baselines

Normally the REINFORCE baseline c is a variance-reduction trick. The paper shows that off-policy, it does something else: it controls the effective fraction of positive examples via $\tilde{p} = \frac{p(1-c)}{1 + (1-2p)c}$.
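Taking that formula at face value, here is a small numeric check (the specific (p, c) values are my own illustration):

```python
def effective_positive_fraction(p, c):
    # \tilde p = p(1 - c) / (1 + (1 - 2p) c), the formula quoted above:
    # p is the raw fraction of positive examples, c the REINFORCE baseline.
    return p * (1 - c) / (1 + (1 - 2 * p) * c)

print(effective_positive_fraction(0.1, 0.0))   # c = 0 leaves p unchanged
print(effective_positive_fraction(0.1, -0.5))  # 0.1*1.5 / 0.6 = 0.25: boosted
```

So on a hard problem set with only 10% positives, a baseline of −0.5 makes training behave as if 25% of examples were positive, a knob on the positive/negative balance rather than just a variance reducer.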