https://arxiv.org/pdf/2503.14286
Naive REINFORCE: For a wrong answer, the gradient pushes log π(τ) → −∞. On-policy, this self-corrects because once a sequence becomes unlikely you stop sampling it. Off-policy, the wrong answer stays in your dataset forever, and the gradient keeps hammering its probability toward zero. The logits blow up and the model starts outputting garbage (the paper shows ~100% of generations become malformed).
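The runaway dynamic is easy to reproduce in a toy two-action setting (my own illustration, not the paper's setup): a single "wrong" answer with reward −1 is replayed forever from a fixed dataset, and the naive REINFORCE update drives its logit down without bound:

```python
import math

def sigmoid(z):
    # numerically stable logistic
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Policy over {wrong, right} with a single logit z: pi(wrong) = sigmoid(z).
# The off-policy dataset contains one wrong answer (R = -1), replayed forever.
z, lr, R = 0.0, 0.5, -1.0
for _ in range(1000):
    pi = sigmoid(z)
    # REINFORCE ascent on R * log pi(wrong); d log pi / dz = 1 - pi.
    # As pi -> 0, the gradient stays ~1, so z keeps falling at a constant rate.
    z += lr * R * (1 - pi)

print(z)           # logit driven far below zero and still falling
print(sigmoid(z))  # pi(wrong) collapses toward 0, log pi(wrong) -> -inf
```

On-policy this self-limits: once pi(wrong) is tiny the sequence stops being sampled, so the gradient signal vanishes on its own.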
SFT (positives only): Stable, but you throw away all the negative data. On hard problems where most samples are wrong, you're discarding 90% of your compute.
Full importance sampling: Reweight by π(τ)/μ(τ). Unbiased, but these ratios are products of hundreds of per-token ratios — the variance explodes.
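A quick simulation (illustrative numbers, not from the paper) of why sequence-level ratios are hopeless: make each per-token ratio lognormal with mean exactly 1, multiply 200 of them, and the product still has mean 1 but is almost always near zero, with rare enormous spikes:

```python
import math
import random

random.seed(0)
sigma, T, n = 0.15, 200, 10_000

def seq_ratio():
    # Per-token log-ratio ~ N(-sigma^2/2, sigma^2), so each token ratio has
    # mean exactly 1 and the T-token product also has mean 1; its variance,
    # however, is exp(sigma^2 * T) - 1, which grows exponentially in T.
    return math.exp(sum(random.gauss(-sigma**2 / 2, sigma) for _ in range(T)))

samples = [seq_ratio() for _ in range(n)]
mean = sum(samples) / n
frac_tiny = sum(s < 0.1 for s in samples) / n
print(mean)          # close to 1 on average...
print(frac_tiny)     # ...yet a big fraction of sequence ratios sit below 0.1
print(max(samples))  # while a handful are enormous
```

In practice the gradient estimate is dominated by those rare huge-ratio samples, which is exactly what "variance explodes" means here.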
PPO: Clips the objective, which zeros the gradient once π/μ leaves [1−ε, 1+ε]. After a few updates, most of your dataset has zero gradient.
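The dead-gradient effect can be checked directly with a finite difference on the standard PPO clipped surrogate (ε = 0.2 is the usual default, not a value from the paper):

```python
EPS = 0.2  # standard PPO clip range (illustrative choice)

def ppo_surrogate(ratio, adv):
    clipped = max(1 - EPS, min(1 + EPS, ratio))
    return min(ratio * adv, clipped * adv)

def grad_wrt_ratio(ratio, adv, h=1e-6):
    # central finite difference of the surrogate in the ratio
    return (ppo_surrogate(ratio + h, adv) - ppo_surrogate(ratio - h, adv)) / (2 * h)

print(grad_wrt_ratio(1.0, adv=1.0))   # inside the trust region: gradient flows
print(grad_wrt_ratio(1.5, adv=1.0))   # positive adv, ratio > 1+eps: exactly 0
print(grad_wrt_ratio(0.5, adv=-1.0))  # negative adv, ratio < 1-eps: exactly 0
```

Off-policy, ratios drift away from 1 between policy refreshes, so more and more of the dataset lands in the zero-gradient region.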
Treat positive and negative examples asymmetrically:
$\nabla J_{\text{topr}} = \underbrace{\sum_{R(\tau) \geq 0} \mu(\tau)\, R(\tau)\, \nabla \log \pi(\tau)}_{\text{plain SFT on positives}} \;+\; \underbrace{\sum_{R(\tau) < 0} \mu(\tau)\, \operatorname{clip}\!\left(\tfrac{\pi(\tau)}{\mu(\tau)},\, 0,\, 1\right) R(\tau)\, \nabla \log \pi(\tau)}_{\text{truncated IS on negatives}}$
In code (per example):
ratio = (pi(y|x) / mu(y|x)).clamp(0, 1) if R < 0 else 1.0
loss = -stop_grad(ratio) * R * log_pi(y|x)
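Expanding the two-liner into a self-contained sketch (function and argument names are mine, not the paper's), showing the three regimes the equation encodes:

```python
import math

def topr_weight(log_pi, log_mu, reward):
    """Multiplier on reward * grad log pi for one sequence; treated as a
    constant (stop-gradient), so it scales but never redirects the update."""
    if reward >= 0:
        return 1.0                     # positives: plain SFT weight
    ratio = math.exp(log_pi - log_mu)  # pi(tau) / mu(tau)
    return max(0.0, min(1.0, ratio))   # negatives: truncated IS, clipped to [0, 1]

# Negative example the model has already unlearned: pi << mu, weight fades to ~0.
print(topr_weight(log_pi=-20.0, log_mu=-5.0, reward=-1.0))
# Negative example the model still likes: pi > mu, weight saturates at 1, no blow-up.
print(topr_weight(log_pi=-3.0, log_mu=-5.0, reward=-1.0))
# Positive example: weight 1 regardless of the ratio.
print(topr_weight(log_pi=-20.0, log_mu=-5.0, reward=1.0))
```

Note the asymmetry: unlike PPO's two-sided clip, the upper clip at 1 caps the weight rather than zeroing the gradient, so heavily-favored wrong answers still get pushed down.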

What this achieves:
For a negative example, as π(τ) drops the clipped ratio π(τ)/μ(τ) shrinks toward 0, so its gradient fades out. Once the model has "unlearned" a bad answer, that example simply stops contributing: no infinite pushing. They also prove the objective is bounded above (Prop 3.2), which is the formal statement of "won't collapse."
Normally the REINFORCE baseline c is just a variance-reduction trick. The paper shows that, off-policy, it does something else: it controls the effective fraction of positive examples via $\tilde{p} = \frac{p(1-c)}{1 + (1-2p)c}$.
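Plugging numbers into the formula as written (the helper name and the sample values of p and c are mine):

```python
def effective_positive_fraction(p, c):
    # p-tilde = p(1-c) / (1 + (1-2p)c), evaluated exactly as stated
    return p * (1 - c) / (1 + (1 - 2 * p) * c)

print(effective_positive_fraction(0.5, 0.0))   # c = 0 leaves p unchanged: 0.5
print(effective_positive_fraction(0.5, 0.5))   # raising c shrinks the positive share
print(effective_positive_fraction(0.1, -0.5))  # hard problems (p = 0.1): negative c raises it
```

So tuning c becomes a knob for rebalancing positives against negatives, on top of (or instead of) its usual variance-reduction role.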