https://arxiv.org/html/2510.20817v1

GX-Chen, Prakash, Guo, Fergus, Ranganath (NYU, 2025)

This paper makes a sharp, contrarian claim: when LLMs lose diversity after RLHF/RLVR, it's not a bug in the optimizer — it's the correct answer to the objective you wrote down. Let me walk you through it concretely.


1. The setup: KL-regularized RL

When we post-train an LLM (RLHF, GRPO, REINVENT for molecules, etc.), we maximize reward while staying close to a reference model $\pi_{\text{ref}}$:

$$J_\beta(\pi_\theta) = \mathbb{E}_{y \sim \pi_\theta}[R(y)] - \beta \, D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})$$

This is the reverse-KL objective (the standard one in PPO/RLHF). $\beta$ controls how hard we cling to the base model.
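To make $\beta$'s role concrete, here is a minimal numerical sketch — the four completions, their rewards, and the uniform reference are all made up for illustration — comparing the objective for a spread-out policy versus a collapsed one:

```python
import numpy as np

# Toy problem: 4 completions with fixed rewards and a uniform reference model.
# These numbers are invented purely for illustration.
rewards = np.array([1.0, 0.8, 0.2, 0.0])
pi_ref = np.full(4, 0.25)

def J_beta(pi, beta):
    """KL-regularized objective: E_{y~pi}[R(y)] - beta * KL(pi || pi_ref)."""
    return pi @ rewards - beta * np.sum(pi * np.log(pi / pi_ref))

uniform = pi_ref.copy()
greedy = np.array([1 - 3e-9, 1e-9, 1e-9, 1e-9])  # ~all mass on the best y

# Large beta: staying close to pi_ref wins. Small beta: collapsing wins.
print(J_beta(uniform, 0.5), J_beta(greedy, 0.5))    # uniform scores higher
print(J_beta(uniform, 0.05), J_beta(greedy, 0.05))  # greedy scores higher
```

At $\beta = 0.5$ the KL penalty makes the diverse policy optimal; at $\beta = 0.05$ the collapsed policy wins. The trade-off is set entirely by $\beta$, before any optimizer enters the picture.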


2. The folk belief this paper kills

The standard intuition from variational inference says: reverse KL is "mode-seeking" (collapses to one peak), forward KL is "mass-covering" (spreads out). So people assume: "My RLHF'd model lost diversity because reverse KL is mode-seeking. I should switch to forward KL."

The paper shows this is wrong: in KL-regularized RL the regularizer determines both the target distribution's shape and the divergence being minimized, so its role is more nuanced than generic mode-seeking vs mass-covering. The real culprit is what the optimal distribution looks like — and that's determined by $\beta$, the rewards, and $\pi_{\text{ref}}$, not by which KL you pick.


3. The core math: what does the optimal policy actually look like?

The globally optimal solution to the reverse-KL objective has a closed form (a Boltzmann distribution):

$$\pi^*(y) = G_\beta(y) = \frac{1}{Z}\, \pi_{\text{ref}}(y)\, \exp\!\left(\frac{R(y)}{\beta}\right)$$

And — crucially — optimizing $J_\beta$ is exactly equivalent to minimizing $D_{\text{KL}}(\pi_\theta \,\|\, G_\beta)$. So RL is just distribution matching to $G_\beta$.
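The equivalence is easy to check numerically. The sketch below (made-up rewards and reference, same toy setup as above) builds $G_\beta$ in closed form and verifies the identity $J_\beta(\pi) = \beta \log Z - \beta\, D_{\text{KL}}(\pi \,\|\, G_\beta)$, which shows that maximizing $J_\beta$ and minimizing the KL to $G_\beta$ are the same problem:

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = np.array([1.0, 0.8, 0.2, 0.0])   # toy rewards (assumed)
pi_ref = np.array([0.4, 0.3, 0.2, 0.1])    # toy reference model (assumed)
beta = 0.3

# Closed-form optimum: Boltzmann reweighting of pi_ref.
w = pi_ref * np.exp(rewards / beta)
G = w / w.sum()
logZ = np.log(w.sum())

def J(pi):
    """The KL-regularized objective J_beta."""
    return pi @ rewards - beta * np.sum(pi * np.log(pi / pi_ref))

def KL(p, q):
    return np.sum(p * np.log(p / q))

# Identity: J(pi) = beta*logZ - beta*KL(pi || G), so argmax J is G itself.
pi = rng.dirichlet(np.ones(4))  # an arbitrary policy
print(np.isclose(J(pi), beta * logZ - beta * KL(pi, G)))  # True
print(J(G) >= J(pi))                                      # True
```

Since $\beta \log Z$ is a constant, any policy that maximizes $J_\beta$ must drive the KL term to zero, i.e. match $G_\beta$ exactly.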

For the forward-KL objective the optimal solution is a different family, $G_{\text{fwd}}(y) = \frac{\beta\,\pi_{\text{ref}}(y)}{\Lambda - R(y)}$, where $\Lambda$ is a normalizing constant. Different shape, but the same disease, as we'll see.
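Here is a sketch of how one might compute $G_{\text{fwd}}$ on the same toy problem (the bisection solve for $\Lambda$ is my own illustration, not the paper's code): since the density must sum to one, $\Lambda$ is pinned down by a one-dimensional root-find over $\Lambda > \max_y R(y)$, and as $\beta \to 0$ this optimum collapses too:

```python
import numpy as np

rewards = np.array([1.0, 0.8, 0.2, 0.0])  # same toy rewards (assumed)
pi_ref = np.full(4, 0.25)

def G_fwd(beta):
    """Forward-KL optimum: solve for Lambda > max(R) so the density normalizes."""
    def mass(lam):
        return np.sum(beta * pi_ref / (lam - rewards))
    lo, hi = rewards.max() + 1e-12, rewards.max() + 1e6
    for _ in range(200):  # bisection: mass(lam) is decreasing in lam
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if mass(mid) > 1.0 else (lo, mid)
    lam = 0.5 * (lo + hi)
    return beta * pi_ref / (lam - rewards)

def entropy(p):
    return -np.sum(p * np.log(p))

# Same disease: as beta -> 0, Lambda is forced down toward max(R) and the
# forward-KL optimum also piles its mass onto the argmax-reward completion.
for beta in [1.0, 0.1, 0.001]:
    print(beta, entropy(G_fwd(beta)).round(3))
```

The entropy of $G_{\text{fwd}}$ shrinks with $\beta$ just like the entropy of $G_\beta$ does — switching divergences changes the shape of the optimum, not the collapse.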

The key reframe: Stop asking "does my optimizer mode-seek?" Start asking "is $G_\beta$ itself multimodal?" If $G_\beta$ has only one peak, perfect optimization will give you a collapsed model.
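This reframe is easy to probe numerically. With a unique best-reward completion (toy values, assumed), the target $G_\beta$ itself loses entropy as $\beta$ shrinks — so even a perfect optimizer, with no mode-seeking pathology at all, would land on a collapsed policy:

```python
import numpy as np

rewards = np.array([1.0, 0.8, 0.2, 0.0])  # toy rewards with a unique best (assumed)
pi_ref = np.full(4, 0.25)

def G(beta):
    """Boltzmann target G_beta: pi_ref reweighted by exp(R / beta)."""
    w = pi_ref * np.exp(rewards / beta)
    return w / w.sum()

def entropy(p):
    return -np.sum(p * np.log(p))

# The diversity of the *target* is set by beta, not by the optimizer:
# as beta -> 0, G_beta converges to a point mass on the argmax-reward y.
for beta in [1.0, 0.1, 0.01]:
    print(beta, G(beta).round(3), entropy(G(beta)).round(3))
```

At $\beta = 1$ the target stays close to uniform; at $\beta = 0.01$ it is essentially a point mass. Diagnosing collapse means inspecting $G_\beta$, not swapping KL directions.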