Authors: Abdolmaleki et al., Google DeepMind — ICLR 2025
Methods like DPO require paired preference data: for each prompt, you need both a "good" response and a "bad" response to compare. But in many real settings you only have one kind of label — e.g., a dataset of only failures (unsafe outputs, code that didn't compile) or only successes. Existing methods can't use this.
The authors recast preference optimization as probabilistic inference (the "RL as inference" / EM framework from Dayan & Hinton 1997, MPO, etc.). They derive a single objective that decouples positive and negative learning:
$$J(\pi_\theta) = \alpha \underbrace{\mathbb{E}_{y \sim D_a}[\log \pi_\theta(y|x)]}_{\text{maximize good samples}} - (1-\alpha)\underbrace{\mathbb{E}_{y \sim D_r}[\log \pi_\theta(y|x)]}_{\text{minimize bad samples}} - \beta \, \text{KL}(\pi_{\text{ref}} \,\|\, \pi_\theta)$$

That's it — three terms: pull toward accepted samples, push away from rejected samples, stay close to a reference policy. They call this PMPO (Preference-based MPO).
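The three terms translate directly into a scalar loss. Here is a minimal sketch, assuming the caller has already computed mean log-probabilities under the current policy and a KL estimate against the reference; the function name and signature are illustrative, not the paper's implementation:

```python
import numpy as np

def pmpo_loss(logp_accepted, logp_rejected, kl_ref, alpha=0.5, beta=0.1):
    """Sketch of the PMPO objective as a loss (to minimize).

    logp_accepted / logp_rejected: per-sample log-probs of accepted (D_a)
    and rejected (D_r) responses under the current policy.
    kl_ref: an estimate of KL(pi_ref || pi_theta).
    alpha, beta: trade-off weights from the objective (values are
    illustrative defaults, not the paper's).
    """
    # Negate J: pull toward accepted samples, push away from rejected
    # ones, and stay anchored to the reference policy.
    return -(alpha * np.mean(logp_accepted)
             - (1 - alpha) * np.mean(logp_rejected)
             - beta * kl_ref)
```

Note the decoupling: with `alpha=1.0` the rejected term vanishes (positives-only, essentially KL-regularized supervised fine-tuning), and with `alpha=0.0` the objective trains on rejected samples alone.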
When learning from negatives only, naively minimizing $\log \pi_\theta(y_{\text{bad}})$ causes the policy to collapse (it can drive probability of everything to zero or diverge). The derivation shows a KL anchor to $\pi_{\text{ref}}$ is mathematically required — intuitively, you're "carving out" the bad samples from the reference distribution while keeping everything else intact.
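The carve-out intuition can be seen in a toy example (my construction, not the paper's setup): a 4-way categorical policy trained on a single rejected outcome with `alpha = 0`, anchored to the reference by a forward KL. The anchor strength `beta = 20` is picked so the optimum keeps a little mass on the bad outcome:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

p_ref = np.array([0.4, 0.3, 0.2, 0.1])  # reference policy
bad, beta, lr = 3, 20.0, 0.05           # rejected outcome, KL weight, step size
logits = np.log(p_ref)                  # start from the reference

for _ in range(5000):
    q = softmax(logits)
    # Gradient ascent (wrt logits) on  J = -log q[bad] - beta * KL(p_ref || q):
    # the first term pushes mass off the bad outcome, the second pulls
    # the policy back toward the reference.
    grad = (q - np.eye(4)[bad]) + beta * (p_ref - q)
    logits = logits + lr * grad

q = softmax(logits)
# Mass on the bad outcome drops, while the good outcomes keep their
# probabilities in the same proportions as the reference distribution.
```

Without the KL term the gradient has no stationary point: the bad outcome's logit is pushed down without bound, which is the collapse/divergence the derivation rules out.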
Their ablations confirm this: