Authors: Abdolmaleki et al., Google DeepMind — ICLR 2025

The Problem

Methods like DPO require paired preference data: for each prompt, you need both a "good" response and a "bad" response to compare. But in many real settings you only have one kind of label — e.g., a dataset containing only failures (unsafe outputs, code that didn't compile) or only successes. Existing pairwise methods can't use such one-sided data.

The Core Idea

The authors recast preference optimization as probabilistic inference (the "RL as inference" / EM framework from Dayan & Hinton 1997, MPO, etc.). They derive a single objective that decouples positive and negative learning:

$$J(\pi_\theta) = \alpha \underbrace{\mathbb{E}_{y \sim D_a}[\log \pi_\theta(y|x)]}_{\text{maximize good samples}} - (1-\alpha)\,\underbrace{\mathbb{E}_{y \sim D_r}[\log \pi_\theta(y|x)]}_{\text{minimize bad samples}} - \beta\,\text{KL}(\pi_{\text{ref}} \,\|\, \pi_\theta)$$


That's it — three terms: pull toward accepted samples, push away from rejected samples, stay close to a reference policy. They call this PMPO (Preference-based MPO).
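The three-term shape of the objective can be sketched in a few lines. This is my own illustration of Eq. 10's structure, not the authors' code: `pmpo_objective`, its arguments, and the way the KL is passed in as a precomputed estimate are all assumptions.

```python
import numpy as np

# Hypothetical sketch of the PMPO objective's three-term structure.
# logp_accepted / logp_rejected: log pi_theta(y|x) for samples from the
# accepted set D_a and rejected set D_r; kl_ref: an estimate of
# KL(pi_ref || pi_theta). All names here are illustrative.
def pmpo_objective(logp_accepted, logp_rejected, kl_ref, alpha=0.5, beta=0.1):
    pull = np.mean(logp_accepted)   # pull toward accepted samples
    push = np.mean(logp_rejected)   # push away from rejected samples
    return alpha * pull - (1 - alpha) * push - beta * kl_ref

# Gradient-based optimizers minimize, so train on the negation.
def pmpo_loss(logp_accepted, logp_rejected, kl_ref, alpha=0.5, beta=0.1):
    return -pmpo_objective(logp_accepted, logp_rejected, kl_ref, alpha, beta)
```

Setting $\alpha = 1$ recovers pure positive-only learning; $\alpha = 0$ is pure negative-only learning, where the KL term does all the stabilizing work.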

How They Derive It (concretely)

  1. Standard EM step (positives only): Define a binary "success" variable $S$. Maximizing $p(S=1)$ via EM gives the well-known result: do weighted max-likelihood on the good samples (Eq. 6). Bad samples are simply thrown away.
  2. The trick for negatives: They rewrite the optimal E-step distribution using the complement: $p(S=1|y,x) = 1 - p(S=0|y,x)$. Plugging this back into the M-step and rearranging algebraically produces Eq. 8 — a negative log-likelihood on bad samples plus a KL term to the reference policy that falls out of the math automatically.
  3. Combine both with a mixing weight $\alpha$ → Eq. 10.
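The complement trick in step 2 can be sketched algebraically (my reconstruction of the step, assuming the rejected samples are drawn from $\pi_{\text{ref}}$):

$$\mathbb{E}_{y \sim \pi_{\text{ref}}}\big[p(S{=}1|y,x)\log\pi_\theta(y|x)\big] = \underbrace{\mathbb{E}_{y \sim \pi_{\text{ref}}}[\log\pi_\theta(y|x)]}_{=\,-\text{KL}(\pi_{\text{ref}}\|\pi_\theta)\,-\,H(\pi_{\text{ref}})} - \mathbb{E}_{y \sim \pi_{\text{ref}}}\big[p(S{=}0|y,x)\log\pi_\theta(y|x)\big]$$

Since $H(\pi_{\text{ref}})$ is constant in $\theta$, maximizing the left-hand side is equivalent to minimizing a weighted log-likelihood on the bad samples plus $\text{KL}(\pi_{\text{ref}}\|\pi_\theta)$: the KL anchor is not a regularizer bolted on afterward, it falls out of rewriting $p(S{=}1)$ as $1 - p(S{=}0)$.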

The Key Insight: Why the KL Matters

When learning from negatives only, naively minimizing $\log \pi_\theta(y_{\text{bad}})$ causes the policy to collapse (it can drive probability of everything to zero or diverge). The derivation shows a KL anchor to $\pi_{\text{ref}}$ is mathematically required — intuitively, you're "carving out" the bad samples from the reference distribution while keeping everything else intact.
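A toy experiment (my own illustration, not from the paper) makes the collapse concrete: take a categorical policy over 5 actions, call action 0 "bad", and minimize $\log\pi_\theta(\text{bad}) + \beta\,\text{KL}(\pi_{\text{ref}}\|\pi_\theta)$ by gradient descent. With $\beta = 0$ the bad action's logit diverges to $-\infty$; with a large enough $\beta$ the policy settles at a fixed point that carves probability away from the bad action while staying near the reference.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy illustration (hypothetical, not the paper's experiment): push down
# pi(bad) with and without a KL anchor to a uniform reference policy.
def unlearn_bad(beta, steps=1000, lr=0.1):
    logits = np.zeros(5)
    ref = np.full(5, 0.2)  # uniform reference policy
    for _ in range(steps):
        pi = softmax(logits)
        # d/dz_j log pi_0 = 1{j=0} - pi_j ;  d/dz_j KL(ref||pi) = pi_j - ref_j
        grad = (np.eye(5)[0] - pi) + beta * (pi - ref)
        logits -= lr * grad
    return softmax(logits)

pi_free = unlearn_bad(beta=0.0)       # no anchor: pi(bad) driven toward 0
pi_anchored = unlearn_bad(beta=10.0)  # anchored: pi(bad) settles near 1/9
```

With $\beta = 10$ and a uniform reference, the stationary point is $\pi(\text{bad}) = 1/9$ with the remaining mass spread evenly; without the KL, nothing stops the logit from running off indefinitely.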

Their ablations confirm this: