From Bradley-Terry to InfoNCE: one loss, five names

The through-line: every method here is "softmax over some scores, cross-entropy on which entry is good." They differ only in (i) how many entries, (ii) what the score function is, (iii) how much ordering info you have.


0. The one identity you need

A sigmoid of a difference is a 2-way softmax:

$$ \sigma(a - b) = \frac{1}{1 + e^{-(a-b)}} = \frac{e^a}{e^a + e^b} $$

Keep this in your pocket. Every "pairwise" loss below is secretly a 2-class softmax, and every "$K$-way" loss is the obvious generalization.
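The identity is easy to verify numerically. A minimal sketch (function names are mine, not from any library):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def two_way_softmax(a, b):
    # Probability of the first entry under softmax over [a, b].
    m = max(a, b)  # subtract the max for numerical stability
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb)

a, b = 1.7, -0.3
# sigma(a - b) == e^a / (e^a + e^b), to floating-point precision
assert abs(sigmoid(a - b) - two_way_softmax(a, b)) < 1e-12
```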


1. Bradley-Terry (1952) — pairwise comparisons

A probability model for "$i$ beats $j$." Each item has a latent strength $s_i \in \mathbb{R}$:

$$ P(i \succ j) = \sigma(s_i - s_j) = \frac{e^{s_i}}{e^{s_i} + e^{s_j}} $$

Fitting BT to win/loss data is just logistic regression on score differences:

$$ \mathcal{L}_{\text{BT}} = -\log \sigma(s_i - s_j) \quad \text{for each observed } i \succ j $$
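To make "logistic regression on score differences" concrete, here is a minimal SGD fit of BT strengths from win/loss pairs (a sketch, not a production fitter; the function name and defaults are illustrative):

```python
import math

def fit_bt(comparisons, n_items, lr=0.1, epochs=200):
    """Fit Bradley-Terry strengths by SGD on -log sigma(s_i - s_j).

    comparisons: list of (winner, loser) index pairs.
    """
    s = [0.0] * n_items
    for _ in range(epochs):
        for i, j in comparisons:
            p = 1.0 / (1.0 + math.exp(-(s[i] - s[j])))  # P(i beats j)
            g = 1.0 - p  # gradient of log sigma(s_i - s_j) w.r.t. s_i
            s[i] += lr * g  # winner's strength goes up
            s[j] -= lr * g  # loser's strength goes down, symmetrically
    return s
```

Note the model is identified only up to an additive constant — shifting every $s_i$ by the same amount leaves all the differences, and hence all the probabilities, unchanged.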

Elo ratings are BT fit with online SGD. RLHF reward models are BT where $s = r_\phi(x, y)$ is a learned network.
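The Elo claim in one function: the standard update $R \leftarrow R + K(\text{outcome} - \text{expected})$ is exactly one SGD step on the BT log-loss with learning rate proportional to $K$, using chess's base-10 logistic at scale 400 (a sketch; parameter defaults are the conventional chess values):

```python
def elo_update(r_a, r_b, a_won, K=32, scale=400):
    # Expected score of A: BT with a base-10 logistic and 400-point scale.
    e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))
    outcome = 1.0 if a_won else 0.0
    # One SGD step on -log P(observed result); zero-sum by construction.
    r_a_new = r_a + K * (outcome - e_a)
    r_b_new = r_b - K * (outcome - e_a)
    return r_a_new, r_b_new
```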


2. DPO (Rafailov et al. 2023) — BT with a specific score

Setting. You have preference triples $(x, y^+, y^-)$ and want to fine-tune a policy $\pi_\theta$ directly, no separate reward model, no RL loop.

Step 1: implicit reward. KL-regularized RL, $\max_\pi \mathbb{E}[r] - \beta\,\text{KL}(\pi \,\|\, \pi_{\text{ref}})$, has closed-form optimum $\pi^*(y|x) \propto \pi_{\text{ref}}(y|x)\,e^{r(x,y)/\beta}$. Invert it — any policy implicitly defines a reward:

$$ r_\theta(x, y) = \beta \log\frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} + \underbrace{\beta\log Z(x)}_{\text{const in }y} $$
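In practice the log-ratio is computed from per-token log-probabilities of the response under the two models (the partition-function term is dropped, since it cancels in the BT difference anyway). A minimal sketch, assuming you already have the token log-probs:

```python
def implicit_reward(logp_theta, logp_ref, beta=0.1):
    """DPO implicit reward, up to the const-in-y term beta*log Z(x).

    logp_theta, logp_ref: per-token log-probs of response y under the
    policy and the frozen reference model, respectively.
    """
    # log pi(y|x) is the sum of per-token log-probs; the reward is
    # beta times the log-ratio of policy to reference.
    return beta * (sum(logp_theta) - sum(logp_ref))
```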

Step 2: plug into BT. Use $r_\theta$ as the BT score. The const-in-$y$ term cancels in the difference: