The through-line: every method here is "softmax over some scores, cross-entropy on which entry is good." They differ only in (i) how many entries, (ii) what the score function is, (iii) how much ordering info you have.
A sigmoid of a difference is a 2-way softmax:
$$ \sigma(a - b) = \frac{1}{1 + e^{-(a-b)}} = \frac{e^a}{e^a + e^b} $$
Keep this in your pocket. Every "pairwise" loss below is secretly a 2-class softmax, and every "$K$-way" loss is the obvious generalization.
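A two-line numeric check of the identity, assuming nothing beyond the standard library (function names here are mine, not from any source):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def softmax2_first(a: float, b: float) -> float:
    # Probability of the first entry under a 2-way softmax.
    m = max(a, b)  # subtract the max for numerical stability
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb)

a, b = 1.3, -0.4
assert abs(sigmoid(a - b) - softmax2_first(a, b)) < 1e-12
```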
A probability model for "$i$ beats $j$." Each item has a latent strength $s_i \in \mathbb{R}$:
$$ P(i \succ j) = \sigma(s_i - s_j) = \frac{e^{s_i}}{e^{s_i} + e^{s_j}} $$
Fitting BT to win/loss data is just logistic regression on score differences:
$$ \mathcal{L}_{\text{BT}} = -\log \sigma(s_i - s_j) \quad \text{for each observed } i \succ j $$
Elo ratings are BT fit with online SGD. RLHF reward models are BT where $s = r_\phi(x, y)$ is a learned network.
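A minimal sketch of that correspondence: fitting BT strengths by plain SGD on $-\log\sigma(s_w - s_l)$, which is exactly an Elo-style "winner up, loser down by the same amount" update (the function and hyperparameters below are illustrative, not from any particular Elo implementation):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def bt_sgd(matches, n_items, lr=0.1, epochs=200):
    """Fit Bradley-Terry strengths s_i by SGD on -log sigma(s_w - s_l).

    `matches` is a list of (winner, loser) index pairs.
    """
    s = [0.0] * n_items
    for _ in range(epochs):
        for w, l in matches:
            p = sigmoid(s[w] - s[l])  # predicted P(winner beats loser)
            g = 1.0 - p               # gradient of -log sigma w.r.t. s_w (negated)
            s[w] += lr * g            # Elo-style: winner's rating goes up,
            s[l] -= lr * g            # loser's goes down by the same amount
    return s

# Item 0 beats item 1 in 8 of 10 matches; the fitted model should
# predict P(0 beats 1) near the empirical 0.8.
matches = [(0, 1)] * 8 + [(1, 0)] * 2
s = bt_sgd(matches, 2)
```

The MLE here is the same as logistic regression with the score difference as the only feature; Elo just runs this update online, one match at a time, with a fixed learning rate (the "K-factor").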
Setting. You have preference triples $(x, y^+, y^-)$ and want to fine-tune a policy $\pi_\theta$ directly, no separate reward model, no RL loop.
Step 1: implicit reward. KL-regularized RL, $\max_\pi \mathbb{E}[r] - \beta\,\text{KL}(\pi \,\|\, \pi_{\text{ref}})$, has closed-form optimum $\pi^*(y|x) \propto \pi_{\text{ref}}(y|x)\,e^{r(x,y)/\beta}$. Invert it — any policy implicitly defines a reward:
$$ r_\theta(x, y) = \beta \log\frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} + \underbrace{\beta\log Z(x)}_{\text{const in }y} $$
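In code, the implicit reward is just a scaled log-ratio of sequence probabilities. A sketch, assuming you already have summed token log-probs for each response under the policy and the reference model (all names and numbers below are made up for illustration):

```python
import math

def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """r_theta(x, y) up to the constant-in-y term beta * log Z(x):
    beta * log( pi_theta(y|x) / pi_ref(y|x) )."""
    return beta * (logp_policy - logp_ref)

# Hypothetical sequence log-probs for a chosen (y+) and rejected (y-) response.
logp_pos, logp_ref_pos = -12.0, -14.0   # policy raised the probability of y+
logp_neg, logp_ref_neg = -20.0, -18.0   # policy lowered the probability of y-

# The beta * log Z(x) constant is the same for both responses,
# so it cancels in the reward difference.
margin = implicit_reward(logp_pos, logp_ref_pos) - implicit_reward(logp_neg, logp_ref_neg)
prob_pos_wins = 1.0 / (1.0 + math.exp(-margin))  # BT probability that y+ beats y-
```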
Step 2: plug into BT. Use $r_\theta$ as the BT score. The const-in-$y$ term cancels in the difference: