Both DPO and RLHF search for good policies through optimization. The key difference is the function class each one searches through.
The Subtle but Critical Difference
DPO: Directly parameterizes and searches through Q-functions (which are equivalent to policies)
- Learns: Q*(s,a) = "how good is action a in state s"
- This Q-function must capture the full complexity of generation
RLHF: First learns a simple trajectory-level reward, then derives the optimal policy
- Learns: r(trajectory) = "how good is this complete summary"
- Then computes: Q*(s,a) via an RL algorithm such as PPO (a minimal sketch contrasting the two recipes appears below)
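To make the contrast concrete, here is a minimal sketch of the two training losses on the same preference pairs, assuming per-completion log-probabilities have already been summed. The function and argument names are placeholders for illustration, not the paper's code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO: optimize the policy directly from preference pairs.

    The implicit reward of a completion is
        beta * (log pi(y|x) - log pi_ref(y|x)),
    so the loss is a logistic loss on the reward margin. No separate
    reward model and no RL loop are needed.
    """
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

def reward_model_loss(rm_chosen_scores, rm_rejected_scores):
    """RLHF stage 1: fit a trajectory-level reward model on the same pairs.

    Stage 2 (not shown) then runs an RL algorithm such as PPO against
    this reward to recover the policy -- that second stage is where the
    hard reward -> policy computation happens.
    """
    return -F.logsigmoid(rm_chosen_scores - rm_rejected_scores).mean()

# Toy usage with made-up log-probabilities / scores for a batch of 4 pairs.
logp = lambda: torch.randn(4)
print(dpo_loss(logp(), logp(), logp(), logp()))
print(reward_model_loss(torch.randn(4), torch.randn(4)))
```

The point of the sketch is structural: DPO's loss touches only policy and reference log-probabilities (its implicit reward/Q-function), while the reward-model loss is only stage one of RLHF and still leaves the RL step to be solved.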
Why This Matters
The paper's key insight (Section 3.4) is that the mapping isn't symmetric:
Reward Model  ←→  Optimal Policy  ←→  Q-function
     ↓                  ↓                  ↓
  simple             complex            complex
- Going from reward → policy requires solving RL (hard)
- Going from policy ↔ Q-function is just a log/exp change of variables (easy); the identity is written out after this list
- DPO learns Q-functions directly, so it's learning something as complex as the policy
- RLHF learns rewards first, which can be simpler than the optimal policy
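Written out, the easy direction is the standard soft-optimal-policy identity from KL-regularized (maximum-entropy) RL, where β is the KL coefficient against the reference policy π_ref and V* is the soft value function. This is a sketch of the relation the paper relies on; its exact notation may differ.

```latex
% Policy <-> Q-function: an exp/log change of variables
\begin{align*}
\pi^*(a \mid s) &= \exp\!\big((Q^*(s,a) - V^*(s))/\beta\big),
\qquad
V^*(s) = \beta \log \sum_{a'} \exp\!\big(Q^*(s,a')/\beta\big), \\
Q^*(s,a) &= \beta \log \pi^*(a \mid s) + V^*(s).
\end{align*}

% Reward -> policy: requires actually solving the KL-regularized RL problem
\begin{equation*}
\pi^* \,=\, \arg\max_{\pi}\; \mathbb{E}_{\pi}\big[\, r \,\big] \;-\; \beta\, D_{\mathrm{KL}}\big(\pi \,\|\, \pi_{\mathrm{ref}}\big).
\end{equation*}
```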
The Concrete Example
For summarization:
- Simple reward model: "Does this summary capture the main points?"