Both DPO and RLHF search for good policies through optimization. The key difference is the function class each one searches through.
The Subtle but Critical Difference
DPO: Directly parameterizes and searches through Q-functions (which are equivalent to policies)
- Learns: Q*(s,a) = "how good is action a in state s"
- This Q-function must capture the full complexity of generation
RLHF: First learns a simple trajectory-level reward, then derives the optimal policy
- Learns: r(trajectory) = "how good is this complete summary"
- Then computes: Q*(s,a) via an RL algorithm such as PPO (a minimal sketch contrasting the two recipes appears below)
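To make the contrast concrete, here is a minimal sketch of the two training losses on the same preference pairs, assuming per-completion log-probabilities have already been summed. The function and argument names are placeholders for illustration, not the paper's code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO: optimize the policy directly from preference pairs.

    The implicit reward of a completion is
        beta * (log pi(y|x) - log pi_ref(y|x)),
    so the loss is a logistic loss on the reward margin. No separate
    reward model and no RL loop are needed.
    """
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

def reward_model_loss(rm_chosen_scores, rm_rejected_scores):
    """RLHF stage 1: fit a trajectory-level reward model on the same pairs.

    Stage 2 (not shown) then runs an RL algorithm such as PPO against
    this reward to recover the policy -- that second stage is where the
    hard reward -> policy computation happens.
    """
    return -F.logsigmoid(rm_chosen_scores - rm_rejected_scores).mean()

# Toy usage with made-up log-probabilities / scores for a batch of 4 pairs.
logp = lambda: torch.randn(4)
print(dpo_loss(logp(), logp(), logp(), logp()))
print(reward_model_loss(torch.randn(4), torch.randn(4)))
```

The point of the sketch is structural: DPO's loss touches only policy and reference log-probabilities (its implicit reward/Q-function), while the reward-model loss is only stage one of RLHF and still leaves the RL step to be solved.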
Why This Matters
The paper's key insight (Section 3.4) is that the mapping isn't symmetric:
Reward Model  ←→  Optimal Policy  ←→  Q-function
     ↓                  ↓                  ↓
  simple             complex            complex
- Going from reward → policy requires solving RL (hard)
- Going from policy ↔ Q-function is just a log/exp change of variables (easy); the identity is written out after this list
- DPO learns Q-functions directly, so it's learning something as complex as the policy
- RLHF learns rewards first, which can be simpler than the optimal policy
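Written out, the easy direction is the standard soft-optimal-policy identity from KL-regularized (maximum-entropy) RL, where β is the KL coefficient against the reference policy π_ref and V* is the soft value function. This is a sketch of the relation the paper relies on; its exact notation may differ.

```latex
% Policy <-> Q-function: an exp/log change of variables
\begin{align*}
\pi^*(a \mid s) &= \exp\!\big((Q^*(s,a) - V^*(s))/\beta\big),
\qquad
V^*(s) = \beta \log \sum_{a'} \exp\!\big(Q^*(s,a')/\beta\big), \\
Q^*(s,a) &= \beta \log \pi^*(a \mid s) + V^*(s).
\end{align*}

% Reward -> policy: requires actually solving the KL-regularized RL problem
\begin{equation*}
\pi^* \,=\, \arg\max_{\pi}\; \mathbb{E}_{\pi}\big[\, r \,\big] \;-\; \beta\, D_{\mathrm{KL}}\big(\pi \,\|\, \pi_{\mathrm{ref}}\big).
\end{equation*}
```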
The Concrete Example
For summarization:
- Simple reward model: "Does this summary capture the main points?"