Both DPO and RLHF search for good policies through optimization. The key difference is the function class each one searches over.

The Subtle but Critical Difference

DPO: Directly parameterizes and searches over Q-functions (which, under the KL-regularized objective, are in one-to-one correspondence with policies)

RLHF: First learns a simple trajectory-level reward model, then recovers the policy by optimizing the KL-regularized objective against that reward (in practice with an RL algorithm such as PPO); see the sketch below
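
A minimal sketch of the contrast, assuming per-response log-probabilities have already been summed into tensors; the function and argument names are illustrative, not taken from the paper:

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # DPO: preference learning directly in policy space. The policy's
    # log-ratios against the reference define an implicit reward
    #     r(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x)),
    # so no separate reward model or RL loop is needed.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry negative log-likelihood of the observed preferences.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

def rlhf_surrogate_loss(reward_model_scores, policy_logps, ref_logps, beta=0.1):
    # RLHF: a separately trained reward model scores whole responses; the
    # policy is then optimized (in practice with PPO) to maximize
    #     E[r(x, y)] - beta * KL(pi_theta || pi_ref).
    # Negated here as a loss, with a crude per-sample KL estimate.
    kl_estimate = policy_logps - ref_logps
    return -(reward_model_scores - beta * kl_estimate).mean()

The difference in machinery is the point: the DPO loss touches nothing but policy and reference log-probabilities, while RLHF needs a separately trained reward model plus an RL step to reach the same optimum.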

Why This Matters

The paper's key insight (Section 3.4) is that, while these objects determine one another, the mapping isn't symmetric in complexity:

Reward Model ←→ Optimal Policy ←→ Q-function
     ↓              ↓                    ↓
   simple      complex              complex
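
Restating the standard KL-regularized relations for context (β is the KL coefficient, π_ref the reference model; these are the usual closed forms, not results specific to this note): the reward pins down the optimal policy, and the optimal policy is a softmax over a token-level Q-function,

\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\left(\frac{r(x, y)}{\beta}\right),
\qquad
Z(x) = \sum_{y'} \pi_{\mathrm{ref}}(y' \mid x)\, \exp\!\left(\frac{r(x, y')}{\beta}\right),

\beta \log \pi^{*}(a_t \mid s_t) = Q^{*}(s_t, a_t) - V^{*}(s_t).

A trajectory-level r can be as simple as one score for a finished response, but π* and Q* must absorb the partition function Z(x), i.e. the exponentiated future reward of every continuation of every prefix, which is where the simple-versus-complex asymmetry in the diagram comes from.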

The Concrete Example

For summarization: