Constitutional AI, Anthropic
Simple idea: bootstrap alignment with AI feedback rather than new human labels. Basic flow: the model drafts a response, critiques it against a set of written principles (the constitution), and revises it; the revisions become SFT data, and a preference model trained on AI-generated comparisons then drives the RL stage (RLAIF). A sketch of the critique-revise loop follows.
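A minimal sketch of the supervised critique-revise phase, assuming a hypothetical `generate(prompt)` callable that samples from the base model; the two-principle constitution here is a toy stand-in, not Anthropic's actual list.

```python
import random
from typing import Callable

# Toy constitution: in practice this is a longer list of written principles.
CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is most honest and avoids deception.",
]

def critique_and_revise(user_prompt: str, generate: Callable[[str], str]) -> dict:
    """One round of the supervised CAI loop: draft -> critique -> revision."""
    draft = generate(user_prompt)
    principle = random.choice(CONSTITUTION)  # sample one principle per round
    critique = generate(
        f"Critique this response using the principle below.\n"
        f"Principle: {principle}\nPrompt: {user_prompt}\nResponse: {draft}"
    )
    revision = generate(
        f"Rewrite the response to address the critique.\n"
        f"Critique: {critique}\nOriginal response: {draft}"
    )
    # (prompt, revision) pairs become the SFT data for the next phase.
    return {"prompt": user_prompt, "critique": critique, "revision": revision}
```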
Example
Failure case
Reward model training from human pairwise preferences (InstructGPT)
Basically Siamese-network training: two different inputs (two candidate responses to the same prompt) flow through the same network weights, each producing a scalar reward, and the two rewards are then compared via a loss function.
Pairwise ranking loss, also called preference loss or Bradley-Terry loss (Thurstone's model is the probit analogue). The convention here assumes human ground-truth labelers preferred A over B, giving

$$ \mathcal{L} = -\log \sigma(r_A - r_B) $$

If labels can go either direction, swap A and B (or flip the sign of the difference) to reflect the direction.
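A minimal PyTorch sketch of the Siamese setup and the pairwise loss; `RewardModel` here is a toy linear scorer standing in for a transformer with a scalar head, and the tensor shapes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy stand-in: in practice a transformer backbone with a scalar reward head."""
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scorer(x).squeeze(-1)  # one scalar reward per example

def pairwise_loss(reward_model: nn.Module, chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected); both inputs pass through the same weights."""
    r_chosen = reward_model(chosen)      # reward for the preferred response (A)
    r_rejected = reward_model(rejected)  # reward for the rejected response (B)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Usage: random features standing in for encoded (prompt, response) pairs.
model = RewardModel(dim=16)
loss = pairwise_loss(model, torch.randn(8, 16), torch.randn(8, 16))
loss.backward()
```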
Recall
$$ \sigma(r_A - r_B) = \frac{1}{1 + e^{-(r_A - r_B)}} = \frac{e^{r_A}}{e^{r_A} + e^{r_B}} $$
Comparison with BCE loss: the pairwise loss is just binary classification on the logit $z = r_A - r_B$, with ground-truth label $y = 1$ indicating A > B; training pushes $\sigma(z) \rightarrow 1$, i.e. $r_A \gg r_B$. With $y = 1$, the BCE loss $-[y \log \sigma(z) + (1-y)\log(1-\sigma(z))]$ reduces to $-\log \sigma(z)$, which is exactly the pairwise loss.
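A quick numerical check of that equivalence, with made-up reward values: BCE on the logit $z = r_A - r_B$ with target $y = 1$ matches $-\log \sigma(z)$.

```python
import torch
import torch.nn.functional as F

r_A = torch.tensor([2.0, 0.5, -1.0])  # rewards for the preferred responses
r_B = torch.tensor([1.0, 0.7, -3.0])  # rewards for the rejected responses
z = r_A - r_B                         # logit = reward difference

pairwise = -F.logsigmoid(z)           # -log sigma(z)
bce = F.binary_cross_entropy_with_logits(z, torch.ones_like(z), reduction="none")
print(torch.allclose(pairwise, bce))  # True: the two losses coincide when y = 1
```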
RLHF
Build a reward predictor that models human preferences, instead of relying on humans for direct feedback at every step (in the diagram, the gray boxes are the original RL loop).
If using PPO: the value network is usually initialized from the reward model; the policy network is the language model itself, initialized from the SFT phase.
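A hedged sketch of that wiring, with `nn.Linear` toys standing in for the SFT language model and the reward model; the frozen reference copy (used for the usual KL penalty against the SFT policy) is standard in RLHF PPO but not spelled out above.

```python
import copy
import torch.nn as nn

def init_ppo_networks(sft_model: nn.Module, reward_model: nn.Module):
    """Seed the networks used in RLHF PPO from the SFT and reward-model checkpoints."""
    policy = copy.deepcopy(sft_model)     # trainable actor, initialized from SFT
    reference = copy.deepcopy(sft_model)  # frozen copy, for the KL penalty term
    value = copy.deepcopy(reward_model)   # trainable critic, seeded from the RM
    reward = reward_model                 # frozen scorer for completed responses
    for frozen in (reference, reward):
        for p in frozen.parameters():
            p.requires_grad_(False)
    return policy, reference, value, reward

# Usage with toy stand-ins for the SFT model and reward model.
policy, reference, value, reward = init_ppo_networks(nn.Linear(8, 8), nn.Linear(8, 1))
```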