• Constitutional AI, Anthropic

    • Simple idea: bootstrap alignment by having the model critique and revise its own outputs against a written set of principles (the "constitution"). Basic flow (a sketch follows at the end of this section):

      image.png

    • Example

      image.png

    • Failure case

      image.png

    • Article, in-depth video
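
    • A minimal sketch of the critique → revision (supervised) phase implied by the flow above; `generate`, the constitution entries, and the prompt templates are hypothetical stand-ins, not the paper's actual prompts:

      ```python
      # Hypothetical sketch of Constitutional AI's critique -> revision loop
      # (supervised phase). `generate` is a stub standing in for any LLM call.

      CONSTITUTION = [
          "Identify ways the response could be harmful, unethical, or misleading.",
          # ... more principles
      ]

      def generate(prompt: str) -> str:
          """Placeholder for a call to the language model."""
          raise NotImplementedError

      def critique_and_revise(user_prompt: str) -> str:
          response = generate(user_prompt)
          for principle in CONSTITUTION:
              critique = generate(
                  f"Prompt: {user_prompt}\nResponse: {response}\n"
                  f"Critique the response according to this principle: {principle}"
              )
              response = generate(
                  f"Prompt: {user_prompt}\nResponse: {response}\nCritique: {critique}\n"
                  "Rewrite the response so it no longer has the problems in the critique."
              )
          # (user_prompt, final response) pairs become SFT data; the RL phase (RLAIF)
          # then uses the model itself to judge which of two responses is better.
          return response
      ```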

  • Direct Preference Optimization (DPO)

    image.png

    • Article
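
    • Minimal sketch of the DPO objective, assuming the per-response log-probs have already been summed over tokens; `beta` is the usual coefficient controlling how far the policy may drift from the reference model:

      ```python
      import torch
      import torch.nn.functional as F

      def dpo_loss(policy_chosen_logps: torch.Tensor,
                   policy_rejected_logps: torch.Tensor,
                   ref_chosen_logps: torch.Tensor,
                   ref_rejected_logps: torch.Tensor,
                   beta: float = 0.1) -> torch.Tensor:
          """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
          chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log pi/pi_ref on y_w
          rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log pi/pi_ref on y_l
          return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
      ```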
  • Reward model training from human pairwise preferences (InstructGPT)

    • Basically Siamese-network training: two different inputs flow through the same network weights, each producing a scalar reward, and the two rewards are then compared via a loss function

    • Pairwise ranking loss, a.k.a. preference loss; the logistic form below corresponds to the Bradley-Terry model (Thurstone's model uses a probit link instead). This assumes A is always preferred over B (see the sketch after the figures below).

      image.png

      image.png
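
    • Sketch of the setup above: one reward model scores both candidates with the same weights, and the loss is $-\log \sigma(r_A - r_B)$ with A preferred; the `RewardModel` here is a toy linear head, not a real LM backbone:

      ```python
      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class RewardModel(nn.Module):
          """Toy stand-in: maps a (prompt, response) feature vector to a scalar
          reward. Real reward models are LM backbones with a scalar head."""
          def __init__(self, dim: int):
              super().__init__()
              self.head = nn.Linear(dim, 1)

          def forward(self, x: torch.Tensor) -> torch.Tensor:
              return self.head(x).squeeze(-1)  # one scalar reward per example

      def pairwise_ranking_loss(model: RewardModel,
                                x_preferred: torch.Tensor,
                                x_rejected: torch.Tensor) -> torch.Tensor:
          r_a = model(x_preferred)   # same weights score both candidates
          r_b = model(x_rejected)    # (the "Siamese" part)
          # Bradley-Terry pairwise ranking loss, assuming A is the preferred response
          return -F.logsigmoid(r_a - r_b).mean()
      ```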

    • We usually assume A is the response preferred by the human ground-truth labelers; otherwise the sign of the difference has to be flipped to reflect the actual direction of preference:

      image.png

    • Recall the sigmoid identity:

    $$ \sigma(r_A - r_B) = \frac{1}{1 + e^{-(r_A - r_B)}} = \frac{e^{r_A}}{e^{r_A} + e^{r_B}} $$

    • Comparison with BCE loss: the pairwise loss is just binary classification where the ground-truth label is usually y = 1, indicating A > B, and we want $\sigma(z) \rightarrow 1$ as well, i.e. $z = r_A - r_B \gg 0$, meaning A is scored far above B (numerical check below):

      image.png
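
    • Quick numerical check of that equivalence: BCE-with-logits on $z = r_A - r_B$ with target $y = 1$ is exactly $-\log \sigma(z)$:

      ```python
      import torch
      import torch.nn.functional as F

      z = torch.randn(8)        # z = r_A - r_B for a batch of pairs
      y = torch.ones_like(z)    # label y = 1 means "A preferred over B"

      bce = F.binary_cross_entropy_with_logits(z, y)
      pairwise = -F.logsigmoid(z).mean()
      assert torch.allclose(bce, pairwise)
      ```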

  • RLHF

    • Build a reward predictor that models human preferences, instead of relying on humans for direct feedback at every step (gray boxes in the diagram are the original RL loop)

    • If using PPO: the value network is usually initialized from the reward model, and the policy network is the language model itself, initialized from the SFT checkpoint (loop sketch below)

      Untitled
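
    • High-level sketch of one RLHF-with-PPO iteration; every component below is a stub, and the actual PPO update (clipped objective, advantage estimation, KL penalty against the frozen SFT model) is only described in comments:

      ```python
      # Hypothetical outline of one RLHF iteration; all components are stubs.

      def sample_prompts():
          """Prompts drawn from the RL prompt distribution."""
          raise NotImplementedError

      def policy_generate(prompts):
          """Policy = the language model, initialized from the SFT checkpoint."""
          raise NotImplementedError

      def reward_model(prompts, responses):
          """Learned from human pairwise preferences; replaces direct human
          feedback at every step."""
          raise NotImplementedError

      def ppo_update(prompts, responses, rewards):
          """PPO step: the value network (usually initialized from the reward
          model) estimates values, and the clipped objective updates the policy,
          typically with a KL penalty toward the frozen SFT model."""
          raise NotImplementedError

      def rlhf_step():
          prompts = sample_prompts()
          responses = policy_generate(prompts)
          rewards = reward_model(prompts, responses)
          ppo_update(prompts, responses, rewards)
      ```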