Constitutional AI, Anthropic
Simple idea: bootstrap alignment with AI feedback rather than new human labels. Basic flow: the model drafts a response, critiques it against a set of written principles (the constitution), and revises it; the revisions become SFT data, and a preference model trained on AI-generated comparisons then drives the RL stage (RLAIF). A sketch of the critique-revise loop follows.
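A minimal sketch of the supervised critique-revise phase, assuming a hypothetical `generate(prompt)` callable that samples from the base model; the two-principle constitution here is a toy stand-in, not Anthropic's actual list.

```python
import random
from typing import Callable

# Toy constitution: in practice this is a longer list of written principles.
CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is most honest and avoids deception.",
]

def critique_and_revise(user_prompt: str, generate: Callable[[str], str]) -> dict:
    """One round of the supervised CAI loop: draft -> critique -> revision."""
    draft = generate(user_prompt)
    principle = random.choice(CONSTITUTION)  # sample one principle per round
    critique = generate(
        f"Critique this response using the principle below.\n"
        f"Principle: {principle}\nPrompt: {user_prompt}\nResponse: {draft}"
    )
    revision = generate(
        f"Rewrite the response to address the critique.\n"
        f"Critique: {critique}\nOriginal response: {draft}"
    )
    # (prompt, revision) pairs become the SFT data for the next phase.
    return {"prompt": user_prompt, "critique": critique, "revision": revision}
```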
Example
Failure case
Reward model training from human pairwise preferences (InstructGPT)
Basically Siamese-network training: two different inputs (two candidate responses to the same prompt) flow through the same network weights, each producing a scalar reward, and the two rewards are then compared via a loss function.
Pairwise ranking loss, also called preference loss or Bradley-Terry loss (Thurstone's model is the probit analogue). The convention here assumes human ground-truth labelers preferred A over B, giving

$$ \mathcal{L} = -\log \sigma(r_A - r_B) $$

If labels can go either direction, swap A and B (or flip the sign of the difference) to reflect the direction.
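A minimal PyTorch sketch of the Siamese setup and the pairwise loss; `RewardModel` here is a toy linear scorer standing in for a transformer with a scalar head, and the tensor shapes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy stand-in: in practice a transformer backbone with a scalar reward head."""
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scorer(x).squeeze(-1)  # one scalar reward per example

def pairwise_loss(reward_model: nn.Module, chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected); both inputs pass through the same weights."""
    r_chosen = reward_model(chosen)      # reward for the preferred response (A)
    r_rejected = reward_model(rejected)  # reward for the rejected response (B)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Usage: random features standing in for encoded (prompt, response) pairs.
model = RewardModel(dim=16)
loss = pairwise_loss(model, torch.randn(8, 16), torch.randn(8, 16))
loss.backward()
```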
Recall
$$ \sigma(r_A - r_B) = \frac{1}{1 + e^{-(r_A - r_B)}} = \frac{e^{r_A}}{e^{r_A} + e^{r_B}} $$
Comparison with BCE loss: the pairwise loss is just binary classification on the logit $z = r_A - r_B$, with ground-truth label $y = 1$ indicating A > B; training pushes $\sigma(z) \rightarrow 1$, i.e. $r_A \gg r_B$. With $y = 1$, the BCE loss $-[y \log \sigma(z) + (1-y)\log(1-\sigma(z))]$ reduces to $-\log \sigma(z)$, which is exactly the pairwise loss.
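A quick numerical check of that equivalence, with made-up reward values: BCE on the logit $z = r_A - r_B$ with target $y = 1$ matches $-\log \sigma(z)$.

```python
import torch
import torch.nn.functional as F

r_A = torch.tensor([2.0, 0.5, -1.0])  # rewards for the preferred responses
r_B = torch.tensor([1.0, 0.7, -3.0])  # rewards for the rejected responses
z = r_A - r_B                         # logit = reward difference

pairwise = -F.logsigmoid(z)           # -log sigma(z)
bce = F.binary_cross_entropy_with_logits(z, torch.ones_like(z), reduction="none")
print(torch.allclose(pairwise, bce))  # True: the two losses coincide when y = 1
```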
RLHF
Build a reward predictor that models human preferences, instead of relying on humans for direct feedback at every step (in the diagram, the gray boxes are the original RL loop).
If using PPO: the value network is usually initialized from the reward model; the policy network is the language model itself, initialized from the SFT phase.
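A hedged sketch of that wiring, with `nn.Linear` toys standing in for the SFT language model and the reward model; the frozen reference copy (used for the usual KL penalty against the SFT policy) is standard in RLHF PPO but not spelled out above.

```python
import copy
import torch.nn as nn

def init_ppo_networks(sft_model: nn.Module, reward_model: nn.Module):
    """Seed the networks used in RLHF PPO from the SFT and reward-model checkpoints."""
    policy = copy.deepcopy(sft_model)     # trainable actor, initialized from SFT
    reference = copy.deepcopy(sft_model)  # frozen copy, for the KL penalty term
    value = copy.deepcopy(reward_model)   # trainable critic, seeded from the RM
    reward = reward_model                 # frozen scorer for completed responses
    for frozen in (reference, reward):
        for p in frozen.parameters():
            p.requires_grad_(False)
    return policy, reference, value, reward

# Usage with toy stand-ins for the SFT model and reward model.
policy, reference, value, reward = init_ppo_networks(nn.Linear(8, 8), nn.Linear(8, 1))
```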