Paper
- Achieves o1-level performance, starting from their open-weight base model, DeepSeek-V3-Base.
- R1-Zero: just RL on the base model with this prompt and rule-based outcome + format rewards, nothing else (no SFT, no learned reward model); see the reward sketch after this list. Weird behaviors: language mixing and poor readability.
- R1: cold-start with a small amount of long-CoT data to fine-tune the base model as the initial RL actor. They collect this data via several approaches: few-shot prompting with a long CoT as an example, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1-Zero outputs in a readable format, and refining the results through post-processing by human annotators.
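
A minimal sketch of what the R1-Zero-style rule-based reward could look like, assuming a `<think>...</think><answer>...</answer>` template and an exact-match answer check; the tag names, the `gold_answer` argument, and the exact-match comparison are my simplifications (the paper relies on rule-based checkers such as math-answer verification and running code against test cases), not their code.

```python
import re

# Assumed template from the R1-Zero prompt: reasoning in <think>, final answer in <answer>.
TEMPLATE = re.compile(r"^<think>.*?</think>\s*<answer>(.*?)</answer>$", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the think/answer template, else 0.0."""
    return 1.0 if TEMPLATE.match(completion.strip()) else 0.0

def outcome_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the extracted final answer matches the reference.
    Exact string match here; the paper uses rule-based checks instead
    (e.g. math answer verification, executing code against test cases)."""
    m = TEMPLATE.match(completion.strip())
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip() == gold_answer.strip() else 0.0

def total_reward(completion: str, gold_answer: str) -> float:
    # No learned reward model anywhere: just accuracy + format.
    return outcome_reward(completion, gold_answer) + format_reward(completion)
```

The point the note above makes is that nothing learned sits in the reward, which is what keeps the setup simple and hard to hack.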

Failures
PRM: requires defining fine-grained "steps," labeling whether each intermediate step is correct is hard, and a learned PRM invites reward hacking. (Toy per-step scoring sketch below.)
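
To make the "define steps" complaint concrete, a toy process-reward scorer under my own assumptions: the newline-based `split_steps` and the `prm_score` callable are hypothetical stand-ins, which is the point of the criticism, since the step boundary is arbitrary, per-step labels are expensive to collect, and a learned scorer can be gamed.

```python
from typing import Callable, List

def split_steps(chain_of_thought: str) -> List[str]:
    # Hypothetical step boundary: one "step" per non-empty line.
    # There is no canonical definition of a reasoning step, which is problem #1.
    return [line for line in chain_of_thought.splitlines() if line.strip()]

def process_reward(
    chain_of_thought: str,
    prm_score: Callable[[str], float],  # learned step scorer (hard to get labels for)
) -> float:
    """Average per-step score; min or product are other common aggregations."""
    steps = split_steps(chain_of_thought)
    if not steps:
        return 0.0
    return sum(prm_score(step) for step in steps) / len(steps)
```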

MCTS: the stated issue is that token generation has a far larger search space than games like chess, and the fine-grained value model that guides each search step is hard to train, so the search can't iteratively improve the policy; I'm still not fully sure what that means in practice. (Toy value-guided search sketch below.)
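
How I read the failure (hedged, since the note above is already unsure): during search, a value model has to score partial generations to decide which branches to expand, and a value model that is accurate on arbitrary reasoning prefixes is hard to train, so search quality is capped by it. A toy value-guided expansion loop, greatly simplified from real MCTS (no tree statistics or backup), with every callable hypothetical:

```python
from typing import Callable, List, Tuple

def value_guided_search(
    prompt: str,
    propose: Callable[[str, int], List[str]],  # policy: prefix -> k candidate continuations
    value: Callable[[str], float],             # value model: partial solution -> estimated quality
    is_terminal: Callable[[str], bool],        # has the solution finished?
    beam: int = 4,
    k: int = 8,
    max_depth: int = 16,
) -> str:
    """Keep the `beam` partial solutions the value model likes best at each depth.
    If `value` is noisy on partial reasoning, the search amplifies its mistakes."""
    frontier: List[Tuple[float, str]] = [(0.0, prompt)]
    for _ in range(max_depth):
        candidates: List[Tuple[float, str]] = []
        for _, prefix in frontier:
            if is_terminal(prefix):
                candidates.append((value(prefix), prefix))
                continue
            for continuation in propose(prefix, k):
                extended = prefix + continuation
                candidates.append((value(extended), extended))
        candidates.sort(key=lambda scored: scored[0], reverse=True)
        frontier = candidates[:beam]
        if all(is_terminal(prefix) for _, prefix in frontier):
            break
    return max(frontier, key=lambda scored: scored[0])[1]
```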
