Paper
- Achieves o1-level performance, starting from their open-weight base model, DeepSeek-V3-Base.
- R1-Zero: just RL on the base model with this prompt and rule-based outcome + format rewards, nothing else (no SFT, no learned reward model); see the reward sketch after this list. Weird behaviors: language mixing and poor readability.
- R1: cold-start with a small amount of long-CoT data to fine-tune the base model as the initial RL actor. They collect this data via several approaches: few-shot prompting with a long CoT as an example, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1-Zero outputs in a readable format, and refining the results through post-processing by human annotators.
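
A minimal sketch of what the R1-Zero-style rule-based reward could look like, assuming a `<think>...</think><answer>...</answer>` template and an exact-match answer check; the tag names, the `gold_answer` argument, and the exact-match comparison are my simplifications (the paper relies on rule-based checkers such as math-answer verification and running code against test cases), not their code.

```python
import re

# Assumed template from the R1-Zero prompt: reasoning in <think>, final answer in <answer>.
TEMPLATE = re.compile(r"^<think>.*?</think>\s*<answer>(.*?)</answer>$", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the think/answer template, else 0.0."""
    return 1.0 if TEMPLATE.match(completion.strip()) else 0.0

def outcome_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the extracted final answer matches the reference.
    Exact string match here; the paper uses rule-based checks instead
    (e.g. math answer verification, executing code against test cases)."""
    m = TEMPLATE.match(completion.strip())
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip() == gold_answer.strip() else 0.0

def total_reward(completion: str, gold_answer: str) -> float:
    # No learned reward model anywhere: just accuracy + format.
    return outcome_reward(completion, gold_answer) + format_reward(completion)
```

The point the note above makes is that nothing learned sits in the reward, which is what keeps the setup simple and hard to hack.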

Failures
PRM: requires defining fine-grained "steps," labeling whether each intermediate step is correct is hard, and a learned PRM invites reward hacking. (Toy per-step scoring sketch below.)
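
To make the "define steps" complaint concrete, a toy process-reward scorer under my own assumptions: the newline-based `split_steps` and the `prm_score` callable are hypothetical stand-ins, which is the point of the criticism, since the step boundary is arbitrary, per-step labels are expensive to collect, and a learned scorer can be gamed.

```python
from typing import Callable, List

def split_steps(chain_of_thought: str) -> List[str]:
    # Hypothetical step boundary: one "step" per non-empty line.
    # There is no canonical definition of a reasoning step, which is problem #1.
    return [line for line in chain_of_thought.splitlines() if line.strip()]

def process_reward(
    chain_of_thought: str,
    prm_score: Callable[[str], float],  # learned step scorer (hard to get labels for)
) -> float:
    """Average per-step score; min or product are other common aggregations."""
    steps = split_steps(chain_of_thought)
    if not steps:
        return 0.0
    return sum(prm_score(step) for step in steps) / len(steps)
```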

MCTS: the stated issue is that token generation has a far larger search space than games like chess, and the fine-grained value model that guides each search step is hard to train, so the search can't iteratively improve the policy; I'm still not fully sure what that means in practice. (Toy value-guided search sketch below.)
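
How I read the failure (hedged, since the note above is already unsure): during search, a value model has to score partial generations to decide which branches to expand, and a value model that is accurate on arbitrary reasoning prefixes is hard to train, so search quality is capped by it. A toy value-guided expansion loop, greatly simplified from real MCTS (no tree statistics or backup), with every callable hypothetical:

```python
from typing import Callable, List, Tuple

def value_guided_search(
    prompt: str,
    propose: Callable[[str, int], List[str]],  # policy: prefix -> k candidate continuations
    value: Callable[[str], float],             # value model: partial solution -> estimated quality
    is_terminal: Callable[[str], bool],        # has the solution finished?
    beam: int = 4,
    k: int = 8,
    max_depth: int = 16,
) -> str:
    """Keep the `beam` partial solutions the value model likes best at each depth.
    If `value` is noisy on partial reasoning, the search amplifies its mistakes."""
    frontier: List[Tuple[float, str]] = [(0.0, prompt)]
    for _ in range(max_depth):
        candidates: List[Tuple[float, str]] = []
        for _, prefix in frontier:
            if is_terminal(prefix):
                candidates.append((value(prefix), prefix))
                continue
            for continuation in propose(prefix, k):
                extended = prefix + continuation
                candidates.append((value(extended), extended))
        candidates.sort(key=lambda scored: scored[0], reverse=True)
        frontier = candidates[:beam]
        if all(is_terminal(prefix) for _, prefix in frontier):
            break
    return max(frontier, key=lambda scored: scored[0])[1]
```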
