Paper

Failures

PRM (process reward model): you need to define what counts as a “step,” the steps are hard to label, and the reward model is open to reward hacking.
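A toy sketch of two of these problems, not anything taken from the paper. `toy_step_reward` is a made-up stand-in for a learned PRM, and both splitting rules are arbitrary choices of mine. The same solution gets a different process reward depending on how "step" is defined, and a junk step that matches the scorer's surface pattern still gets full reward.

```python
import re

def split_by_newline(solution: str) -> list[str]:
    # One possible definition of a "step": each non-empty line.
    return [s.strip() for s in solution.split("\n") if s.strip()]

def split_by_sentence(solution: str) -> list[str]:
    # Another possible definition: each sentence.
    parts = re.split(r"(?<=[.!?])\s+", solution)
    return [s.strip() for s in parts if s.strip()]

def toy_step_reward(step: str) -> float:
    # Hypothetical stand-in for a learned PRM: it only checks for "=".
    return 1.0 if "=" in step else 0.0

def process_reward(steps: list[str]) -> float:
    # Aggregate per-step rewards by taking the mean.
    return sum(toy_step_reward(s) for s in steps) / len(steps)

solution = "Let x be the number. Then 2x = 10.\nSo x = 5."

by_line = split_by_newline(solution)       # 2 steps
by_sentence = split_by_sentence(solution)  # 3 steps

print(len(by_line), process_reward(by_line))
print(len(by_sentence), process_reward(by_sentence))

# Reward hacking in miniature: a meaningless step still earns full reward.
print(toy_step_reward("= = ="))
```

The scorer here is deliberately crude. A real PRM is a trained model, but the same two issues apply: the score depends on the step boundary you pick, and a policy can learn to hit whatever the scorer responds to.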

MCTS: the paper just says it’s hard to train a value model; I’m not sure what that means in practice.
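One possible reading, sketched under my own assumptions rather than taken from the paper: in value-guided MCTS, each new node is scored by a learned value model and that score is backed up the tree, so selection is only as good as those estimates. The code below is generic UCT selection. `Node`, `build_tree`, and the dictionary-based "value models" are hypothetical names for illustration.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    visits: int = 0
    value_sum: float = 0.0
    children: dict = field(default_factory=dict)

def uct_score(parent: Node, child: Node, c: float = 1.4) -> float:
    # Standard UCT: mean backed-up value plus an exploration bonus.
    if child.visits == 0:
        return math.inf
    mean_value = child.value_sum / child.visits
    return mean_value + c * math.sqrt(math.log(parent.visits) / child.visits)

def select_child(parent: Node, c: float = 1.4) -> str:
    # Pick the child key with the highest UCT score.
    return max(parent.children, key=lambda k: uct_score(parent, parent.children[k], c))

def evaluate_and_backup(path: list, leaf_key: str, value_model) -> None:
    # The value model scores the leaf; the score is added along the path.
    value = value_model(leaf_key)
    for node in path:
        node.visits += 1
        node.value_sum += value

def build_tree(value_model) -> Node:
    # Expand two candidate continuations and evaluate each once.
    root = Node(children={"a": Node(), "b": Node()})
    for key, child in root.children.items():
        evaluate_and_backup([root, child], key, value_model)
    return root

# Hypothetical value models: assume "a" is the truly better continuation.
accurate = {"a": 0.9, "b": 0.1}.get
miscalibrated = {"a": 0.1, "b": 0.9}.get

print(select_child(build_tree(accurate)))
print(select_child(build_tree(miscalibrated)))
```

With the same tree and the same search rule, swapping in a miscalibrated value model sends the search down the worse branch. That is at least one sense in which the quality of the value model would matter.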