Paper

image.png

image.png

image.png

Their GRPO had some serious instability though, so they also feature Routing Replay to reduce forward sensitivity

image.png