Paper
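- Per-token importance sampling is odd: an importance weight is meant to correct an expectation taken over many samples, but each token position only ever has one sample
- So they use a sequence-level importance ratio instead, i.e. one ratio that is constant across all tokens of a sequence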
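
To make the contrast in the two bullets concrete, here is a minimal sketch of a per-token ratio versus a single per-sequence ratio. The function names and log-prob values are made up for illustration, and the length normalization (taking the geometric mean of the token ratios) is an assumption that the notes do not state, so it is left as a switch.

```python
import math

def token_level_ratios(new_logps, old_logps):
    """One importance ratio per token: pi_new(y_t | ...) / pi_old(y_t | ...)."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]

def sequence_level_ratio(new_logps, old_logps, length_normalize=True):
    """One ratio for the whole sequence, shared by every token in it.

    length_normalize=True is an assumption (geometric mean of token ratios);
    with False this is the plain ratio of the two sequence likelihoods.
    """
    log_ratio = sum(new_logps) - sum(old_logps)
    if length_normalize:
        log_ratio /= len(new_logps)
    return math.exp(log_ratio)

# Hypothetical per-token log-probs of one sampled response
# under the old (sampling) policy and the updated policy.
old_logps = [-1.0, -2.0, -0.5, -1.5]
new_logps = [-0.8, -2.4, -0.4, -1.2]

per_token = token_level_ratios(new_logps, old_logps)       # 4 different weights
per_sequence = sequence_level_ratio(new_logps, old_logps)  # 1 shared weight

print(per_token)
print(per_sequence)
```

With the per-token version, each weight comes from a single draw at that position, so one noisy token can swing its own weight a lot. The sequence-level version pools the log-ratios first and then applies the same weight to every token.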



Their GRPO training was seriously unstable, though, so they also use Routing Replay to make it less sensitive to changes in the forward pass
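
The notes do not spell out what Routing Replay does. Assuming it refers to mixture-of-experts routing, where the experts chosen under the old policy are cached and then reused when the updated policy recomputes likelihoods, a toy sketch might look like the following. Every name and number here is hypothetical, and this is not claimed to be the paper's implementation.

```python
def top_k_experts(gate_logits, k):
    """Indices of the k largest gate logits, i.e. the experts a token is sent to."""
    order = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    return sorted(order[:k])

def route(gate_logits, k, replay=None):
    """Pick experts from the gate, or reuse a cached choice if one is given."""
    return replay if replay is not None else top_k_experts(gate_logits, k)

# Hypothetical gate logits for one token under the old and the updated policy.
old_gate = [2.0, 1.9, 0.1, -1.0]
new_gate = [1.8, 2.1, 0.1, 1.95]

cached = route(old_gate, k=2)                   # experts used by the old policy
drifted = route(new_gate, k=2)                  # fresh routing after the update
replayed = route(new_gate, k=2, replay=cached)  # routing forced to match the cache

print(cached, drifted, replayed)
```

In this toy case a small change in the gate logits swaps one of the two active experts, so the old and new forward passes would run through different parameters. Replaying the cached choice keeps both passes on the same experts.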
