Reward functions can be hard to design, but expert demonstrations are often available. Instead of running SFT on the demonstrations, run RL with a reward determined by a critic that tries to distinguish policy responses from expert responses. The policy and critic share weights, the critic judges via pairwise comparisons, and both are trained together.
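
A minimal sketch of how a pairwise critic judgment could be turned into rewards. This is only an illustration, not the paper's implementation: the names `Comparison`, `make_comparison`, and `rewards` are hypothetical, and the zero-sum 0/1 reward scheme is an assumption rather than something stated in the note above.

```python
import random
from dataclasses import dataclass


@dataclass
class Comparison:
    """One pairwise query shown to the critic (hypothetical structure)."""
    prompt: str
    first: str
    second: str
    expert_is_first: bool  # hidden label, never shown to the critic


def make_comparison(prompt: str, expert: str, policy: str,
                    rng: random.Random) -> Comparison:
    # Randomize the order so the critic cannot exploit position.
    expert_is_first = rng.random() < 0.5
    first, second = (expert, policy) if expert_is_first else (policy, expert)
    return Comparison(prompt, first, second, expert_is_first)


def rewards(comparison: Comparison,
            critic_says_first_is_expert: bool) -> tuple[float, float]:
    """Return (policy_reward, critic_reward).

    Assumed zero-sum scheme: the critic scores 1 for correctly picking
    the expert response, and the policy scores 1 when the critic is fooled.
    """
    critic_correct = critic_says_first_is_expert == comparison.expert_is_first
    critic_reward = 1.0 if critic_correct else 0.0
    policy_reward = 1.0 - critic_reward
    return policy_reward, critic_reward


if __name__ == "__main__":
    rng = random.Random(0)
    comparison = make_comparison(
        "What is 2 + 2?", "4", "The answer is 5.", rng
    )
    # In a real setup the guess would come from the critic model;
    # here it is hard-coded for illustration.
    print(rewards(comparison, critic_says_first_is_expert=True))
```

With shared weights, the same model would produce both the policy response and the critic's guess, so both rewards would drive updates to one set of parameters. How the two signals are weighted or batched is left open here.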

https://arxiv.org/pdf/2511.21667