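Reward functions can be hard to design, but suppose you have expert demonstrations. Instead of running SFT on those demonstrations, run RL with a reward determined by a critic that learns to distinguish policy responses from expert responses. The policy and critic share weights, the critic is trained on pairwise comparisons (a policy response against an expert response), and the two are trained together.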
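
A minimal sketch of the pairwise signal, not a definitive implementation. It assumes the critic emits one scalar score per response and that the comparison is a Bradley-Terry style logistic one; the note itself does not specify either. The names `critic_loss` and `policy_reward` are hypothetical, and the shared-weight model that would produce the scores is left out.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def critic_loss(expert_score: float, policy_score: float) -> float:
    """Pairwise logistic loss for the critic (assumed form).

    Equals -log(sigmoid(expert_score - policy_score)), so it is small
    when the critic scores the expert response above the policy response.
    """
    return softplus(policy_score - expert_score)


def policy_reward(expert_score: float, policy_score: float) -> float:
    """Reward for the policy (assumed form).

    The critic's probability that the policy response is the preferred
    one of the pair; it rises as the policy response becomes harder to
    tell apart from, or is scored above, the expert response.
    """
    return sigmoid(policy_score - expert_score)


if __name__ == "__main__":
    # Hypothetical critic scores for one (expert, policy) pair.
    expert_score, policy_score = 1.5, -0.5
    print("critic loss:", critic_loss(expert_score, policy_score))
    print("policy reward:", policy_reward(expert_score, policy_score))
```

Under these assumptions the two objectives pull in opposite directions on the same pair: the critic update lowers `critic_loss`, while the RL update raises `policy_reward`. With shared weights, both gradients would flow into the same parameters during joint training.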