https://arxiv.org/abs/2009.04416

Similar:

Alternates between (1) branching off a policy and training it with RL at a low learning rate, and (2) distilling that policy back into the main network, together with various auxiliary objectives, at a high learning rate.
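A minimal sketch of that alternation, to make the two phases concrete. Everything here is an assumption made for illustration and is not taken from the linked paper or the method being described: the "network" is a table of logits plus one scalar value estimate, the task is a 3-armed bandit, the RL step is plain REINFORCE, the only auxiliary objective is a value regression, and the names `rl_phase`, `distill_phase`, and `train` and all hyperparameters are hypothetical.

```python
import copy
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rl_phase(main, arm_rewards, rng, lr, steps):
    """Phase 1: branch off a copy of the main policy and train it with REINFORCE (low LR)."""
    branch = copy.deepcopy(main)
    rewards = []
    for _ in range(steps):
        probs = softmax(branch["logits"])
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = arm_rewards[action] + rng.gauss(0.0, 0.1)
        rewards.append(reward)
        # The main network's value estimate serves as a fixed baseline.
        advantage = reward - branch["value"]
        # Gradient of log pi(action) w.r.t. the logits is onehot(action) - probs.
        for i, p in enumerate(probs):
            grad_logp = (1.0 if i == action else 0.0) - p
            branch["logits"][i] += lr * advantage * grad_logp
    return branch, rewards


def distill_phase(main, branch, rewards, lr, steps):
    """Phase 2: distill the branch into the main network, plus an auxiliary loss (high LR)."""
    target = softmax(branch["logits"])
    mean_reward = sum(rewards) / len(rewards)
    for _ in range(steps):
        probs = softmax(main["logits"])
        # Gradient of cross-entropy(target, probs) w.r.t. the logits is probs - target.
        for i in range(len(probs)):
            main["logits"][i] -= lr * (probs[i] - target[i])
        # Hypothetical auxiliary objective: regress the value estimate
        # toward the mean reward observed by the branch.
        main["value"] -= lr * (main["value"] - mean_reward)


def train(arm_rewards, rounds=30, seed=0):
    rng = random.Random(seed)
    main = {"logits": [0.0] * len(arm_rewards), "value": 0.0}
    for _ in range(rounds):
        branch, rewards = rl_phase(main, arm_rewards, rng, lr=0.05, steps=200)
        distill_phase(main, branch, rewards, lr=0.5, steps=20)
    return main
```

Calling `train([0.1, 0.9, 0.3])` returns the main network's logits and value estimate after 30 rounds. The branch is discarded after each round, so only the distilled main network carries state forward.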
