Paper
- off-policy data breaks PG: it no longer converges to the optimal policy.
- what even is the optimum of a KL$[\pi \,\|\, \pi_{\text{ref}}]$-regularized PG objective? $\pi^*(y|x) \propto \pi_{\text{ref}}(y|x) \exp\left(\frac{r(x,y)}{\beta}\right)$
- if you plug this into the “regularized reward” $R^\pi_\beta(x,y) = r(x,y) - \beta \log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)}$, the $r(x,y)$ terms cancel and you get something constant across all $y$ (derivation below)
- so this gives us another way to find $\pi^*$: instead of maximizing expected reward, find the $\pi$ where $R^\pi_\beta$ is constant across $y$, i.e. minimize its variance
- this works off-policy: $L_\mu(\pi) = \text{Var}_{y \sim \mu}\!\left[R^\pi_\beta(x,y)\right]$ is still minimized at $\pi^*$ for any sampling distribution $\mu$ with broad enough support, even $\mu \ne \pi$ (see the sketch after this list)
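
Making the cancellation explicit (writing $Z_\beta(x)$ for the normalizer of $\pi^*$, a symbol the notes above leave implicit):

$$
\begin{aligned}
R^{\pi^*}_\beta(x,y) &= r(x,y) - \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)}
= r(x,y) - \beta \log \frac{\exp\!\left(r(x,y)/\beta\right)}{Z_\beta(x)} \\
&= r(x,y) - r(x,y) + \beta \log Z_\beta(x)
= \beta \log Z_\beta(x),
\end{aligned}
$$

with $Z_\beta(x) = \sum_y \pi_{\text{ref}}(y|x)\exp\!\left(r(x,y)/\beta\right)$. This is constant in $y$, so $\text{Var}_{y \sim \mu}[R^{\pi^*}_\beta(x,y)] = 0$ under any $\mu$.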
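A minimal PyTorch sketch of the resulting off-policy loss, assuming per-completion sequence log-probs under the trained and reference policies are already computed; the function name, tensor shapes, and the group-of-G-samples setup are illustrative assumptions, not taken from the paper:

```python
import torch

def variance_loss(logp_pi, logp_ref, rewards, beta):
    """Variance of the regularized reward R^pi_beta over G samples y ~ mu
    drawn for the same prompt x (mu can be any off-policy distribution).

    logp_pi:  (G,) log pi(y|x) under the trained policy (requires grad)
    logp_ref: (G,) log pi_ref(y|x) under the frozen reference model
    rewards:  (G,) r(x, y)
    """
    # R^pi_beta(x, y) = r(x, y) - beta * log(pi / pi_ref)
    reg_reward = rewards - beta * (logp_pi - logp_ref)
    # At pi = pi*, reg_reward is constant across y, so its variance is zero;
    # minimizing the variance pushes pi toward pi* regardless of where the
    # samples came from.
    return reg_reward.var(unbiased=True)

# toy usage: G = 4 off-policy completions for one prompt (random placeholders)
logp_pi = torch.randn(4, requires_grad=True)
logp_ref = torch.randn(4)
rewards = torch.randn(4)
loss = variance_loss(logp_pi, logp_ref, rewards, beta=0.1)
loss.backward()
```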

