Refs
TODO
https://awllee.github.io/smc-tutorial/smc-tutorial.html#42
Minimizing KL(q||p) = maximizing expected reward (aka policy gradient) minus a KL penalty
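A sketch of why this holds, under assumptions not stated in these notes: the target is taken to be a reward-tilted prior, p(x) ∝ p_0(x) exp(r(x)/β), where the prior p_0, the reward r, the temperature β > 0, and the normalizer Z are all names introduced here for illustration.

```latex
% Assumed target (not from the notes): p(x) = p_0(x) \exp(r(x)/\beta) / Z
\begin{align*}
\mathrm{KL}(q \,\|\, p)
  &= \mathbb{E}_{q}\!\left[\log q(x) - \log p_0(x) - \tfrac{1}{\beta} r(x)\right] + \log Z \\
  &= \mathrm{KL}(q \,\|\, p_0) - \tfrac{1}{\beta}\,\mathbb{E}_{q}[r(x)] + \log Z .
\end{align*}
% \log Z does not depend on q, so multiplying by -\beta gives
\begin{equation*}
\arg\min_{q}\ \mathrm{KL}(q \,\|\, p)
  = \arg\max_{q}\ \Big( \mathbb{E}_{q}[r(x)] - \beta\,\mathrm{KL}(q \,\|\, p_0) \Big).
\end{equation*}
```

Under that assumption, the first term on the right is the usual policy-gradient objective and the second is the KL penalty toward the prior.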
Twist functions