Minimizing KL(q||p) = maximizing rewards (aka policy gradient) - a KL penalty
Twist functions