Full derivations, mostly verbatim from Joshua Achiam’s super clear lecture, along with some notes:
This trick (the log-derivative trick) gets used all over RL: $\nabla \log f(x) = \frac{ \nabla f(x) }{f(x)}$
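For reference, here is how the trick enters the policy gradient (my reconstruction of the step, following the Spinning Up derivation, with $R(\tau)$ the return of trajectory $\tau$):

$$
\nabla_\theta J(\pi_\theta) = \nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta}\big[ \nabla_\theta \log P(\tau \mid \theta)\, R(\tau) \big],
$$

using $\nabla_\theta P(\tau \mid \theta) = P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)$ under the integral.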
Drop the terms that don't actually depend on $\theta$: these are just the environment's stochasticity, and the environment's behavior doesn't care about or respond to your policy parameters, so their gradient is 0.
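Concretely, writing out the trajectory log-probability (notation as on the Spinning Up site):

$$
\log P(\tau \mid \theta) = \log \rho_0(s_0) + \sum_{t=0}^{T} \Big[ \log P(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t) \Big].
$$

The initial-state distribution $\rho_0$ and the transition probabilities $P(s_{t+1} \mid s_t, a_t)$ contain no $\theta$, so their gradients vanish and only the $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ terms survive.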
Variance is high because many terms have expectation 0 (they contribute nothing on average) but get sampled anyway, adding noise to the gradient estimate. What are these terms? See the next slide.
Remove these terms that average out to 0: rewards earned before time $t$ can't be affected by the action taken at time $t$, so keeping them only adds noise.
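That leaves the reward-to-go form of the gradient (as I understand the slide, matching Spinning Up's notation):

$$
\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T} R(s_{t'}, a_{t'}, s_{t'+1}) \right].
$$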
Now we can rewrite the reward-to-go weight as $Q^\pi$ by splitting the expectation into the parts before and after $t$: the after-$t$ part is exactly the action-value. This step is a bit mathier; see the Spinning Up site for details.
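The resulting form (again, see Spinning Up for the full argument):

$$
\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t) \right].
$$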
Now introduce baselines. (Set up for the next slide.) The baseline $b$ could actually be any function of the state, since with respect to the action it's just a constant. Again this uses the log-derivative trick.
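The key fact (the EGLP lemma, in Spinning Up's terminology): for any $b$ that depends only on the state, the subtracted term has zero expectation, so the gradient is unchanged. A sketch:

$$
\mathbb{E}_{a \sim \pi_\theta}\big[ \nabla_\theta \log \pi_\theta(a \mid s)\, b(s) \big] = b(s) \int \nabla_\theta \pi_\theta(a \mid s)\, da = b(s)\, \nabla_\theta \int \pi_\theta(a \mid s)\, da = b(s)\, \nabla_\theta 1 = 0.
$$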
Let's set the baseline to the state-value function. Why reframe in terms of advantages? If the only two returns are +100 and +101, we want to push toward +101; relative to the +100.5 average, +100 is like losing 0.5.
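With $b(s_t) = V^{\pi}(s_t)$, the weight on each log-prob term becomes the advantage,

$$
A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t),
$$

i.e., how much better this action is than what the policy does on average from that state.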
Review. Basically we've just shown multiple formulations, but in practice we usually use the last one (formulated in terms of advantages).
Another view, from Schulman et al., 2016:
Review—this is the final gradient in vanilla policy gradient:
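Written out, together with its sample estimate over a batch of trajectories $\mathcal{D}$ (my notation, following Spinning Up):

$$
\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi_\theta}(s_t, a_t) \right] \approx \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t.
$$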
How do we get the baseline? Learn it from data with a function approximator: this is another neural network, with parameters $\phi$.
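A minimal sketch of that fit, assuming PyTorch. `value_net`, `obs_dim`, and `fit_value_function` are names I made up, and the regression target (empirical returns-to-go with a mean-squared-error loss) is the standard choice in Spinning Up's VPG, not necessarily exactly what the lecture shows:

```python
import torch
import torch.nn as nn

obs_dim = 8  # placeholder observation dimension for this sketch

# V_phi: a small MLP mapping a state vector to a scalar value estimate.
value_net = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 1),
)
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def fit_value_function(obs, returns_to_go, iters=80):
    """Regress V_phi(s_t) onto the empirical returns-to-go with an MSE loss."""
    obs = torch.as_tensor(obs, dtype=torch.float32)
    target = torch.as_tensor(returns_to_go, dtype=torch.float32)
    for _ in range(iters):
        optimizer.zero_grad()
        pred = value_net(obs).squeeze(-1)
        loss = ((pred - target) ** 2).mean()
        loss.backward()
        optimizer.step()
```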
[See also section on actor critic / variance reduction] How do we calculate the advantage function, given data and the NN for $V$? We could use the full trajectory as the sample estimate for $Q$, but instead we use only the first $n$ steps of reward and the value estimator for the rest. With low $n$, whatever's wrong with our $V$ will also be wrong with our estimate, but the only variance comes from the immediate rewards (and the next-state stochasticity). With infinite $n$, you accept all the variance the environment produced in this particular sample, but in expectation it leads you to exactly $Q$. In practice the $n$-step estimators are blended with exponential weights (GAE, from Schulman et al., 2016), and the blending parameter $\lambda$ is usually set between 0.9 and 0.97; see the sketch below.
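Here is a sketch of that blending as generalized advantage estimation (GAE) from Schulman et al., 2016: each term uses a one-step TD error, and $\lambda$ controls how far the n-step lookahead effectively extends. Variable names are mine; assumes NumPy.

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE for a single trajectory.

    rewards: length-T array of rewards r_t
    values:  length-(T+1) array of V_phi(s_t); the last entry is a bootstrap
             value (0 if the trajectory ended in a terminal state)
    lam:     the GAE lambda; lam=0 gives the 1-step (pure bootstrap) advantage,
             lam=1 gives the full Monte Carlo estimate
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # One-step TD error: low variance, but inherits any bias in V_phi.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```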