• Policy gradient math, leading up to vanilla policy gradient (direct gradient ascent) and various formulations of $\nabla_\theta J(\theta)$
    • Full derivations, mostly verbatim from Joshua Achiam’s super clear lecture, along with some notes:

    • Use this trick all over RL: $\nabla \log f(x) = \frac{ \nabla f(x) }{f(x)}$

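      A sketch of the step this trick enables (standard derivation, reconstructed here since the slide isn't embedded): push the gradient inside the expectation over trajectories.

      $$
      \nabla_\theta J(\theta)
      = \nabla_\theta \int P(\tau|\theta)\, R(\tau)\, d\tau
      = \int P(\tau|\theta)\, \nabla_\theta \log P(\tau|\theta)\, R(\tau)\, d\tau
      = \mathbb{E}_{\tau \sim \pi_\theta}\big[ \nabla_\theta \log P(\tau|\theta)\, R(\tau) \big]
      $$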

    • Drop the terms that don't actually depend on $\theta$: the initial-state distribution and the transition dynamics are just the environment's stochasticity, and the environment doesn't respond to your policy, so their gradient w.r.t. $\theta$ is 0

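      A sketch of which terms drop (standard decomposition of the trajectory log-probability, not copied from the slide; $\rho_0$ is the initial-state distribution): only the policy terms depend on $\theta$.

      $$
      \log P(\tau|\theta) = \log \rho_0(s_0) + \sum_{t=0}^{T} \Big[ \log P(s_{t+1}\mid s_t, a_t) + \log \pi_\theta(a_t\mid s_t) \Big]
      \;\Rightarrow\;
      \nabla_\theta \log P(\tau|\theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t\mid s_t)
      $$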

    • Variance is high because there will be many terms that have expectation 0 (they don't contribute anything in expectation) but are sampled anyway, adding noise to the gradient estimate along the optimization journey. What are these terms? See next slide

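      My reconstruction of the culprit terms (consistent with the Spinning Up derivation): $R(\tau)$ pairs each $\nabla_\theta \log \pi_\theta(a_t|s_t)$ with rewards earned before time $t$, and those cross-terms vanish in expectation because the expected grad-log-prob is 0.

      $$
      \mathbb{E}_{a_t \sim \pi_\theta}\big[ \nabla_\theta \log \pi_\theta(a_t\mid s_t) \big] = 0
      \quad\Rightarrow\quad
      \mathbb{E}\big[ \nabla_\theta \log \pi_\theta(a_t\mid s_t)\, r_{t'} \big] = 0 \;\text{ for } t' < t
      $$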

    • Remove these terms that average out to 0: the rewards earned before time $t$, which the action at time $t$ cannot affect. What's left is the reward-to-go.

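      The resulting reward-to-go form (standard; written out here since the slide isn't embedded):

      $$
      \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t\mid s_t) \sum_{t'=t}^{T} r_{t'} \right]
      $$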

    • Now we can rewrite this in terms of $Q^\pi$ by splitting up the expectation into the part before $t$ and the part after $t$. The after-$t$ part is exactly the action-value! This step is a bit mathier; consult the Spinning Up site for details.

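      The resulting action-value form (the inner expectation over the future, conditioned on $(s_t, a_t)$, is $Q^\pi$ by definition):

      $$
      \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t\mid s_t)\, Q^{\pi_\theta}(s_t, a_t) \right]
      $$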

    • Now introduce baselines. (Setup for the next slide.) The baseline $b$ could actually be any function of the state alone: since it doesn't depend on the action, it behaves like a constant inside the expectation over actions and contributes nothing, again by the log-derivative trick.

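      Why any state-only baseline can be subtracted (a sketch of the identity, using the same log-derivative argument):

      $$
      \mathbb{E}_{a_t \sim \pi_\theta}\big[ \nabla_\theta \log \pi_\theta(a_t\mid s_t)\, b(s_t) \big]
      = b(s_t) \int \nabla_\theta \pi_\theta(a_t\mid s_t)\, da_t
      = b(s_t)\, \nabla_\theta \int \pi_\theta(a_t\mid s_t)\, da_t
      = b(s_t)\, \nabla_\theta 1 = 0
      $$

      so $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[ \sum_t \nabla_\theta \log \pi_\theta(a_t\mid s_t)\, \big( Q^{\pi_\theta}(s_t, a_t) - b(s_t) \big) \big]$.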

    • Let’s set the baseline to the state-value function. Why reframe in terms of advantages? If the only two achievable returns are +100 and +101, we still want to push toward +101; relative to the baseline, getting +100 is like losing 0.5.

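      With $b(s_t) = V^{\pi}(s_t)$, the weight on each log-prob gradient becomes the advantage (a sketch; in the +100/+101 example, if both outcomes are equally likely the baseline is about 100.5, so the advantages are roughly $-0.5$ and $+0.5$):

      $$
      A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t),
      \qquad
      \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t\mid s_t)\, A^{\pi_\theta}(s_t, a_t) \right]
      $$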

    • Review. Basically we’ve just shown multiple formulations, but we most often use the last one (formulated in terms of advantages).

    • Another view, from Schulman et al., 2016:

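      The unifying template from that paper, as I read it (reconstructed from the paper rather than the slide): all of the above are one formula with different choices of the weight $\Psi_t$.

      $$
      \nabla_\theta J(\theta) = \mathbb{E}\left[ \sum_{t} \Psi_t\, \nabla_\theta \log \pi_\theta(a_t\mid s_t) \right]
      $$

      where $\Psi_t$ can be the total return, the reward-to-go, a baselined reward-to-go, $Q^\pi(s_t, a_t)$, the advantage $A^\pi(s_t, a_t)$, or the TD residual $r_t + \gamma V(s_{t+1}) - V(s_t)$.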

    • Review—this is the final gradient in vanilla policy gradient:

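      And the sample-based estimate actually used for the update, for a batch $\mathcal{D}$ of trajectories collected with the current policy and advantage estimates $\hat{A}_t$ (standard VPG estimator, written out since the slide isn't embedded):

      $$
      \hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t\mid s_t)\, \hat{A}_t
      $$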

    • How to get the baseline? Learn it from data with a function approximator: another NN, $V_\phi$, with parameters $\phi$.

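      A sketch of the usual fitting procedure: regress $V_\phi$ onto the empirical reward-to-go $\hat{R}_t$ from the collected trajectories by minimizing a mean-squared error,

      $$
      \phi \leftarrow \arg\min_\phi \; \mathbb{E}_{s_t, \hat{R}_t \sim \pi_\theta} \Big[ \big( V_\phi(s_t) - \hat{R}_t \big)^2 \Big]
      $$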

    • [See also section on actor critic / variance reduction] How to calculate the advantage function, given data and the NN for $V$? We could use the full sampled trajectory as the estimate of $Q$, but instead we use only the first $n$ steps of reward and let the value estimator stand in for the rest. Low $n$ means whatever’s wrong with our $V$ will be wrong with our estimate, but the only variance comes from the immediate rewards (and the next-state stochasticity). With infinite $n$ you accept all the variance the environment produced in this particular sample, but in expectation it leads you to exactly $Q$. In practice the $n$-step estimates are blended with an exponential decay $\lambda$ (the GAE formulation), and that hyperparameter is usually set between 0.9 and 0.97. (Sketch below.)

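      A minimal numpy sketch of both estimators discussed above (hypothetical helper names; assumes a single complete episode with the terminal value taken to be 0):

      ```python
      import numpy as np

      def n_step_advantages(rewards, values, gamma=0.99, n=5):
          """n-step advantage: use the first n rewards, bootstrap the rest with V.
          rewards: [r_0, ..., r_{T-1}]; values: [V(s_0), ..., V(s_{T-1})]."""
          T = len(rewards)
          adv = np.zeros(T)
          for t in range(T):
              ret, discount = 0.0, 1.0
              for k in range(t, min(t + n, T)):
                  ret += discount * rewards[k]
                  discount *= gamma
              if t + n < T:
                  ret += discount * values[t + n]  # value estimate stands in for the tail
              adv[t] = ret - values[t]             # subtract the baseline V(s_t)
          return adv

      def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
          """GAE(lambda): exponentially weighted average over all n-step estimates."""
          T = len(rewards)
          values = np.append(values, 0.0)          # V(s_T) = 0 at episode end
          adv = np.zeros(T)
          running = 0.0
          for t in reversed(range(T)):
              delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
              running = delta + gamma * lam * running
              adv[t] = running
          return adv
      ```

      With $\lambda = 0$ this reduces to the one-step estimate (low variance, all of $V$'s bias); with $\lambda = 1$ it is the full sampled return minus $V(s_t)$ (no bias from $V$, all of the sample's variance).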