• Policy gradient math, leading up to vanilla policy gradient (direct gradient ascent) and various formulations of $\nabla_\theta J(\theta)$
    • Full derivations, mostly verbatim from Joshua Achiam’s super clear lecture, along with some notes:

    • Use this trick all over RL: $\nabla \log f(x) = \frac{ \nabla f(x) }{f(x)}$

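      A sketch of the step this trick enables (standard derivation, reconstructed here since the slide isn't embedded): push the gradient inside the expectation over trajectories.

      $$
      \nabla_\theta J(\theta)
      = \nabla_\theta \int P(\tau|\theta)\, R(\tau)\, d\tau
      = \int P(\tau|\theta)\, \nabla_\theta \log P(\tau|\theta)\, R(\tau)\, d\tau
      = \mathbb{E}_{\tau \sim \pi_\theta}\big[ \nabla_\theta \log P(\tau|\theta)\, R(\tau) \big]
      $$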

    • Drop the terms that don't actually depend on $\theta$: the initial-state distribution and the transition dynamics are just the environment's stochasticity, and the environment doesn't respond to your policy, so their gradient w.r.t. $\theta$ is 0

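      A sketch of which terms drop (standard decomposition of the trajectory log-probability, not copied from the slide; $\rho_0$ is the initial-state distribution): only the policy terms depend on $\theta$.

      $$
      \log P(\tau|\theta) = \log \rho_0(s_0) + \sum_{t=0}^{T} \Big[ \log P(s_{t+1}\mid s_t, a_t) + \log \pi_\theta(a_t\mid s_t) \Big]
      \;\Rightarrow\;
      \nabla_\theta \log P(\tau|\theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t\mid s_t)
      $$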

    • Variance is high because there will be many terms that have expectation 0 (they don't contribute anything in expectation) but are sampled anyway, adding noise to the gradient estimate along the optimization journey. What are these terms? See next slide

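      My reconstruction of the culprit terms (consistent with the Spinning Up derivation): $R(\tau)$ pairs each $\nabla_\theta \log \pi_\theta(a_t|s_t)$ with rewards earned before time $t$, and those cross-terms vanish in expectation because the expected grad-log-prob is 0.

      $$
      \mathbb{E}_{a_t \sim \pi_\theta}\big[ \nabla_\theta \log \pi_\theta(a_t\mid s_t) \big] = 0
      \quad\Rightarrow\quad
      \mathbb{E}\big[ \nabla_\theta \log \pi_\theta(a_t\mid s_t)\, r_{t'} \big] = 0 \;\text{ for } t' < t
      $$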

    • Remove these terms that average out to 0: the rewards earned before time $t$, which the action at time $t$ cannot affect. What's left is the reward-to-go.

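      The resulting reward-to-go form (standard; written out here since the slide isn't embedded):

      $$
      \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t\mid s_t) \sum_{t'=t}^{T} r_{t'} \right]
      $$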

    • Now we can rewrite this in terms of $Q^\pi$ by splitting up the expectation into the part before $t$ and the part after $t$. The after-$t$ part is exactly the action-value! This step is a bit mathier; consult the Spinning Up site for details.

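      The resulting action-value form (the inner expectation over the future, conditioned on $(s_t, a_t)$, is $Q^\pi$ by definition):

      $$
      \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t\mid s_t)\, Q^{\pi_\theta}(s_t, a_t) \right]
      $$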

    • Now introduce baselines. (Setup for the next slide.) The baseline $b$ could actually be any function of the state alone: since it doesn't depend on the action, it behaves like a constant inside the expectation over actions and contributes nothing, again by the log-derivative trick.

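      Why any state-only baseline can be subtracted (a sketch of the identity, using the same log-derivative argument):

      $$
      \mathbb{E}_{a_t \sim \pi_\theta}\big[ \nabla_\theta \log \pi_\theta(a_t\mid s_t)\, b(s_t) \big]
      = b(s_t) \int \nabla_\theta \pi_\theta(a_t\mid s_t)\, da_t
      = b(s_t)\, \nabla_\theta \int \pi_\theta(a_t\mid s_t)\, da_t
      = b(s_t)\, \nabla_\theta 1 = 0
      $$

      so $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[ \sum_t \nabla_\theta \log \pi_\theta(a_t\mid s_t)\, \big( Q^{\pi_\theta}(s_t, a_t) - b(s_t) \big) \big]$.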

    • Let’s set the baseline to the state-value function. Why reframe in terms of advantages? If the only two achievable returns are +100 and +101, we still want to push toward +101; relative to the baseline, getting +100 is like losing 0.5.

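      With $b(s_t) = V^{\pi}(s_t)$, the weight on each log-prob gradient becomes the advantage (a sketch; in the +100/+101 example, if both outcomes are equally likely the baseline is about 100.5, so the advantages are roughly $-0.5$ and $+0.5$):

      $$
      A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t),
      \qquad
      \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t\mid s_t)\, A^{\pi_\theta}(s_t, a_t) \right]
      $$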

    • Review. Basically we’ve just shown multiple formulations, but we most often use the last one (formulated in terms of advantages).

    • Another view, from Schulman et al., 2016:

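      The unifying template from that paper, as I read it (reconstructed from the paper rather than the slide): all of the above are one formula with different choices of the weight $\Psi_t$.

      $$
      \nabla_\theta J(\theta) = \mathbb{E}\left[ \sum_{t} \Psi_t\, \nabla_\theta \log \pi_\theta(a_t\mid s_t) \right]
      $$

      where $\Psi_t$ can be the total return, the reward-to-go, a baselined reward-to-go, $Q^\pi(s_t, a_t)$, the advantage $A^\pi(s_t, a_t)$, or the TD residual $r_t + \gamma V(s_{t+1}) - V(s_t)$.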

    • Review—this is the final gradient in vanilla policy gradient:

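      And the sample-based estimate actually used for the update, for a batch $\mathcal{D}$ of trajectories collected with the current policy and advantage estimates $\hat{A}_t$ (standard VPG estimator, written out since the slide isn't embedded):

      $$
      \hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t\mid s_t)\, \hat{A}_t
      $$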

    • How to get the baseline? Learn it from data with a function approximator: another NN, $V_\phi$, with parameters $\phi$.

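      A sketch of the usual fitting procedure: regress $V_\phi$ onto the empirical reward-to-go $\hat{R}_t$ from the collected trajectories by minimizing a mean-squared error,

      $$
      \phi \leftarrow \arg\min_\phi \; \mathbb{E}_{s_t, \hat{R}_t \sim \pi_\theta} \Big[ \big( V_\phi(s_t) - \hat{R}_t \big)^2 \Big]
      $$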

    • [See also section on actor critic / variance reduction] How to calculate the advantage function, given data and the NN for $V$? We could use the full sampled trajectory as the estimate of $Q$, but instead we use only the first $n$ steps of reward and let the value estimator stand in for the rest. Low $n$ means whatever’s wrong with our $V$ will be wrong with our estimate, but the only variance comes from the immediate rewards (and the next-state stochasticity). With infinite $n$ you accept all the variance the environment produced in this particular sample, but in expectation it leads you to exactly $Q$. In practice the $n$-step estimates are blended with an exponential decay $\lambda$ (the GAE formulation), and that hyperparameter is usually set between 0.9 and 0.97. (Sketch below.)

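      A minimal numpy sketch of both estimators discussed above (hypothetical helper names; assumes a single complete episode with the terminal value taken to be 0):

      ```python
      import numpy as np

      def n_step_advantages(rewards, values, gamma=0.99, n=5):
          """n-step advantage: use the first n rewards, bootstrap the rest with V.
          rewards: [r_0, ..., r_{T-1}]; values: [V(s_0), ..., V(s_{T-1})]."""
          T = len(rewards)
          adv = np.zeros(T)
          for t in range(T):
              ret, discount = 0.0, 1.0
              for k in range(t, min(t + n, T)):
                  ret += discount * rewards[k]
                  discount *= gamma
              if t + n < T:
                  ret += discount * values[t + n]  # value estimate stands in for the tail
              adv[t] = ret - values[t]             # subtract the baseline V(s_t)
          return adv

      def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
          """GAE(lambda): exponentially weighted average over all n-step estimates."""
          T = len(rewards)
          values = np.append(values, 0.0)          # V(s_T) = 0 at episode end
          adv = np.zeros(T)
          running = 0.0
          for t in reversed(range(T)):
              delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
              running = delta + gamma * lam * running
              adv[t] = running
          return adv
      ```

      With $\lambda = 0$ this reduces to the one-step estimate (low variance, all of $V$'s bias); with $\lambda = 1$ it is the full sampled return minus $V(s_t)$ (no bias from $V$, all of the sample's variance).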