References
Taxonomy
Insights - when to use RL?
Definitions
Value-based (learn Q function) vs policy-based (directly optimize the policy)
Both want policy that maximizes return:
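In the standard notation (where $\tau$ is a trajectory sampled by running the policy and $R(\tau)$ is its return), this shared objective is

$$\pi^* = \arg\max_{\pi} J(\pi), \qquad J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \right]$$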
But value-based methods end up with a deterministic (argmax) policy
Visualizations (note: the left shows state values $V$, not $Q$; state-value functions are in fact often used in policy-gradient methods, e.g. as baselines):
Why policy learning?
Trade-offs Between Policy Optimization and Q-Learning. The primary strength of policy optimization methods is that they are principled, in the sense that you directly optimize for the thing you want. This tends to make them stable and reliable. By contrast, Q-learning methods only indirectly optimize for agent performance, by training to satisfy a self-consistency equation. There are many failure modes for this kind of learning, so it tends to be less stable. [1] But, Q-learning methods gain the advantage of being substantially more sample efficient when they do work, because they can reuse data more effectively than policy optimization techniques.
More on disadvantages of policy learning:
More on policy learning vs Q learning
The two families can be mixed/interpolated (e.g., actor-critic-style methods), and are sometimes equivalent.
Monte Carlo vs TD learning (for value-based methods)
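The contrast in one line (standard update rules, with $G_t$ the full sampled return and $\alpha$ a step size):

$$\text{MC:}\quad V(s_t) \leftarrow V(s_t) + \alpha\left[G_t - V(s_t)\right] \qquad\qquad \text{TD(0):}\quad V(s_t) \leftarrow V(s_t) + \alpha\left[r_t + \gamma V(s_{t+1}) - V(s_t)\right]$$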
Q-learning in general
Taxonomy: off-policy, value-based (learn action-value function $Q$), and TD learning
Within the key steps framework:
Want to optimize J (by learning the Q value function and then maximizing over it):
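In standard form: learn $Q^*$, then act by

$$a^*(s) = \arg\max_{a} Q^*(s, a)$$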
Thus only applies to discrete, not continuous, action spaces, due to argmax
Bellman equations: [state] value functions and action-value functions, both on-policy and optimal
These value functions satisfy recursive Bellman equations:
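In their standard form (on-policy versions first, then the optimal versions with the max over actions):

$$V^\pi(s) = \mathbb{E}_{a \sim \pi,\, s' \sim P}\left[r(s,a) + \gamma V^\pi(s')\right] \qquad Q^\pi(s,a) = \mathbb{E}_{s' \sim P}\left[r(s,a) + \gamma\, \mathbb{E}_{a' \sim \pi}\left[Q^\pi(s',a')\right]\right]$$

$$V^*(s) = \max_{a}\, \mathbb{E}_{s' \sim P}\left[r(s,a) + \gamma V^*(s')\right] \qquad Q^*(s,a) = \mathbb{E}_{s' \sim P}\left[r(s,a) + \gamma \max_{a'} Q^*(s',a')\right]$$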
Optimize mean squared Bellman error (compares with Q*):
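Written out (assuming transitions $(s,a,r,s',d)$ sampled from a dataset/replay buffer $\mathcal{D}$, with $d$ the done flag):

$$L(\phi) = \mathbb{E}_{(s,a,r,s',d) \sim \mathcal{D}}\left[\left(Q_\phi(s,a) - \left(r + \gamma (1-d) \max_{a'} Q_\phi(s',a')\right)\right)^2\right]$$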
Epsilon greedy: choose a uniformly random action with probability $\epsilon$ and the argmax (greedy) action with probability $1-\epsilon$; a minimal sketch follows.
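A minimal sketch, assuming a Q-table row `q_values` of shape `[n_actions]` (the function name is illustrative):

```python
import numpy as np

def epsilon_greedy(q_values: np.ndarray, epsilon: float, rng: np.random.Generator) -> int:
    """Explore uniformly at random with probability epsilon; otherwise exploit the argmax."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: uniform random action
    return int(np.argmax(q_values))              # exploit: greedy action
```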
Classic Q-learning algorithm (careful with the notation difference). Notice also how this embodies both the error above and the TD-learning expression further up; a tabular sketch follows.
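A minimal tabular sketch, assuming a Gymnasium-style `env` with discrete observation and action spaces (all names are illustrative):

```python
import numpy as np

def tabular_q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Off-policy TD control: behave epsilon-greedily, bootstrap off the greedy max."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy behavior policy (as sketched above)
            if rng.random() < epsilon:
                a = int(rng.integers(Q.shape[1]))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # TD target: bootstrap with max over next actions (zero at terminal states)
            target = r + gamma * (0.0 if terminated else np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```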
On-policy (Sarsa) vs off-policy (Q-learning aka Sarsamax)
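The only difference is the bootstrap target (standard forms; in Sarsa, $a'$ is the action the behavior policy actually takes next):

$$\text{Sarsa:}\quad Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma\, Q(s',a') - Q(s,a)\right]$$

$$\text{Q-learning:}\quad Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$$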
Deep Q learning: use a DNN function approximator to learn Q
Importantly, don’t propagate gradient through y (see next slide)
Practical implementation concerns: sampling and stability (bootstrapping with function approximators is very unstable; typically values will explode). Standard mitigations are an experience replay buffer (decorrelates samples) and a periodically synced target network; a concrete sketch follows the pseudocode below.
Pseudocode
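To make the "don't propagate gradient through $y$" point concrete, a minimal PyTorch-style sketch of one loss computation (the `q_net`/`target_net` modules and the sampled batch are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """One DQN regression step: fit Q(s,a) toward a frozen TD target y."""
    s, a, r, s_next, done = batch  # tensors sampled from a replay buffer
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s,a) for the taken actions
    with torch.no_grad():  # crucial: no gradient flows through the target y
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, y)
```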
Double DQN: decouple action selection from action evaluation. Classic double Q-learning trains 2 independent networks on disjoint subsets of experience: one is used to select actions, the other to evaluate them. Double DQN itself reuses the networks DQN already has (the online network selects, the target network evaluates); a sketch follows below.
Not to be confused with simply holding the target Q network fixed!
Motivation: the max operator is prone to overestimation, because whatever estimation noise inflates a Q value is exactly what the max picks out.
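A minimal sketch of the changed target (same illustrative names as the DQN sketch above):

```python
import torch

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    """Double DQN: the online net selects the next action, the target net evaluates it."""
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)        # selection (online net)
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)  # evaluation (target net)
        return r + gamma * (1.0 - done) * q_eval
```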
Applied Deep Q learning
Policy optimization background
Taxonomy: on-policy, policy-based, Monte Carlo
Within key steps framework:
Again want policy that maximizes J (using gradient ascent).
“On-policy”: notice all the sampling in what follows is from the policy, $\tau \sim \pi_\theta$
Policy gradient theorem: computing $\nabla_\theta J(\theta)$ directly is tricky because the distribution of visited states itself depends on $\theta$. The theorem reformulates the gradient so that it does not involve the derivative of the state distribution, which simplifies the computation a lot (standard form below).
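In its simplest standard form:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right]$$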
Want:
Policy gradient math, leading up to vanilla policy gradient (direct gradient ascent) and various formulations of $\nabla_\theta J(\theta)$
Full derivations, mostly verbatim from Joshua Achiam’s super clear lecture, along with some notes:
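To accompany the derivation, a minimal PyTorch-style sketch of the resulting vanilla policy gradient (REINFORCE-style) loss for one sampled trajectory; the `policy` module and the trajectory tensors are assumptions for illustration:

```python
import torch

def vanilla_pg_loss(policy, states, actions, returns):
    """Surrogate loss whose gradient is the sampled policy gradient estimate.

    Minimizing -sum_t log pi_theta(a_t|s_t) * R_t is gradient ascent on J(theta).
    `returns` can be R(tau) repeated per step, or the reward-to-go refinement.
    """
    logits = policy(states)  # [T, n_actions] action logits for a discrete policy
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    return -(log_probs * returns).sum()
```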