References
Taxonomy
Insights - when to use RL?
Definitions
Value-based (learn Q function) vs policy-based (directly optimize the policy)
Both want policy that maximizes return:
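In the standard notation (where $\tau$ is a trajectory sampled by running the policy and $R(\tau)$ is its return), this shared objective is

$$\pi^* = \arg\max_{\pi} J(\pi), \qquad J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \right]$$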
But value-based methods end up with a deterministic (argmax) policy
Visualizations (note: the left shows state values $V$, not $Q$; state-value functions are in fact often used in policy-gradient methods, e.g. as baselines):
Why policy learning?
Trade-offs Between Policy Optimization and Q-Learning. The primary strength of policy optimization methods is that they are principled, in the sense that you directly optimize for the thing you want. This tends to make them stable and reliable. By contrast, Q-learning methods only indirectly optimize for agent performance, by training to satisfy a self-consistency equation. There are many failure modes for this kind of learning, so it tends to be less stable. [1] But, Q-learning methods gain the advantage of being substantially more sample efficient when they do work, because they can reuse data more effectively than policy optimization techniques.
More on disadvantages of policy learning:
More on policy learning vs Q learning
The two families can be mixed/interpolated (e.g., actor-critic-style methods), and are sometimes equivalent.
Monte Carlo vs TD learning (for value-based methods)
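The contrast in one line (standard update rules, with $G_t$ the full sampled return and $\alpha$ a step size):

$$\text{MC:}\quad V(s_t) \leftarrow V(s_t) + \alpha\left[G_t - V(s_t)\right] \qquad\qquad \text{TD(0):}\quad V(s_t) \leftarrow V(s_t) + \alpha\left[r_t + \gamma V(s_{t+1}) - V(s_t)\right]$$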
Q-learning in general
Taxonomy: off-policy, value-based (learn action-value function $Q$), and TD learning
Within the key steps framework:
Want to optimize J (by learning the Q value function and then maximizing over it):
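In standard form: learn $Q^*$, then act by

$$a^*(s) = \arg\max_{a} Q^*(s, a)$$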
Thus only applies to discrete, not continuous, action spaces, due to argmax
Bellman equations: [state] value functions and action-value functions, both on-policy and optimal
These value functions satisfy recursive Bellman equations:
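In their standard form (on-policy versions first, then the optimal versions with the max over actions):

$$V^\pi(s) = \mathbb{E}_{a \sim \pi,\, s' \sim P}\left[r(s,a) + \gamma V^\pi(s')\right] \qquad Q^\pi(s,a) = \mathbb{E}_{s' \sim P}\left[r(s,a) + \gamma\, \mathbb{E}_{a' \sim \pi}\left[Q^\pi(s',a')\right]\right]$$

$$V^*(s) = \max_{a}\, \mathbb{E}_{s' \sim P}\left[r(s,a) + \gamma V^*(s')\right] \qquad Q^*(s,a) = \mathbb{E}_{s' \sim P}\left[r(s,a) + \gamma \max_{a'} Q^*(s',a')\right]$$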
Optimize mean squared Bellman error (compares with Q*):
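Written out (assuming transitions $(s,a,r,s',d)$ sampled from a dataset/replay buffer $\mathcal{D}$, with $d$ the done flag):

$$L(\phi) = \mathbb{E}_{(s,a,r,s',d) \sim \mathcal{D}}\left[\left(Q_\phi(s,a) - \left(r + \gamma (1-d) \max_{a'} Q_\phi(s',a')\right)\right)^2\right]$$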
Epsilon greedy: choose a uniformly random action with probability $\epsilon$ and the argmax (greedy) action with probability $1-\epsilon$; a minimal sketch follows.
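A minimal sketch, assuming a Q-table row `q_values` of shape `[n_actions]` (the function name is illustrative):

```python
import numpy as np

def epsilon_greedy(q_values: np.ndarray, epsilon: float, rng: np.random.Generator) -> int:
    """Explore uniformly at random with probability epsilon; otherwise exploit the argmax."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: uniform random action
    return int(np.argmax(q_values))              # exploit: greedy action
```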
Classic Q-learning algorithm (careful with the notation difference). Notice also how this embodies both the error above and the TD-learning expression further up; a tabular sketch follows.
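A minimal tabular sketch, assuming a Gymnasium-style `env` with discrete observation and action spaces (all names are illustrative):

```python
import numpy as np

def tabular_q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Off-policy TD control: behave epsilon-greedily, bootstrap off the greedy max."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy behavior policy (as sketched above)
            if rng.random() < epsilon:
                a = int(rng.integers(Q.shape[1]))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # TD target: bootstrap with max over next actions (zero at terminal states)
            target = r + gamma * (0.0 if terminated else np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```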
On-policy (Sarsa) vs off-policy (Q-learning aka Sarsamax)
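The only difference is the bootstrap target (standard forms; in Sarsa, $a'$ is the action the behavior policy actually takes next):

$$\text{Sarsa:}\quad Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma\, Q(s',a') - Q(s,a)\right]$$

$$\text{Q-learning:}\quad Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$$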
Deep Q learning: use a DNN function approximator to learn Q
Importantly, don’t propagate gradient through y (see next slide)
Practical implementation concerns: sampling and stability (bootstrapping with function approximators is very unstable; typically values will explode). Standard mitigations are an experience replay buffer (decorrelates samples) and a periodically synced target network; a concrete sketch follows the pseudocode below.
Pseudocode
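To make the "don't propagate gradient through $y$" point concrete, a minimal PyTorch-style sketch of one loss computation (the `q_net`/`target_net` modules and the sampled batch are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """One DQN regression step: fit Q(s,a) toward a frozen TD target y."""
    s, a, r, s_next, done = batch  # tensors sampled from a replay buffer
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s,a) for the taken actions
    with torch.no_grad():  # crucial: no gradient flows through the target y
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, y)
```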
Double DQN: decouple action selection from action evaluation. Classic double Q-learning trains 2 independent networks on disjoint subsets of experience: one is used to select actions, the other to evaluate them. Double DQN itself reuses the networks DQN already has (the online network selects, the target network evaluates); a sketch follows below.
Not to be confused with simply holding the target Q network fixed!
Motivation: the max operator is prone to overestimation, because whatever estimation noise inflates a Q value is exactly what the max picks out.
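A minimal sketch of the changed target (same illustrative names as the DQN sketch above):

```python
import torch

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    """Double DQN: the online net selects the next action, the target net evaluates it."""
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)        # selection (online net)
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)  # evaluation (target net)
        return r + gamma * (1.0 - done) * q_eval
```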
Applied Deep Q learning
Policy optimization background
Taxonomy: on-policy, policy-based, Monte Carlo
Within key steps framework:
Again want policy that maximizes J (using gradient ascent).
“On-policy”: notice all the sampling in what follows is from the policy, $\tau \sim \pi_\theta$
Policy gradient theorem: computing $\nabla_\theta J(\theta)$ directly is tricky because the distribution of visited states itself depends on $\theta$. The theorem reformulates the gradient so that it does not involve the derivative of the state distribution, which simplifies the computation a lot (standard form below).
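In its simplest standard form:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right]$$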
Want:
Policy gradient math, leading up to vanilla policy gradient (direct gradient ascent) and various formulations of $\nabla_\theta J(\theta)$
Full derivations, mostly verbatim from Joshua Achiam’s super clear lecture, along with some notes:
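To accompany the derivation, a minimal PyTorch-style sketch of the resulting vanilla policy gradient (REINFORCE-style) loss for one sampled trajectory; the `policy` module and the trajectory tensors are assumptions for illustration:

```python
import torch

def vanilla_pg_loss(policy, states, actions, returns):
    """Surrogate loss whose gradient is the sampled policy gradient estimate.

    Minimizing -sum_t log pi_theta(a_t|s_t) * R_t is gradient ascent on J(theta).
    `returns` can be R(tau) repeated per step, or the reward-to-go refinement.
    """
    logits = policy(states)  # [T, n_actions] action logits for a discrete policy
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    return -(log_probs * returns).sum()
```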