Taxonomy
Insights - when to use RL?
Definitions
Value-based (learn a Q function) vs policy-based (directly optimize the policy)
Both want a policy that maximizes expected return:
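One standard way to write this shared objective (a sketch, assuming a discounted infinite-horizon return; τ is a trajectory generated by running π):

$$\pi^{*} = \arg\max_{\pi} J(\pi), \qquad J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right]$$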
Why policy learning? (Abbeel lec)
Visualizations (note left is state values, not Q, and often actually used in policy gradient):
Trade-offs Between Policy Optimization and Q-Learning. The primary strength of policy optimization methods is that they are principled, in the sense that you directly optimize for the thing you want. This tends to make them stable and reliable. By contrast, Q-learning methods only indirectly optimize for agent performance, by training to satisfy a self-consistency equation. There are many failure modes for this kind of learning, so it tends to be less stable. [1] But, Q-learning methods gain the advantage of being substantially more sample efficient when they do work, because they can reuse data more effectively than policy optimization techniques.
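The "self-consistency equation" referred to here is the Bellman optimality equation for Q*, which Q-learning trains its estimate to satisfy (standard form, expectation over next states s'):

$$Q^{*}(s,a) = \mathbb{E}_{s'}\!\left[ r(s,a) + \gamma \max_{a'} Q^{*}(s',a') \right]$$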
More on disadvantages of policy learning:
More on policy learning vs Q learning
Can be mixed/interpolated (actor-critic style methods); in some settings the two turn out to be equivalent.
Model-based vs model-free
In this chapter we develop a unified view of reinforcement learning methods that require a model of the environment, such as dynamic programming and heuristic search, and methods that can be used without a model, such as Monte Carlo and temporal-difference methods. These are respectively called model-based and model-free reinforcement learning methods. Model-based methods rely on planning as their primary component, while model-free methods primarily rely on learning.
Stateless, just actions → reward distribution (the bandit setting):
Action-value approach: can estimate this simply with the sample average of rewards received for each action:
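The sample-average estimate (standard form; $R_i$ is the reward at step $i$, $A_i$ the action taken, $\mathbf{1}$ the indicator):

$$Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \,\mathbf{1}[A_i = a]}{\sum_{i=1}^{t-1} \mathbf{1}[A_i = a]}$$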
Has nonstationary variants (constant step size instead of sample averages); see the sketch below.
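A minimal sketch of the incremental update with a constant step size, the usual choice for the nonstationary case; the environment and all names below are illustrative placeholders:

```python
import random

def run_bandit(num_steps=1000, num_actions=10, alpha=0.1, epsilon=0.1):
    """Epsilon-greedy action-value estimation with a constant step size.

    A constant alpha weights recent rewards more heavily, which is what you
    want when the reward distributions drift over time (nonstationary case).
    """
    q = [0.0] * num_actions  # current action-value estimates

    def sample_reward(action):
        # Placeholder for the true (possibly drifting) reward distribution.
        return random.gauss(action * 0.1, 1.0)

    for _ in range(num_steps):
        # Epsilon-greedy: mostly exploit the best current estimate, sometimes explore.
        if random.random() < epsilon:
            action = random.randrange(num_actions)
        else:
            action = max(range(num_actions), key=lambda a: q[a])

        reward = sample_reward(action)
        # Incremental update: Q <- Q + alpha * (R - Q).
        # With constant alpha this is an exponential recency-weighted average.
        q[action] += alpha * (reward - q[action])

    return q
```

With a constant step size the estimate never fully converges but keeps tracking a drifting reward distribution; with a step size of 1/n it reduces to the plain sample average above.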