- Divide and conquer MCTS for goal-directed planning, Parascandolo, DeepMind 2020
- Idea: Find intermediate goals, and solve those (simpler) problems, recursively.
- Need to try several intermediate goals. So it’s almost like a tree search over tree searches.
- Of course, the problem is that there are many possible intermediate goals (O(|S|), where S is the state space). So everything rests on whether you can learn a NN that proposes a small set of good candidates. A sketch of the recursion follows below.
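- A minimal sketch of the recursive idea, not the paper's actual MCTS. It assumes toy integer states; `reachable` and `propose_subgoals` are hypothetical stand-ins for the learned low-level solver and the subgoal proposal network:

```python
def reachable(s, g):
    # Placeholder low-level solver: adjacent integer states count as directly
    # reachable. The real system would use a learned policy/value.
    return abs(g - s) <= 1

def propose_subgoals(s, g, k=3):
    # Placeholder for the learned proposal network that keeps branching small
    # (vs the O(|S|) space of all possible intermediate states).
    mid = (s + g) // 2
    return [mid, mid - 1, mid + 1][:k]

def plan(s, g, depth=0, max_depth=8):
    """Return a list of waypoints from s to g, or None if no plan is found.
    Each candidate subgoal m splits the problem into two simpler subproblems,
    solved recursively -- a tree search over tree searches."""
    if reachable(s, g):
        return [s, g]
    if depth >= max_depth:
        return None
    for m in propose_subgoals(s, g):
        left = plan(s, m, depth + 1, max_depth)
        if left is None:
            continue
        right = plan(m, g, depth + 1, max_depth)
        if right is not None:
            return left + right[1:]  # drop the duplicated midpoint
    return None

print(plan(0, 20))  # e.g. a chain of waypoints [0, 1, ..., 20]
```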
- Curiosity-driven exploration by self-supervised prediction, Pathak et al. 2017
- Common to have sparse rewards only at the end of the level or game, many steps away
- Model builders commonly do reward shaping, adding bits of reward for things you know are good to have along the way, but these are just human-designed heuristics and require domain-specific expertise.
- Can train a model to predict the future. E.g. in VizDoom: if I take a step forward, what will happen / what will I see? The prediction error becomes an intrinsic reward: maximize surprise, so the algo explores by itself.
- A naive implementation would simply predict next states $s_{t+1}$, and compare that with actual next state. Problem: there will be parts of the env that change unexpectedly but which your actions have no influence over, such as the movement of some leaves. Algo will always be surprised/rewarded by the leaves.
- Instead, want to predict and compare features of states that depend on actions. Learn these with an encoder network $\phi$: an inverse model takes current/next features and predicts the original action, and training the encoder end-to-end with this inverse model forces $\phi$ to keep only action-relevant features. Once you have this $\phi$, train the forward model; its prediction error in feature space is the curiosity reward (see the sketch after the links below).
- Yannic
- Paper
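- A minimal ICM-style sketch in PyTorch. The plain MLPs and layer sizes are assumptions (the paper uses a conv encoder on pixels); detaching $\phi$ in the forward loss is one common implementation choice, whereas the paper weights a joint objective:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    def __init__(self, obs_dim, n_actions, feat_dim=64):
        super().__init__()
        self.n_actions = n_actions
        # phi: observation -> feature vector
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        # inverse model: (phi_t, phi_{t+1}) -> logits over the action taken
        self.inverse = nn.Sequential(
            nn.Linear(2 * feat_dim, 128), nn.ReLU(), nn.Linear(128, n_actions))
        # forward model: (phi_t, one-hot a_t) -> predicted phi_{t+1}
        self.forward_model = nn.Sequential(
            nn.Linear(feat_dim + n_actions, 128), nn.ReLU(), nn.Linear(128, feat_dim))

    def losses_and_bonus(self, obs, next_obs, action):
        phi, phi_next = self.encoder(obs), self.encoder(next_obs)
        # Inverse loss trains the encoder end-to-end, so phi keeps only
        # features the agent's actions influence (ignores the moving leaves).
        inv_loss = F.cross_entropy(
            self.inverse(torch.cat([phi, phi_next], dim=-1)), action)
        a = F.one_hot(action, self.n_actions).float()
        # phi is detached here so the forward loss does not shape the features.
        phi_pred = self.forward_model(torch.cat([phi.detach(), a], dim=-1))
        fwd_err = (phi_pred - phi_next.detach()).pow(2).sum(-1)
        return inv_loss, fwd_err.mean(), fwd_err.detach()  # bonus = surprise
```

- Usage: minimize the two losses alongside the policy objective, and add the detached per-step error to the extrinsic reward as the curiosity bonus.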
- Hindsight experience replay, OpenAI 2017
- Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions TODO
- MuZero, DeepMind 2020
- AlphaZero, DeepMind 2018
- AlphaGo Zero, DeepMind 2017
- AlphaGo, DeepMind 2016
- Uses MCTS
- Game background: 19x19 grid
- 3 policy networks
- Supervised (SL) policy network trained on a DB of human games (30M positions). CNN. Predicts the next move (classification).
- Achieved 57% move-prediction accuracy vs 44% prior SOTA.
- Rollout policy: a small linear softmax over handcrafted Go features. Also move classification. Fast: ~2 µs per move (vs ~3 ms for the SL network).
- RL policy network. Initialized from SL. Self-play trained with policy gradient methods.
- Say P1 wins: P1's moves become more probable, P2's less (see the REINFORCE sketch at the end of this section).
- Randomly select a previous iteration of the policy network as the opponent.
- Self-play games are added to the SL pool?
- Not used in the final playing algorithm (the SL policy worked better for guiding the search); used to generate the self-play games that train the value network.
- Value network predicts the likelihood of winning, given a state. Trained with (state, game outcome) pairs from the self-play games.
- Positions within a single game are highly correlated (violates IID), which causes overfitting. So sample just one position per game; this generated a 30M-position dataset from 30M distinct games (sampling shown in the sketch below).
- https://www.youtube.com/watch?v=Z1BELqFQZVM
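- A toy sketch of the two training steps above (self-play REINFORCE and one-position-per-game value data). The network, data layout, and sign conventions are stand-ins, not the paper's architecture:

```python
import random
import torch
import torch.nn.functional as F

def reinforce_update(policy, optimizer, games):
    """games: list of (states, actions, outcomes), where outcomes[t] is +1 if
    the player who moved at step t went on to win, -1 otherwise. The winner's
    moves become more probable, the loser's less."""
    loss = torch.tensor(0.0)
    for states, actions, outcomes in games:
        logp = F.log_softmax(policy(states), dim=-1)       # (T, num_moves)
        chosen = logp.gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = loss - (outcomes * chosen).sum()            # REINFORCE, return = z
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def value_net_dataset(self_play_games):
    """Sample ONE (state, outcome) pair per game: within-game positions are
    highly correlated, and training on all of them overfits the value net."""
    data = []
    for states, _, outcomes in self_play_games:
        t = random.randrange(len(states))
        data.append((states[t], outcomes[t]))
    return data
```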
- Learning hierarchy of actions with meta learning ("Meta Learning Shared Hierarchies"), OpenAI 2017
- Learn to apply sub-policies: a master policy selects among shared sub-policies that are reused across tasks (sketch below).
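- A minimal sketch of the two-level control loop, assuming the master re-chooses a sub-policy every N steps. `env`, `master_policy`, and `sub_policies` are hypothetical stand-ins, including the env's step API:

```python
def run_episode(env, master_policy, sub_policies, N=10, max_steps=500):
    """The master picks a sub-policy index every N steps; the chosen
    sub-policy emits primitive actions in between. Sub-policies are shared
    across tasks; only the master is re-learned when the task changes."""
    obs, total, k = env.reset(), 0.0, 0
    for t in range(max_steps):
        if t % N == 0:
            k = master_policy(obs)   # high-level action: which skill to run
        # assumed env API: step(action) -> (obs, reward, done)
        obs, reward, done = env.step(sub_policies[k](obs))
        total += reward
        if done:
            break
    return total
```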