• MCTS

    • Confusing terminology: the literature refers to “leaves”, but these are simply not-fully-expanded nodes (we are still learning about their value distributions).
    • Each iteration adds at most one new child to a partially expanded node. Only once a node is fully expanded do we recursively move on to expanding its children.
    • Rollout plays the rest of the game randomly. Backpropagation then pushes that new child’s result up to inform its (partially expanded) ancestors.
    # main function for the Monte Carlo Tree Search
    def monte_carlo_tree_search(root):
    	while resources_left(time, computational_power):
    		leaf = traverse(root) 
    		simulation_result = rollout(leaf)
    		backpropagate(leaf, simulation_result)
    	return best_child(root)
    
    # function for node traversal
    def traverse(node):
    	while fully_expanded(node):
    		node = best_uct(node)
    	# in case no children are present / node is terminal 
    	return pick_unvisited(node.children) or node 
    
    # function for the result of the simulation
    def rollout(node):
    	while non_terminal(node):
    		node = rollout_policy(node)
    	return result(node) 
    
    # function for randomly selecting a child node
    def rollout_policy(node):
    	return pick_random(node.children)
    
    # function for backpropagation
    def backpropagate(node, result):
    	node.stats = update_stats(node, result)
    	if is_root(node):
    		return
    	backpropagate(node.parent, result)
    
    # function for selecting the best child
    # node with highest number of visits
    def best_child(node):
    	return max(node.children, key=lambda c: c.visits)
    


  • Code

    import math
    import random
    
    class Node:
        def __init__(self, value, children=None):
            self.value = value
            self.children = children or []
            self.visits = 0
            self.score = 0
    
    def uct_score(node, parent_visits, maximizing):
        # Unvisited children score infinity so each is tried at least once
        if node.visits == 0:
            return float('inf')
        exploit = node.score / node.visits
        explore = math.sqrt(2 * math.log(parent_visits) / node.visits)
        # Min's levels prefer low averages, so negate the exploitation term
        return (exploit if maximizing else -exploit) + explore
    
    def mcts(root, iterations):
        for _ in range(iterations):
            node = root
            path = [root]      # include the root so its visit count is updated
            maximizing = True  # Max moves at the root; turns alternate by depth
            
            # Selection: descend via UCT until we reach a leaf. This toy tree
            # is fully materialized, so selection doubles as expansion, and
            # the leaf's value stands in for the result of a random rollout
            # (the Simulation step).
            while node.children:
                parent_visits = node.visits
                node = max(node.children,
                           key=lambda n: uct_score(n, parent_visits, maximizing))
                path.append(node)
                maximizing = not maximizing
            
            # Backpropagation: credit the leaf value to every node on the path
            value = node.value
            for n in reversed(path):
                n.visits += 1
                n.score += value
        
        # Best move for Max: root child with the highest average score
        # (visit count is the other common criterion)
        return max(root.children, key=lambda n: n.score / n.visits)
    
    # Tree setup
    A = Node('A', [Node('B', [Node(3), Node(5)]), Node('C', [Node(2), Node(9)])])
    
    # MCTS evaluation
    best_move = mcts(A, 1000)
    print(f"Best move: {best_move.value}")
    
  • Stages:

    1. Selection: Starting from the root node, recursively select child nodes until a leaf (i.e., not-fully-expanded) node is reached. This step uses a policy like Upper Confidence Bound for Trees (UCT) to decide which child node to select.
    2. Expansion: If the leaf node is not a terminal state, one or more child nodes are added to represent possible moves.
    3. Simulation: From the newly added node, a simulation (or rollout) is performed, typically by making random moves until a terminal state is reached.
    4. Backpropagation: The result of the simulation is propagated back through the tree to update the statistics of the nodes involved in the simulation.
  • Contrast with minimax

    def minimax(node, is_maximizing):
        if node.is_leaf():
            return node.value
        
        if is_maximizing:
            return max(minimax(child, False) for child in node.children)
        else:
            return min(minimax(child, True) for child in node.children)
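    
    A minimal runnable check of this, reusing the Node class from the MCTS code above (assumption: is_leaf() is not defined there, so we take "leaf" to mean "has no children"):

    # Assumption: a node is a leaf exactly when it has no children
    Node.is_leaf = lambda self: not self.children

    root = Node('A', [Node('B', [Node(3), Node(5)]),
                      Node('C', [Node(2), Node(9)])])
    # Max at A, Min at B and C: max(min(3, 5), min(2, 9)) = 3
    print(minimax(root, True))  # -> 3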
    
  • Example

    • Let's use a simple two-player game called "Number Picking" for our example:

      1. We start with a small tree of numbers.
      2. Two players (Max and Min) take turns choosing a path down the tree.
      3. Max tries to get the highest number, while Min tries to get the lowest.
      4. The game ends when they reach a leaf node, and that's the final score.
    • Game tree:

         A
       /   \
      B     C
     / \   / \
    3   5 2   9
      
    • Minimax:

      • At B: Min chooses 3
      • At C: Min chooses 2
      • At A: Max chooses between 3 and 2, so 3 (via B)
    • MCTS:

      • MCTS will run many simulations, gradually building up visit and score statistics on which moves lead to better outcomes; on this tree its root statistics converge to the same move as minimax (B)
  • UCB1 for Trees (UCT), where $w_i$ can be a cumulative numeric score, $n_i$ is the child's visit count, and $N$ is the parent's visit count:

    $\frac{w_i}{n_i} + c \sqrt{\frac{\ln N}{n_i}}$

    • Intuition: balance exploitation (the first term is the average reward) and exploration (the second term shrinks as the child's visit count grows); see the numeric sketch below

    • Generalization

      (figure: generalized form of the UCT formula)
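
    To see the exploration/exploitation balance numerically, a small sketch (the scores and visit counts are made up; same UCB1-style formula with c = √2):

    import math

    def uct(score, visits, parent_visits, c=math.sqrt(2)):
        # average reward (exploitation) + visit-count bonus (exploration)
        return score / visits + c * math.sqrt(math.log(parent_visits) / visits)

    # a well-explored strong child vs. a barely explored weak one
    print(uct(score=60, visits=100, parent_visits=120))  # ≈ 0.91
    print(uct(score=2, visits=5, parent_visits=120))     # ≈ 1.78
    # the rarely visited child scores higher, so it gets explored next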

  • More generic pseudocode:

    def MCTS(root):
        for _ in range(number_of_simulations):
            node = Selection(root)
            child_node = Expansion(node)
            result = Simulation(child_node)
            Backpropagation(child_node, result)
        return BestMove(root)
    
    def Selection(node):
        while node is fully expanded and not terminal:
            node = best_child(node)  # this is where UCT is incorporated
        return node
    
    def Expansion(node):
        if node is not terminal:
            return expand_node(node)
        return node
    
    def Simulation(node):
        while node is not terminal:
            node = random_play(node)
        return result_of_the_game(node)
    
    def Backpropagation(node, result):
        while node is not None:
            node.update(result)
            node = node.parent
    
    def BestMove(root):
        return child_with_highest_visit_count(root)
    


    • https://gibberblot.github.io/rl-notes/single-agent/mcts.html
  • Vs. value/advantage functions only?

    • MCTS simulates actual environment steps, and value estimates made further down a path (closer to the terminal state) are more accurate than estimates made near the root. (discussion)
  • UCB1 vs UCT vs PUCT

    • UCB1 (Upper Confidence Bound 1):

      UCB1 = X̄ᵢ + C * √(ln N / nᵢ)

      Where:

      • X̄ᵢ is the average reward of action i
      • N is the total number of trials (across all actions)
      • nᵢ is the number of times action i has been tried
      • C is an exploration constant (typically √2)

      UCB1 was originally designed for the multi-armed bandit problem, not specifically for trees.

    • UCT (Upper Confidence Bounds for Trees):

      UCT = X̄ᵢ + C * √(ln N / nᵢ)

      Where:

      • X̄ᵢ is the average reward of moves from node i
      • N is the number of times the parent node has been visited
      • nᵢ is the number of times child node i has been visited
      • C is an exploration constant (often tuned for the specific problem)
    • PUCT formula: PUCT = Q(s,a) + c_puct * P(s,a) * √N(s) / (1 + N(s,a))

      • Q(s,a) is the estimated value of action a in state s
      • P(s,a) is the prior probability of selecting action a in state s
      • N(s) is the visit count of state s
      • N(s,a) is the visit count of action a from state s
      • c_puct is a constant that controls exploration
    • X or Q:

      • UCB1: Direct rewards from actions.
      • UCT: Backed-up values from Monte Carlo simulations.
      • PUCT: A Q-function estimate (generalized value estimation) combined with the prior P(s,a).
    • https://claude.ai/chat/880b83d1-c523-4e01-8a49-97f00689af5c
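
    • A minimal sketch of PUCT-style child selection (the action names, priors, counts, and c_puct value here are made-up assumptions for illustration):

      import math

      def puct(q, prior, parent_visits, child_visits, c_puct=1.5):
          # Q(s,a) + c_puct * P(s,a) * sqrt(N(s)) / (1 + N(s,a))
          return q + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)

      # action: (Q(s,a), prior P(s,a), visit count N(s,a))
      actions = {'a1': (0.4, 0.5, 10), 'a2': (0.9, 0.1, 2), 'a3': (0.0, 0.4, 0)}
      N_s = sum(n for _, _, n in actions.values())  # N(s) = 12

      best = max(actions, key=lambda a: puct(*actions[a][:2], N_s, actions[a][2]))
      print(best)  # -> 'a3': unvisited but with a decent prior, so explored first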

  • Vs beam search

    • MCTS incrementally builds a search tree whose statistics inform future simulations, unlike beam search, which just walks the tree forward once

    • MCTS explores the built tree using UCT (balancing value estimation and exploration), and randomly simulates the rest until the end. Beam search just keeps the top k according to some value estimation.

    • MCTS makes the most sense with a smallish action space/branching factor, since with a huge branching factor the search rarely re-traverses the same actions

      • the visit counts N and n then mostly stay at 0 or 1, so the UCT statistics never become informative
      • PUCT mitigates this by weighting the exploration term with the policy probabilities
    • Beam search is greedy, so it can miss ultimately optimal solutions, especially when the greedy value estimate is a poor heuristic

    • Note: MCBS (Monte Carlo Beam Search) is a much simpler algorithm that just plugs Monte Carlo simulation into beam search as the value estimator.

      def mcbs(root_state, beam_width, num_iterations, simulation_depth):
          beam = [root_state]

          for _ in range(num_iterations):
              next_beam = []

              for state in beam:
                  for child in generate_children(state):
                      # Monte Carlo rollouts act as the value estimator
                      child.score = monte_carlo_simulation(child, simulation_depth)
                      next_beam.append(child)

              # ordinary beam-search pruning: keep only the top-k scored children
              next_beam.sort(key=lambda s: s.score, reverse=True)
              beam = next_beam[:beam_width]

          return max(beam, key=lambda s: s.score)
      
  • References

    • https://jonathan-hui.medium.com/monte-carlo-tree-search-mcts-in-alphago-zero-8a403588276a