• MCTS

    • Confusing terminology: the literature refers to “leaves”, but these are simply not-fully-expanded nodes (we are still learning about their value distributions).
    • Each iteration adds at most one new child to a partially expanded node. Only once a node is fully expanded do we recursively move on to expanding its children.
    • Rollout plays the rest of the game randomly. Backpropagation then pushes that new child’s result up to inform its (partially expanded) ancestors.
    # main function for the Monte Carlo Tree Search
    def monte_carlo_tree_search(root):
    	while resources_left(time, computational_power):
    		leaf = traverse(root) 
    		simulation_result = rollout(leaf)
    		backpropagate(leaf, simulation_result)
    	return best_child(root)
    
    # function for node traversal
    def traverse(node):
    	while fully_expanded(node):
    		node = best_uct(node)
    	# in case no children are present / node is terminal 
    	return pick_unvisited(node.children) or node 
    
    # function for the result of the simulation
    def rollout(node):
    	while non_terminal(node):
    		node = rollout_policy(node)
    	return result(node) 
    
    # function for randomly selecting a child node
    def rollout_policy(node):
    	return pick_random(node.children)
    
    # function for backpropagation
    def backpropagate(node, result):
    	node.stats = update_stats(node, result)
    	if is_root(node):
    		return
    	backpropagate(node.parent, result)
    
    # function for selecting the best child
    # node with highest number of visits
    def best_child(node):
    	return max(node.children, key=lambda c: c.visits)
    


  • Code

    import math
    import random
    
    class Node:
        def __init__(self, value, children=None):
            self.value = value
            self.children = children or []
            self.visits = 0
            self.score = 0
    
    def uct_score(node, parent_visits, maximizing):
        # Unvisited children score infinity so each is tried at least once
        if node.visits == 0:
            return float('inf')
        exploit = node.score / node.visits
        explore = math.sqrt(2 * math.log(parent_visits) / node.visits)
        # Min's levels prefer low averages, so negate the exploitation term
        return (exploit if maximizing else -exploit) + explore
    
    def mcts(root, iterations):
        for _ in range(iterations):
            node = root
            path = [root]      # include the root so its visit count is updated
            maximizing = True  # Max moves at the root; turns alternate by depth
            
            # Selection: descend via UCT until we reach a leaf. This toy tree
            # is fully materialized, so selection doubles as expansion, and
            # the leaf's value stands in for the result of a random rollout
            # (the Simulation step).
            while node.children:
                parent_visits = node.visits
                node = max(node.children,
                           key=lambda n: uct_score(n, parent_visits, maximizing))
                path.append(node)
                maximizing = not maximizing
            
            # Backpropagation: credit the leaf value to every node on the path
            value = node.value
            for n in reversed(path):
                n.visits += 1
                n.score += value
        
        # Best move for Max: root child with the highest average score
        # (visit count is the other common criterion)
        return max(root.children, key=lambda n: n.score / n.visits)
    
    # Tree setup
    A = Node('A', [Node('B', [Node(3), Node(5)]), Node('C', [Node(2), Node(9)])])
    
    # MCTS evaluation
    best_move = mcts(A, 1000)
    print(f"Best move: {best_move.value}")
    
  • Stages:

    1. Selection: Starting from the root node, recursively select child nodes until a leaf (i.e., not-fully-expanded) node is reached. This step uses a policy like Upper Confidence Bound for Trees (UCT) to decide which child node to select.
    2. Expansion: If the leaf node is not a terminal state, one or more child nodes are added to represent possible moves.
    3. Simulation: From the newly added node, a simulation (or rollout) is performed, typically by making random moves until a terminal state is reached.
    4. Backpropagation: The result of the simulation is propagated back through the tree to update the statistics of the nodes involved in the simulation.
  • Contrast with minimax

    def minimax(node, is_maximizing):
        if node.is_leaf():
            return node.value
        
        if is_maximizing:
            return max(minimax(child, False) for child in node.children)
        else:
            return min(minimax(child, True) for child in node.children)
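    
    A minimal runnable check of this, reusing the Node class from the MCTS code above (assumption: is_leaf() is not defined there, so we take "leaf" to mean "has no children"):

    # Assumption: a node is a leaf exactly when it has no children
    Node.is_leaf = lambda self: not self.children

    root = Node('A', [Node('B', [Node(3), Node(5)]),
                      Node('C', [Node(2), Node(9)])])
    # Max at A, Min at B and C: max(min(3, 5), min(2, 9)) = 3
    print(minimax(root, True))  # -> 3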
    
  • Example

    • Let's use a simple two-player game called "Number Picking" for our example:

      1. We start with a small tree of numbers.
      2. Two players (Max and Min) take turns choosing a path down the tree.
      3. Max tries to get the highest number, while Min tries to get the lowest.
      4. The game ends when they reach a leaf node, and that's the final score.
    • Game tree:

         A
       /   \
      B     C
     / \   / \
    3   5 2   9
      
    • Minimax:

      • At B: Min chooses 3
      • At C: Min chooses 2
      • At A: Max chooses between 3 and 2, so 3 (via B)
    • MCTS:

      • MCTS will run many simulations, gradually building up visit and score statistics on which moves lead to better outcomes; on this tree its root statistics converge to the same move as minimax (B)
  • UCB1 for Trees (UCT), where $w_i$ can be a cumulative numeric score, $n_i$ is the child's visit count, and $N$ is the parent's visit count:

    $\frac{w_i}{n_i} + c \sqrt{\frac{\ln N}{n_i}}$

    • Intuition: balance exploitation (the first term is the average reward) and exploration (the second term shrinks as the child's visit count grows); see the numeric sketch below

    • Generalization

      (figure: generalized form of the UCT formula)
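
    To see the exploration/exploitation balance numerically, a small sketch (the scores and visit counts are made up; same UCB1-style formula with c = √2):

    import math

    def uct(score, visits, parent_visits, c=math.sqrt(2)):
        # average reward (exploitation) + visit-count bonus (exploration)
        return score / visits + c * math.sqrt(math.log(parent_visits) / visits)

    # a well-explored strong child vs. a barely explored weak one
    print(uct(score=60, visits=100, parent_visits=120))  # ≈ 0.91
    print(uct(score=2, visits=5, parent_visits=120))     # ≈ 1.78
    # the rarely visited child scores higher, so it gets explored next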

  • More generic pseudocode:

    def MCTS(root):
        for _ in range(number_of_simulations):
            node = Selection(root)
            child_node = Expansion(node)
            result = Simulation(child_node)
            Backpropagation(child_node, result)
        return BestMove(root)
    
    def Selection(node):
        while node is fully expanded and not terminal:
            node = best_child(node)  # this is where UCT is incorporated
        return node
    
    def Expansion(node):
        if node is not terminal:
            return expand_node(node)
        return node
    
    def Simulation(node):
        while node is not terminal:
            node = random_play(node)
        return result_of_the_game(node)
    
    def Backpropagation(node, result):
        while node is not None:
            node.update(result)
            node = node.parent
    
    def BestMove(root):
        return child_with_highest_visit_count(root)
    


    • https://gibberblot.github.io/rl-notes/single-agent/mcts.html
  • Vs. value/advantage functions only?

    • MCTS simulates actual environment steps, and value estimates made further down a path (closer to the terminal state) are more accurate than estimates made near the root. (discussion)
  • UCB1 vs UCT vs PUCT

    • UCB1 (Upper Confidence Bound 1):

      UCB1 = X̄ᵢ + C * √(ln N / nᵢ)

      Where:

      • X̄ᵢ is the average reward of action i
      • N is the total number of trials (across all actions)
      • nᵢ is the number of times action i has been tried
      • C is an exploration constant (typically √2)

      UCB1 was originally designed for the multi-armed bandit problem, not specifically for trees.

    • UCT (Upper Confidence Bounds for Trees):

      UCT = X̄ᵢ + C * √(ln N / nᵢ)

      Where:

      • X̄ᵢ is the average reward of moves from node i
      • N is the number of times the parent node has been visited
      • nᵢ is the number of times child node i has been visited
      • C is an exploration constant (often tuned for the specific problem)
    • PUCT formula: PUCT = Q(s,a) + c_puct * P(s,a) * √N(s) / (1 + N(s,a))

      • Q(s,a) is the estimated value of action a in state s
      • P(s,a) is the prior probability of selecting action a in state s
      • N(s) is the visit count of state s
      • N(s,a) is the visit count of action a from state s
      • c_puct is a constant that controls exploration
    • X or Q:

      • UCB1: Direct rewards from actions.
      • UCT: Backed-up values from Monte Carlo simulations.
      • PUCT: A Q-function estimate (generalized value estimation) combined with the prior P(s,a).
    • https://claude.ai/chat/880b83d1-c523-4e01-8a49-97f00689af5c
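
    • A minimal sketch of PUCT-style child selection (the action names, priors, counts, and c_puct value here are made-up assumptions for illustration):

      import math

      def puct(q, prior, parent_visits, child_visits, c_puct=1.5):
          # Q(s,a) + c_puct * P(s,a) * sqrt(N(s)) / (1 + N(s,a))
          return q + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)

      # action: (Q(s,a), prior P(s,a), visit count N(s,a))
      actions = {'a1': (0.4, 0.5, 10), 'a2': (0.9, 0.1, 2), 'a3': (0.0, 0.4, 0)}
      N_s = sum(n for _, _, n in actions.values())  # N(s) = 12

      best = max(actions, key=lambda a: puct(*actions[a][:2], N_s, actions[a][2]))
      print(best)  # -> 'a3': unvisited but with a decent prior, so explored first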

  • Vs beam search

    • MCTS incrementally builds a search tree whose statistics inform future simulations, unlike beam search, which just walks the tree forward once

    • MCTS explores the built tree using UCT (balancing value estimation and exploration), and randomly simulates the rest until the end. Beam search just keeps the top k according to some value estimation.

    • MCTS makes the most sense with a smallish action space/branching factor, since with a huge branching factor the search rarely re-traverses the same actions

      • the visit counts N and n then mostly stay at 0 or 1, so the UCT statistics never become informative
      • PUCT mitigates this by weighting the exploration term with the policy probabilities
    • Beam search is greedy, so it can miss ultimately optimal solutions, especially when the greedy value estimate is a poor heuristic

    • Note: MCBS (Monte Carlo Beam Search) is a much simpler algorithm that just plugs Monte Carlo simulation into beam search as the value estimator.

      def mcbs(root_state, beam_width, num_iterations, simulation_depth):
          beam = [root_state]

          for _ in range(num_iterations):
              next_beam = []

              for state in beam:
                  for child in generate_children(state):
                      # Monte Carlo rollouts act as the value estimator
                      child.score = monte_carlo_simulation(child, simulation_depth)
                      next_beam.append(child)

              # ordinary beam-search pruning: keep only the top-k scored children
              next_beam.sort(key=lambda s: s.score, reverse=True)
              beam = next_beam[:beam_width]

          return max(beam, key=lambda s: s.score)
      
  • References

    • https://jonathan-hui.medium.com/monte-carlo-tree-search-mcts-in-alphago-zero-8a403588276a