Sunday, January 12, 2025

Elements of Monte Carlo Tree Search - Typical and Non-typical Applications

Monte Carlo Tree Search (MCTS) offers a very intuitive way of tackling challenging decision making problems. In essence, MCTS combines the structure of a tree search with the power of random sampling to navigate large, complex search spaces effectively. I've been interested in MCTS for a long time - this interest was cemented in 2016 by the victory of Alpha Go over Lee Sedol and the famous "Move 37." By the way, I wrote an entire post about Move 37 and the role of innovation in AI here. Move 37, a move that was thought at best to be "unique" but was at the moment the move was played thought to be a horrible move - a move that no respectable Go player would make. But the move was both effective and innovative. It was this balance between exploiting an immediated advantage and exploring the space for potentially even better solutions that gave Alpha Go the victory in a game that because of its complexity and need for imagination was thought impossible for AI. This exploitation vs exploration dynamic is the crux of MCTS and it along with neural network evaluations and reinforcement learning gave Alpha Go the victory.

In this post, I want to talk about the main ideas behind MCTS, where MCTS can be used, and then walk through the code for implementing a MCTS in a game of Connect 4. But if you want to skip right to playing the game: it is here and the Github code is here.

So what kind of applications is MCTS appropriate for? Historically the typical scenario is around games and specifically finite two-person zero-sum sequential games. In other words, in these types of games, there are two players competing directly against each other, the gain of one player is exactly the loss of the other (zero-sum), the game progresses through sequential moves, and the game has a finite number of states and actions, leading to a clear end.

These two player games are the typical application of MCTS, but I've long been interested in applications beyond these types of games. Any problem where there is a large search space, that has tree like decision nodes that operate sequentially under limited resource or time constraints. Any kind of optimization problem with these characteristics MCTS could be applicable. Robotics is an obvious example. A robot has to complete some objective, but there are many different paths to complete that objective. But how about some not so obvious applications. In education, MCTS could help in what I'm calling Personalized Learning Path Optimization (PLPO). A PLPO could as a student engages with learning activities customize a personalized curriculum for a student in an adaptive learning platform, balancing engagement, challenge, and skill acquisition. In this situation, the state represents the student’s current skill level, engagement, and performance history. Actions include presenting different types of learning activities (e.g., videos, quizzes, projects). The MCTS simulations predict the student’s response to a sequence of learning activities, optimizing for long-term improvement and retention.

In biotech, an example would be to optimize the sequence of biological reactions in a synthetic pathway for producing potential drugs. The states would represent the current pathway, including enzymes and intermediates. Actions involve adding, removing, or modifying reactions in the pathway. Then MCTS simulations predict yield, efficiency, and stability, optimizing for production goals and feasibility.

Another non-typical example I want to talk at length about is optimizing a customer journey designed for e-commerce platforms. In this example, the goal would be to guide consumers through a personalized journey (e.g., product recommendations, promotional offers, and website interactions) to maximize conversions, repeat purchases, and customer satisfaction.

I want to come back to this example of using MCTS in optimizing the customer journey, but first I want to go into a larger explanation of how MCTS works in the typical case of two person games.

MCTS combines the structure of a tree search with random sampling to navigate large, complex search spaces. Unlike traditional search methods that often rely on complete or near complete look-ahead, MCTS focuses on sampling potential outcomes and incrementally refining its estimates of how good or bad certain moves or actions are. This makes MCTS particularly appealing in domains where exhaustive searches become prohibitively large.

One way to understand MCTS is by tracing the iterative loop through four stages:

  1. Selection
  2. Expansion
  3. Simulation
  4. Backpropagation

Each of these stages contributes to building up a more accurate picture of the search space. During selection, the algorithm starts at the root of the search tree (representing the current state) and travels down along existing paths, choosing actions that optimize the balance between exploitation (picking actions that have worked well so far) and exploration (investigating actions that remain relatively untested). A common way to manage this trade off is by using the Upper Confidence Bound applied to Trees (UCT) formula. If \( N \) is the total number of visits to the current node and \( n \) is the number of visits to a child node, while \( \overline{X} \) is the current estimate of the child’s average reward, then the child node’s UCT value can be computed as:

\( \text{UCT} = \overline{X} + c \sqrt{\frac{\ln(N)}{n}} \)

\( \overline{X} \) = current estimate of the child’s average reward
\( c \) = constant controlling the exploration-exploitation balance
\( N \) = total number of visits to the current node
\( n \) = number of visits to a child node

Here a larger \( c \) encourages more exploration of unexplored moves. The goal of this formula is to favor actions that have high average rewards while still allocating some attempts to actions with few visits (because they might lead to surprising or better results once explored more thoroughly). This is how we get moves that might look to a bystander as "unusual" or "non-standard."

After reaching a leaf node, either a node that has not been explored yet, or one that is not fully expanded with potential child nodes, the algorithm proceeds to the expansion phase. In this stage, it creates one or more child nodes corresponding to unvisited actions. This expansion step gradually grows the search tree over time, ensuring that new parts of the state space are discovered rather than staying confined to what has already been visited.

The third stage, simulation, or you will often hear it referred to as the playout, is where the “Monte Carlo” part comes into play. From one of the newly added child nodes, MCTS simulates a sequence of moves (often random or guided by a light heuristic) all the way to a terminal outcome. For example, in a game, that outcome might be a win, a loss, or a draw. In a more general planning or scheduling context, it could be a success or failure, or a particular reward value. These random playouts, repeated many times, serve as samples of what might realistically happen if a particular action path is chosen.

Once the simulation is complete, the algorithm performs backpropagation to propagate the result of the simulation back up the tree to the root. Along the path taken in the search tree, each node’s statistics - like how many simulations passed through that node, how many ended in a favorable outcome, or the average return are updated. Over many iterations, these accumulated statistics become increasingly meaningful, revealing which branches of the search tree appear promising and which ones do not.

MCTS is successful in many applications because it doesn’t require a specially devised evaluation function for intermediate states. Given enough playouts, the sampling can approximate the true value of each action. This versatility is why MCTS has been popular in board games like Go, where it formed the backbone of AlphaGo’s decision making system. It is also employed in video game AI, robotics for path planning, and even scheduling problems; anywhere the decision space is complex and uncertain.

Yet, MCTS is not without challenges. Its effectiveness depends significantly on the quality of the playouts - because remember these playouts are taken place under a constraint, which is often a time limit. If the simulations are purely random in a very large or complex domain, they might produce misleading estimates. For this reason, many implementations add a small amount of domain knowledge or heuristics to steer simulations toward more likely or more relevant outcomes (I will do this in the Connect 4 example below). Another concern is the size of the search tree itself: while MCTS is more scalable than exhaustive methods, it can still blow up in expansive problem spaces if it does not prune or limit growth effectively. Implementation details such as how states are represented, how child nodes are stored, and how results are backpropagated can significantly impact its performance.

In practice, the so-called “anytime” property of MCTS is one of its greatest strengths. You can let the algorithm run for as many or as few iterations as time allows, and it will offer the best solution it has found up to that point. Longer run times translate into more simulations, deeper exploration, and more refined estimates of action quality, but even with limited time, MCTS can generate a decent approximation of the optimal move.

So by blending random sampling with incremental, iterative updates to a tree of possibilities, MCTS circumvents the need for elaborate heuristics or full look-ahead. Through repeated applications of selection, expansion, simulation, and backpropagation, MCTS transforms what might otherwise be an intractable search into a manageable process, making it a "go to" strategy for complex game spaces like Go, Chess, and Othelo.

Before moving on to the Connect 4 code, I should mention the algorithm of minimax search with alpha/beta pruning. Historically, minimax has been the "go to" algorithm for games like Chess and the best Chess engines in the world have been built around minimax like Stockfish. Although recently a MCTS type Chess engine named Leela has competed very well with Stockfish (although technically it's not a pure MCTS - it uses a neural network to simulate the rollouts). Minimax with alpha/beta pruning is a very deterministic search. It systematically explores the game tree to a certain depth (or until terminal states). Typically, games like Chess with large branching factors must limit search depth. At the leaf nodes (or at the cutoff depth), an evaluation function (heuristic) estimates the position’s value, which is where pruning comes in. Pruning uses α (alpha) and β (beta) bounds to prune branches that cannot affect the final minimax decision.

The main difference between MCTS and minimax is that MCTS is probablistic and minimax is deterministic. Minimax is effective in games with reasonable branching factor like Chess and Checkers. But with games like Go, the number of states grows exponentially with depth of search.

So based on this discussion, you're probably thinking that minimax would be the best algorithm for Connect 4, since search space of possible moves and the depth of search is very limited - and you would be right. And certainly anyone building a game of Connect 4 would almost automatically use a minimax type algorithm. But with the following code using MCTS, I'm showing MCTS can do as well as minimax given a time constraint.

The Github repository can be found here. The most important function for the implementation is in the mcts.py file in a function unsurprisingly called mcts. This function steps through the four stages I outlined above of selection, expansion, simulation, and backpropagation. It iteratively does this under a time constraint. I have set my time constraint for 1 second - it has to come up with a move in under 1 second. In the selection stage, it is looking for the child node with the best UCT value. Once that node is found, the expansion stage expands that node by taking one untried move and creating a child node. Then the simulation stage simulates a random game of moves (rollout) from that child node until there is a winner or a draw. The backpropagation stage collects the simulation results up the nodes in that tree path. Once the while loop is finished, the node with the highest node count is the move that will be taken (which in the case of Connect 4 is the column that the piece will be dropped).

def mcts(root_board, current_player, simulations=500, time_limit=1.0):
    """
    Perform MCTS from the root state and return the column of the best move.
    """
    start_time = time.time()
    root_node = MCTSNode(root_board, current_player)
    
    while (time.time() - start_time) < time_limit:
        # 1. Selection
        node = root_node
        while not node.untried_moves and node.children:
            node = best_child(node)
        
        # 2. Expansion
        if node.untried_moves:
            node = expand_node(node)
        
        # 3. Simulation
        winner = simulate_game(node.board, node.current_player)
        
        # 4. Backpropagation
        backpropagate(node, winner)
    
    # After time is up, pick the child with the highest visit count.
    best_move_node = max(root_node.children, key=lambda c: c.visits) if root_node.children else None
    
    if best_move_node is None:
        # fallback if somehow no children
        return random.choice(get_valid_moves(root_board))
    
    # Return the column that leads to best_move_node
    for col in get_valid_moves(root_board):
        candidate_board = make_move(root_board, col, current_player)
        if candidate_board == best_move_node.board:
            return col
    
    return random.choice(get_valid_moves(root_board))

When the computer is looking to make its move, it doesn't automatically call this mcts function to get its next move. As I mentioned above, many games will include heuristics to steer the decision to look for a forced Connect 4 win or to block the opponent's forced win. Chess engines will incorporate special case checks for forced checkmates or forced draws and then do their algorithm search like the minimax. Similarly, I created a find_immediate_win_or_blockade function so as to not miss a forced win or loss that it will look at before doing the mcts function.

So this function will look to make sure the computer doesn't miss a move before it does its time budgeted move search:
def find_immediate_win_or_block(board, current_player):
    """
    Check if current_player can immediately win,
    or if the opponent can immediately win next turn (then block).
    """
    # Immediate win
    for col in get_valid_moves(board):
        temp_board = make_move(board, col, current_player)
        if check_winner(temp_board, current_player):
            return col
    
    # Block opponent
    opponent = get_next_player(current_player)
    for col in get_valid_moves(board):
        temp_board = make_move(board, col, opponent)
        if check_winner(temp_board, opponent):
            return col
    
    return None  
One more function I think I need to explain looked like the code below. This ai_move function would call the find_immediate_win_or_block function to look for an immediate win or block and if there wasn't an immediate win or block would then call the mcts function to look for a column to drop its piece. However, when I gave the link to family and friends, they complained that they couldn't ever win. So I added a 1 through 5 difficulty setting that the user can set. And if you look at the updated Github code, level 5 will do this original code, but levels 1-4 will choose a probabilistic amount of times that it will not do the MCTS and instead it will just drop a piece randomly. There are other ways to handle difficulty level, such as lessening the search time or space, but for Connect 4, mixing in a random move seems to work well.
def ai_move(board, current_player, difficulty):
    """
    Decide AI move:
    1. Check immediate win/block
    2. Otherwise use MCTS
    """
        
    col = find_immediate_win_or_block(board, current_player)
                  
    if col is not None:
        return col
                          
    return mcts(board, current_player, simulations=300, time_limit=1.0)
Initially, I wrote this Connect 4 code where the output was displayed in the terminal, but I wanted to share it with other people. But at the same time I wanted a minimalist implementation, so I did the unusual idea of doing it as a Streamlit application using emoticons. Then end result is here, which you can play.


Non-typical MCTS: Customer Journey

Now that we have an explanation and example code of the typical application of MCTS two player game, we can talk about an example of non-typical application. There are many, many possible applications, but let's just look at one, which is an e-commerce platform where the goal is to guide consumers through a personalized journey (e.g., product recommendations, promotional offers, and website interactions) to maximize conversions, repeat purchases, and customer satisfaction.

In this type of example we have:
  • States in the customer journey:
    • Customer demographics and preferences.
    • Interaction history (e.g., clicks, time spent, abandoned carts).
    • Contextual data (e.g., time of day, device type, or browsing session).
  • Actions
    • Recommending specific products.
    • Sending a targeted promotional email or notification.
    • Offering discounts or free shipping.
    • Displaying upsell or cross-sell opportunities.
  • Reward Function
    • Conversion events (e.g., purchases or sign-ups).
    • Total revenue or profit generated.
    • Customer satisfaction and engagement metrics (e.g., session length, feedback).
  • Simulations
    • Rollouts simulate different sequences of actions to predict customer responses, using probabilistic models of consumer behavior (e.g., likelihood of purchase or churn)
  • Balancing of Exploration and Exploitation
    • MCTS balances exploring new strategies (e.g., trying novel product bundles or marketing messages) with exploiting known successful approaches (e.g., strategies that previously led to high-value purchases).

For example, a customer visits an online store and views a product but doesn’t add it to their cart. The platform needs to decide: should it recommend a similar product?, offer a discount on the viewed product? or highlight customer reviews or provide an FAQ link to address hesitation? Using MCTS, the platform simulates potential outcomes of these actions (e.g., conversion likelihood) and dynamically selects the one with the highest long-term reward, considering both immediate revenue and customer loyalty.

Using MCTS for the customer journey provides dynamic personalization, long term optimization, and is scalable and probabilistic. The MCTS driven system adapts to real-time customer interactions, creating highly tailored journeys. Instead of focusing on one off conversions, the system optimizes for lifetime customer value. MCTS handles the uncertainty and variability in consumer behavior effectively, especially for large scale platforms. Using a MCTS algorithm could enable smarter decision making and better shopping experiences.

Conclusion

Monte Carlo Tree Search (MCTS) is a robust and adaptable algorithm capable of addressing a diverse range of decision making situations. It has been successfully applied in domains such as board games, but could be applied atypically in areas like robotics, education, biotechnology, and e-commerce. This is primarily because of its versatility and efficacy in complex and uncertain search spaces. By balancing exploration and exploitation, MCTS optimizes decision making in contexts where exhaustive searches are computationally infeasible.

Additionally, there are many other variations of MCTS besides the ones I have presented as outlined in this paper by Browne et al. and organized in a really nice presentation by Bobak Pezeshki here. These variations involve different ways to explore the tree, expand nodes, simulate the playouts, policies, and backpropation.

And very, very recently Microsoft came out with this exciting paper called "rStar-Math: Small LLMs Can Master Math Reasoningwith Self-Evolved Deep Thinking." The paper shows how a small model could be trained to rival the metrics of the OpenAI o1 model in math. It does this by "exercising deep thinking” through Monte Carlo Tree Search (MCTS)" - once again showing the importance of MCTS in AI.

MCTS with its flexibility, probabilistic nature, and adaptability to domain specific heuristics, excels in both the typical uses of traditional games and atypically in complex real world problems making it an important part of AI innovation and optimizing diverse systems.

No comments:

Post a Comment

Elements of Monte Carlo Tree Search - Typical and Non-typical Applications

Monte Carlo Tree Search (MCTS) offers a very intuitive way of tackling challenging decision making problems. In essence, MCTS combines the...