- Selection
- Expansion
- Simulation
- Backpropagation
\( c \) = constant controlling the exploration-exploitation balance
\( N \) = total number of visits to the current node
\( n \) = number of visits to a child node
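In code, the UCT calculation for selecting a child might look like this minimal sketch (the wins/visits fields are my assumptions about how a node tracks its statistics, not code from the repository):

```python
import math

def uct_value(parent_visits, child_wins, child_visits, c=1.414):
    """Average reward plus an exploration bonus that shrinks as visits grow."""
    if child_visits == 0:
        return float("inf")  # unvisited children get tried first
    exploit = child_wins / child_visits  # average reward so far
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore
```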
In this formula, a larger \( c \) encourages more exploration of less-visited moves. The goal is to favor actions that have high average rewards while still allocating some attempts to actions with few visits (because they might lead to surprising or better results once explored more thoroughly). This is how we get moves that might look to a bystander like "unusual" or "non-standard" play.

After reaching a leaf node, either a node that has not been explored yet or one that is not fully expanded with potential child nodes, the algorithm proceeds to the expansion phase. In this stage, it creates one or more child nodes corresponding to unvisited actions. This expansion step gradually grows the search tree over time, ensuring that new parts of the state space are discovered rather than staying confined to what has already been visited.

The third stage, simulation, often referred to as the playout, is where the "Monte Carlo" part comes into play. From one of the newly added child nodes, MCTS simulates a sequence of moves (often random or guided by a light heuristic) all the way to a terminal outcome. For example, in a game, that outcome might be a win, a loss, or a draw. In a more general planning or scheduling context, it could be a success or failure, or a particular reward value. These random playouts, repeated many times, serve as samples of what might realistically happen if a particular action path is chosen.

Once the simulation is complete, the algorithm performs backpropagation to propagate the result of the simulation back up the tree to the root. Along the path taken in the search tree, each node's statistics - like how many simulations passed through that node, how many ended in a favorable outcome, or the average return - are updated. Over many iterations, these accumulated statistics become increasingly meaningful, revealing which branches of the search tree appear promising and which do not.

MCTS is successful in many applications because it doesn't require a specially devised evaluation function for intermediate states. Given enough playouts, the sampling can approximate the true value of each action. This versatility is why MCTS has been popular in board games like Go, where it formed the backbone of AlphaGo's decision-making system. It is also employed in video game AI, in robotics for path planning, and even in scheduling problems; anywhere the decision space is complex and uncertain.

Yet MCTS is not without challenges. Its effectiveness depends significantly on the quality of the playouts - remember, these playouts take place under a constraint, which is often a time limit. If the simulations are purely random in a very large or complex domain, they might produce misleading estimates. For this reason, many implementations add a small amount of domain knowledge or heuristics to steer simulations toward more likely or more relevant outcomes (I will do this in the Connect 4 example below). Another concern is the size of the search tree itself: while MCTS is more scalable than exhaustive methods, it can still blow up in expansive problem spaces if it does not prune or limit growth effectively. Implementation details such as how states are represented, how child nodes are stored, and how results are backpropagated can significantly impact performance.

In practice, the so-called "anytime" property of MCTS is one of its greatest strengths.
You can let the algorithm run for as many or as few iterations as time allows, and it will offer the best solution it has found up to that point. Longer run times translate into more simulations, deeper exploration, and more refined estimates of action quality, but even with limited time, MCTS can generate a decent approximation of the optimal move. So by blending random sampling with incremental, iterative updates to a tree of possibilities, MCTS circumvents the need for elaborate heuristics or full look-ahead. Through repeated applications of selection, expansion, simulation, and backpropagation, MCTS transforms what might otherwise be an intractable search into a manageable process, making it a "go to" strategy for complex game spaces like Go, Chess, and Othello.

Before moving on to the Connect 4 code, I should mention the algorithm of minimax search with alpha/beta pruning. Historically, minimax has been the "go to" algorithm for games like Chess, and the best Chess engines in the world, such as Stockfish, have been built around minimax. Recently an MCTS-style Chess engine named Leela has competed very well against Stockfish (although technically it's not pure MCTS - it uses a neural network evaluation in place of random rollouts). Minimax with alpha/beta pruning is a fully deterministic search. It systematically explores the game tree to a certain depth (or until terminal states); games like Chess with large branching factors typically must limit the search depth. At the leaf nodes (or at the cutoff depth), an evaluation function (heuristic) estimates the position's value. Pruning uses α (alpha) and β (beta) bounds to skip branches that cannot affect the final minimax decision.

The main difference between the two is that MCTS is probabilistic and minimax is deterministic. Minimax is effective in games with a reasonable branching factor like Chess and Checkers, but in games like Go, the number of states grows exponentially with the depth of the search. So based on this discussion, you're probably thinking that minimax would be the best algorithm for Connect 4, since the space of possible moves and the depth of the search are very limited - and you would be right. Certainly anyone building a game of Connect 4 would almost automatically use a minimax-type algorithm. But with the following code, I'm showing that MCTS can do as well as minimax under a time constraint.
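To make the contrast concrete, here is a bare-bones sketch of minimax with alpha/beta pruning. The helpers `is_terminal`, `evaluate`, `get_moves`, and `apply_move` are hypothetical, not code from the Connect 4 repository:

```python
def alphabeta(state, depth, alpha, beta, maximizing):
    """Minimax to a fixed depth, pruning branches that cannot change the result.
    is_terminal(), evaluate(), get_moves(), apply_move() are hypothetical helpers."""
    if depth == 0 or is_terminal(state):
        return evaluate(state)  # heuristic value at the cutoff or leaf
    if maximizing:
        value = float("-inf")
        for move in get_moves(state):
            value = max(value, alphabeta(apply_move(state, move),
                                         depth - 1, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # beta cutoff: the minimizer will never allow this branch
        return value
    else:
        value = float("inf")
        for move in get_moves(state):
            value = min(value, alphabeta(apply_move(state, move),
                                         depth - 1, alpha, beta, True))
            beta = min(beta, value)
            if alpha >= beta:
                break  # alpha cutoff
        return value
```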
The Github repository can be found here. The most important function for the implementation is in the mcts.py file, in a function unsurprisingly called mcts. This function steps through the four stages I outlined above: selection, expansion, simulation, and backpropagation. It does this iteratively under a time constraint; I have set mine to 1 second, so it has to come up with a move in under a second. In the selection stage, it looks for the child node with the best UCT value. Once that node is found, the expansion stage expands it by taking one untried move and creating a child node. The simulation stage then plays out a random game (rollout) from that child node until there is a winner or a draw. The backpropagation stage propagates the simulation result up the nodes along that tree path. Once the while loop finishes, the child with the highest visit count is the move that will be taken (which in the case of Connect 4 is the column the piece will be dropped into).
When the computer is looking to make its move, it doesn't automatically call this mcts function. As I mentioned above, many games include heuristics to check for a forced Connect 4 win or to block the opponent's forced win. Chess engines incorporate special-case checks for forced checkmates or forced draws before running their search algorithm, such as minimax. Similarly, I created a find_immediate_win_or_block function that the computer consults before calling mcts, so it never misses a forced win or loss.
So this function, shown in the find_immediate_win_or_block listing below, makes sure the computer doesn't miss a forced move before it starts its time-budgeted search.
One more function I need to explain is ai_move, shown in the last listing below. It calls find_immediate_win_or_block to look for an immediate win or block, and if there isn't one, it calls mcts to choose a column to drop its piece. However, when I gave the link to family and friends, they complained that they could never win. So I added a 1-through-5 difficulty setting that the user can set. If you look at the updated Github code, level 5 runs this original code, but levels 1-4 will, some probabilistic fraction of the time, skip the MCTS and just drop a piece randomly (a sketch of this follows the listings below). There are other ways to handle difficulty level, such as reducing the search time or space, but for Connect 4, mixing in random moves seems to work well.
Initially, I wrote this Connect 4 code to display its output in the terminal, but I wanted to share it with other people. At the same time I wanted a minimalist implementation, so I took the unusual approach of building it as a Streamlit application using emoticons. The end result is here, which you can play.
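For flavor, here is a minimal sketch of what rendering a board with emoticons in Streamlit can look like. This is illustrative only; the actual UI code in the repository may differ:

```python
import streamlit as st

EMOJI = {0: "⚪", 1: "🔴", 2: "🟡"}  # empty, player 1, player 2

def draw_board(board):
    """Render a 6x7 Connect 4 grid (list of lists of ints) as rows of emoji."""
    for row in board:
        st.write("".join(EMOJI[cell] for cell in row))
```

The core functions from the repository are listed below.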
```python
def mcts(root_board, current_player, simulations=500, time_limit=1.0):
    """
    Perform MCTS from the root state and return the column of the best move.
    """
    start_time = time.time()
    root_node = MCTSNode(root_board, current_player)

    while (time.time() - start_time) < time_limit:
        # 1. Selection
        node = root_node
        while not node.untried_moves and node.children:
            node = best_child(node)

        # 2. Expansion
        if node.untried_moves:
            node = expand_node(node)

        # 3. Simulation
        winner = simulate_game(node.board, node.current_player)

        # 4. Backpropagation
        backpropagate(node, winner)

    # After time is up, pick the child with the highest visit count.
    best_move_node = max(root_node.children, key=lambda c: c.visits) if root_node.children else None
    if best_move_node is None:
        # fallback if somehow no children
        return random.choice(get_valid_moves(root_board))

    # Return the column that leads to best_move_node
    for col in get_valid_moves(root_board):
        candidate_board = make_move(root_board, col, current_player)
        if candidate_board == best_move_node.board:
            return col
    return random.choice(get_valid_moves(root_board))
```
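The mcts function leans on a handful of helpers defined in mcts.py (MCTSNode, best_child, expand_node, simulate_game, backpropagate). I won't reproduce them all, but the node structure and the expansion step might look roughly like this sketch; the field and helper details are my assumptions, not necessarily the repo's exact code:

```python
class MCTSNode:
    """One board position in the search tree (sketch, not the repo's exact class)."""
    def __init__(self, board, current_player, parent=None):
        self.board = board
        self.current_player = current_player
        self.parent = parent
        self.children = []
        self.untried_moves = get_valid_moves(board)  # columns not yet expanded
        self.visits = 0
        self.wins = 0

def expand_node(node):
    """Take one untried move and create the corresponding child node."""
    col = node.untried_moves.pop()
    child_board = make_move(node.board, col, node.current_player)
    child = MCTSNode(child_board, get_next_player(node.current_player), parent=node)
    node.children.append(child)
    return child
```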
```python
def find_immediate_win_or_block(board, current_player):
    """
    Check if current_player can immediately win, or if the opponent
    can immediately win next turn (then block).
    """
    # Immediate win
    for col in get_valid_moves(board):
        temp_board = make_move(board, col, current_player)
        if check_winner(temp_board, current_player):
            return col

    # Block opponent
    opponent = get_next_player(current_player)
    for col in get_valid_moves(board):
        temp_board = make_move(board, col, opponent)
        if check_winner(temp_board, opponent):
            return col

    return None
```
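Note that this check only looks one move ahead: it catches immediate wins and immediate threats, but deeper forced sequences are left for the search itself to discover.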
```python
def ai_move(board, current_player, difficulty):
    """
    Decide AI move:
    1. Check immediate win/block
    2. Otherwise use MCTS
    """
    col = find_immediate_win_or_block(board, current_player)
    if col is not None:
        return col
    return mcts(board, current_player, simulations=300, time_limit=1.0)
```
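For reference, the difficulty mixing described earlier might look something like the following sketch. The probability scheme here is invented for illustration; the updated repository code may use different numbers:

```python
import random

def ai_move_with_difficulty(board, current_player, difficulty):
    """Levels 1-4 sometimes play randomly; level 5 always searches (sketch)."""
    # Keeping the immediate win/block check at all levels is a design choice;
    # the repo may handle lower levels differently.
    col = find_immediate_win_or_block(board, current_player)
    if col is not None:
        return col
    # Hypothetical scheme: level 1 plays randomly 80% of the time,
    # level 4 only 20%, and level 5 always runs MCTS.
    random_prob = (5 - difficulty) * 0.2
    if random.random() < random_prob:
        return random.choice(get_valid_moves(board))
    return mcts(board, current_player, simulations=300, time_limit=1.0)
```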
Non-typical MCTS: Customer Journey
Now that we have an explanation and example code for the typical application of MCTS, a two-player game, we can look at a non-typical application. There are many, many possibilities, but let's consider just one: an e-commerce platform where the goal is to guide consumers through a personalized journey (e.g., product recommendations, promotional offers, and website interactions) to maximize conversions, repeat purchases, and customer satisfaction.
In this type of example we have the following (a sketch of how these pieces might map to code follows the list):
- States in the customer journey:
  - Customer demographics and preferences.
  - Interaction history (e.g., clicks, time spent, abandoned carts).
  - Contextual data (e.g., time of day, device type, or browsing session).
- Actions:
  - Recommending specific products.
  - Sending a targeted promotional email or notification.
  - Offering discounts or free shipping.
  - Displaying upsell or cross-sell opportunities.
- Reward Function:
  - Conversion events (e.g., purchases or sign-ups).
  - Total revenue or profit generated.
  - Customer satisfaction and engagement metrics (e.g., session length, feedback).
- Simulations:
  - Rollouts simulate different sequences of actions to predict customer responses, using probabilistic models of consumer behavior (e.g., likelihood of purchase or churn).
- Balancing of Exploration and Exploitation:
  - MCTS balances exploring new strategies (e.g., trying novel product bundles or marketing messages) with exploiting known successful approaches (e.g., strategies that previously led to high-value purchases).
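To make the mapping concrete, here is a small, hypothetical sketch of what the simulation (rollout) step could look like in this setting. All names, probabilities, and the toy customer model are invented for illustration:

```python
import random

ACTIONS = ["recommend_product", "send_promo_email", "offer_discount", "show_upsell"]

def simulate_customer_response(state, action):
    """Toy probabilistic customer model (invented numbers for illustration)."""
    purchase_prob = {"recommend_product": 0.10, "send_promo_email": 0.05,
                     "offer_discount": 0.20, "show_upsell": 0.08}[action]
    if random.random() < purchase_prob:
        return state, 1.0, True     # conversion: positive reward, episode ends
    if random.random() < 0.15:
        return state, -0.5, True    # churn: negative reward, episode ends
    return state, 0.0, False       # customer keeps browsing

def rollout(state, max_steps=10):
    """Random playout over marketing actions, returning the total reward.
    In full MCTS, this value would be backpropagated up the tree."""
    total = 0.0
    for _ in range(max_steps):
        action = random.choice(ACTIONS)
        state, reward, done = simulate_customer_response(state, action)
        total += reward
        if done:
            break
    return total
```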