- Selection
- Expansion
- Simulation
- Backpropagation
\( c \) = constant controlling the exploration-exploitation balance
\( N \) = total number of visits to the current node
\( n \) = number of visits to a child node
Here a larger \( c \) encourages more exploration of rarely visited moves. The goal of this formula is to favor actions with high average rewards while still allocating some attempts to actions with few visits, because they might lead to surprising or better results once explored more thoroughly. This is how we get moves that might strike a bystander as "unusual" or "non-standard."

After reaching a leaf node (either a node that has not been explored yet, or one that is not fully expanded with potential child nodes), the algorithm proceeds to the expansion phase. In this stage, it creates one or more child nodes corresponding to unvisited actions. This expansion step gradually grows the search tree over time, ensuring that new parts of the state space are discovered rather than staying confined to what has already been visited.

The third stage, simulation (you will often hear it referred to as the playout or rollout), is where the "Monte Carlo" part comes into play. From one of the newly added child nodes, MCTS simulates a sequence of moves, often random or guided by a light heuristic, all the way to a terminal outcome. In a game, that outcome might be a win, a loss, or a draw. In a more general planning or scheduling context, it could be a success or failure, or a particular reward value. These random playouts, repeated many times, serve as samples of what might realistically happen if a particular action path is chosen.

Once the simulation is complete, the algorithm performs backpropagation to propagate the result back up the tree to the root. Along the path taken in the search tree, each node's statistics (how many simulations passed through that node, how many ended in a favorable outcome, the average return) are updated. Over many iterations, these accumulated statistics become increasingly meaningful, revealing which branches of the search tree appear promising and which do not.

MCTS succeeds in many applications because it doesn't require a specially devised evaluation function for intermediate states. Given enough playouts, the sampling can approximate the true value of each action. This versatility is why MCTS has been popular in board games like Go, where it formed the backbone of AlphaGo's decision-making system. It is also employed in video game AI, in robotics for path planning, and even in scheduling problems; anywhere the decision space is complex and uncertain.

Yet MCTS is not without challenges. Its effectiveness depends significantly on the quality of the playouts, because remember that these playouts take place under a constraint, which is often a time limit. If the simulations are purely random in a very large or complex domain, they might produce misleading estimates. For this reason, many implementations add a small amount of domain knowledge or heuristics to steer simulations toward more likely or more relevant outcomes (I will do this in the Connect 4 example below). Another concern is the size of the search tree itself: while MCTS is more scalable than exhaustive methods, it can still blow up in expansive problem spaces if it does not prune or limit growth effectively. Implementation details such as how states are represented, how child nodes are stored, and how results are backpropagated can significantly impact its performance.

In practice, the so-called "anytime" property of MCTS is one of its greatest strengths.
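To make the selection rule concrete, here is a minimal sketch of how the UCT calculation might be implemented. It assumes a hypothetical node object exposing `wins`, `visits`, and `children` attributes; the Connect 4 code later in this post uses its own `MCTSNode` and `best_child`, so treat this purely as an illustration of the formula above.

```python
import math

def uct_score(child, parent_visits, c=1.41):
    """Average reward plus the exploration bonus from the UCT formula."""
    if child.visits == 0:
        return float("inf")  # unvisited children are always tried first
    exploit = child.wins / child.visits                       # average reward so far
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore

def select_child(node, c=1.41):
    """Selection step: descend to the child with the highest UCT score."""
    return max(node.children, key=lambda ch: uct_score(ch, node.visits, c))
```

A larger `c` inflates the exploration term, so lightly visited children get chosen more often; a smaller `c` leans on the average reward and plays more conservatively.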
You can let the algorithm run for as many or as few iterations as time allows, and it will offer the best solution it has found up to that point. Longer run times translate into more simulations, deeper exploration, and more refined estimates of action quality, but even with limited time, MCTS can generate a decent approximation of the optimal move. So by blending random sampling with incremental, iterative updates to a tree of possibilities, MCTS circumvents the need for elaborate heuristics or full look-ahead. Through repeated applications of selection, expansion, simulation, and backpropagation, MCTS transforms what might otherwise be an intractable search into a manageable process, making it a "go to" strategy for complex game spaces like Go, Chess, and Othello.

Before moving on to the Connect 4 code, I should mention the algorithm of minimax search with alpha/beta pruning. Historically, minimax has been the "go to" algorithm for games like Chess, and the best Chess engines in the world, such as Stockfish, have been built around minimax. Recently, though, an MCTS-style Chess engine named Leela has competed very well with Stockfish (although technically it is not pure MCTS; it uses a neural network in place of random rollouts). Minimax with alpha/beta pruning is a deterministic search. It systematically explores the game tree to a certain depth (or until terminal states); games like Chess with large branching factors typically must limit the search depth. At the leaf nodes (or at the cutoff depth), an evaluation function (heuristic) estimates the position's value. Pruning uses α (alpha) and β (beta) bounds to cut off branches that cannot affect the final minimax decision. The main difference between the two approaches is that MCTS is probabilistic and minimax is deterministic. Minimax is effective in games with a reasonable branching factor, like Chess and Checkers, but in games like Go the number of states grows exponentially with the depth of search. Based on this discussion, you're probably thinking that minimax would be the best algorithm for Connect 4, since the space of possible moves and the depth of search are very limited, and you would be right. Certainly anyone building a Connect 4 game would almost automatically reach for a minimax-type algorithm. But with the following code using MCTS, I'm showing that MCTS can do as well as minimax given a time constraint.
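For comparison, here is a compact, generic sketch of minimax with alpha/beta pruning. It is illustrative only: `evaluate` is an assumed heuristic scoring function, and `get_valid_moves` and `make_move` are assumed to behave like the Connect 4 helpers used in the MCTS code below.

```python
def minimax(board, depth, alpha, beta, maximizing, player, opponent):
    """Minimax with alpha/beta pruning (illustrative sketch).

    `evaluate`, `get_valid_moves`, and `make_move` are assumed helpers:
    evaluate() scores a position from `player`'s point of view.
    """
    moves = get_valid_moves(board)
    if depth == 0 or not moves:
        return evaluate(board, player), None   # heuristic value at the cutoff

    best_move = None
    if maximizing:
        best_value = float("-inf")
        for move in moves:
            value, _ = minimax(make_move(board, move, player),
                               depth - 1, alpha, beta, False, player, opponent)
            if value > best_value:
                best_value, best_move = value, move
            alpha = max(alpha, best_value)
            if alpha >= beta:        # beta cutoff: the minimizer will avoid this branch
                break
        return best_value, best_move
    else:
        best_value = float("inf")
        for move in moves:
            value, _ = minimax(make_move(board, move, opponent),
                               depth - 1, alpha, beta, True, player, opponent)
            if value < best_value:
                best_value, best_move = value, move
            beta = min(beta, best_value)
            if beta <= alpha:        # alpha cutoff: the maximizer already has better
                break
        return best_value, best_move
```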
```python
def mcts(root_board, current_player, simulations=500, time_limit=1.0):
    """
    Perform MCTS from the root state and return the column of the best move.
    """
    start_time = time.time()
    root_node = MCTSNode(root_board, current_player)

    while (time.time() - start_time) < time_limit:
        # 1. Selection
        node = root_node
        while not node.untried_moves and node.children:
            node = best_child(node)

        # 2. Expansion
        if node.untried_moves:
            node = expand_node(node)

        # 3. Simulation
        winner = simulate_game(node.board, node.current_player)

        # 4. Backpropagation
        backpropagate(node, winner)

    # After time is up, pick the child with the highest visit count.
    best_move_node = max(root_node.children, key=lambda c: c.visits) if root_node.children else None
    if best_move_node is None:
        # fallback if somehow no children
        return random.choice(get_valid_moves(root_board))

    # Return the column that leads to best_move_node
    for col in get_valid_moves(root_board):
        candidate_board = make_move(root_board, col, current_player)
        if candidate_board == best_move_node.board:
            return col
    return random.choice(get_valid_moves(root_board))
```
```python
def find_immediate_win_or_block(board, current_player):
    """
    Check if current_player can immediately win, or if the opponent can
    immediately win next turn (then block).
    """
    # Immediate win
    for col in get_valid_moves(board):
        temp_board = make_move(board, col, current_player)
        if check_winner(temp_board, current_player):
            return col

    # Block opponent
    opponent = get_next_player(current_player)
    for col in get_valid_moves(board):
        temp_board = make_move(board, col, opponent)
        if check_winner(temp_board, opponent):
            return col

    return None
```
```python
def ai_move(board, current_player, difficulty):
    """
    Decide AI move:
      1. Check immediate win/block
      2. Otherwise use MCTS
    """
    col = find_immediate_win_or_block(board, current_player)
    if col is not None:
        return col
    return mcts(board, current_player, simulations=300, time_limit=1.0)
```
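To show how these functions might be wired into a turn, here is a tiny usage sketch. The helpers it calls are the ones referenced above; the function name `play_ai_turn` and the win message are illustrative placeholders, not part of the implementation.

```python
def play_ai_turn(board, ai_player, difficulty="hard"):
    """Illustrative only: ask the AI for a column, apply it, and report a win."""
    col = ai_move(board, ai_player, difficulty)        # win/block check, then MCTS
    board = make_move(board, col, ai_player)
    if check_winner(board, ai_player):
        print(f"AI wins by playing column {col}")
    return board
```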
- States in the customer journey:
- Customer demographics and preferences.
- Interaction history (e.g., clicks, time spent, abandoned carts).
- Contextual data (e.g., time of day, device type, or browsing session).
- Actions
- Recommending specific products.
- Sending a targeted promotional email or notification.
- Offering discounts or free shipping.
- Displaying upsell or cross-sell opportunities.
- Reward Function
- Conversion events (e.g., purchases or sign-ups).
- Total revenue or profit generated.
- Customer satisfaction and engagement metrics (e.g., session length, feedback).
- Simulations
- Rollouts simulate different sequences of actions to predict customer responses, using probabilistic models of consumer behavior (e.g., likelihood of purchase or churn); a toy rollout sketch appears after this list.
- Balancing Exploration and Exploitation
- MCTS balances exploring new strategies (e.g., trying novel product bundles or marketing messages) with exploiting known successful approaches (e.g., strategies that previously led to high-value purchases).
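As a rough illustration of the rollout idea in this marketing setting, the sketch below fixes a first action and then plays out a short customer journey under a toy behavior model. Every action name, probability, and reward value in it is invented for illustration; a real system would substitute learned models of purchase and churn behavior.

```python
import random

# Hypothetical action set and toy behavior model (all numbers invented).
ACTIONS = ["recommend_product", "send_email", "offer_discount", "upsell"]
PURCHASE_PROB = {"recommend_product": 0.10, "send_email": 0.05,
                 "offer_discount": 0.20, "upsell": 0.08}
REWARD = {"recommend_product": 40, "send_email": 40,
          "offer_discount": 25, "upsell": 60}

def simulate_journey(first_action, max_steps=5):
    """One random playout of a customer journey starting with a fixed action."""
    total, action = 0.0, first_action
    for _ in range(max_steps):
        if random.random() < PURCHASE_PROB[action]:
            total += REWARD[action]                 # conversion event
        if random.random() < 0.05:
            break                                   # customer churns, journey ends
        action = random.choice(ACTIONS)             # random rollout policy afterward
    return total

# Averaging many playouts per first action gives the Monte Carlo estimates
# that MCTS would backpropagate up its tree.
estimates = {a: sum(simulate_journey(a) for _ in range(1000)) / 1000 for a in ACTIONS}
```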