Chapter 4. Playing games with tree search

This chapter covers

  • Finding the best move with the minimax algorithm
  • Pruning minimax tree search to speed it up
  • Applying Monte Carlo tree search to games

Suppose you’re given two tasks. The first is to write a computer program that plays chess. The second is to write a program that plans how to efficiently pick orders in a warehouse. What could these programs have in common? At first glance, not much. But if you step back and think in abstract terms, you can see a few parallels:

  • You have a sequence of decisions to make. In chess, your decisions are about which piece to move. In the warehouse, your decisions are about which item to pick up next.
  • Early decisions can affect future decisions. In chess, moving a pawn early on may expose your queen to a counterattack many turns later. In the warehouse, if you go for a widget on shelf 17 first, you may need to backtrack all the way to shelf 99 later.
  • At the end of a sequence, you can evaluate how well you achieved your goal. In chess, when you reach the end of the game, you know who won. In the warehouse, you can add up the time it took to gather all the items.
  • The number of possible sequences can get enormous. There are around 10^100 ways a chess game can unfold. In the warehouse, if you have 20 items to pick up, there are about 2.4 billion billion possible paths to choose from (that's 20 factorial).

Of course, this analogy only goes so far. In chess, for example, you alternate turns with an opponent who is actively trying to thwart your intentions. That doesn’t happen in any warehouse you’d want to work in.

In computer science, tree-search algorithms are strategies for looping over many possible sequences of decisions to find the one that leads to the best outcome. In this chapter, we cover tree-search algorithms as they apply to games; many of the principles can extend to other optimization problems as well. We start with the minimax search algorithm, in which you switch perspectives between two opposing players on each turn. The minimax algorithm can find perfect lines of play but is too slow to apply to sophisticated games. Next, we take a look at two techniques for getting a useful result while searching only a tiny fraction of the tree. One of these is pruning: you speed up the search by eliminating sections of the tree. To prune effectively, you need to bring real-world knowledge of the problem into your code. When that’s not possible, you can sometimes apply Monte Carlo tree search (MCTS). MCTS is a randomized search algorithm that can find a good result without any domain-specific code.

With these techniques in your toolkit, you can start building AIs that can play a variety of board and card games.

4.1. Classifying games

Tree-search algorithms are mainly relevant to games where you take turns, and a discrete set of options is available on each turn. Many board and card games fit this description. On the other hand, tree search won’t help a computer play basketball, charades, or World of Warcraft. We can further classify board and card games according to two useful properties:

  • Deterministic vs. nondeterministic: In a deterministic game, the course of the game depends only on the players’ decisions. In a nondeterministic game, an element of randomness is involved, such as rolling dice or shuffling cards.
  • Perfect information vs. hidden information: In perfect information games, both players can see the full game state at all times; the whole board is visible, or everyone’s cards are on the table. In hidden information games, each player can see only part of the game state. Hidden information is common in card games, where each player is dealt a few cards and can’t see what the other players are holding. Part of the appeal of hidden information games is guessing what the other players know based on their game decisions.

Table 4.1 shows how several well-known games fit into this taxonomy.

Table 4.1. Classifying board and card games
 

                       Deterministic            Nondeterministic
Perfect information    Go, chess                Backgammon
Hidden information     Battleship, Stratego     Poker, Scrabble

In this chapter, we primarily focus on deterministic, perfect information games. On each turn of such games, one move is theoretically the best. There’s no luck and no secrets; before you choose a move, you know every move your opponent might choose in response, and everything you might do after that, and so on to the end of the game. In theory, you should have the whole game planned out on the first move. The minimax algorithm does exactly that in order to come up with the perfect play.

In reality, the games that have stood the test of time, such as chess and Go, have an enormous number of possibilities. To humans, every game seems to take on a life of its own, and even computers can’t calculate them all the way to the end.

All the examples in this chapter contain little game-specific logic, so you can adapt them to any deterministic, perfect information game. To do so, you can follow the pattern of our goboard module and implement your new game logic in classes such as Player, Move, and GameState. The essential functions for GameState are apply_move, legal_moves, is_over, and winner. We’ve done this for tic-tac-toe; you can find it in the ttt module on GitHub (http://mng.bz/gYPe).
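
To make that pattern concrete, here’s a minimal sketch of the interface such a GameState needs to expose, filled in with the rules of a deliberately tiny game (single-pile Nim: take one to three stones; whoever takes the last stone wins). The Nim rules are only an illustration and aren’t part of the goboard or ttt modules; what matters is the shape of the apply_move, legal_moves, is_over, and winner methods that the search code in this chapter relies on.

import enum

class Player(enum.Enum):
    first = 1
    second = 2

    def other(self):
        return Player.second if self == Player.first else Player.first

class GameState:
    def __init__(self, stones_left, next_player):
        self.stones_left = stones_left
        self.next_player = next_player

    def apply_move(self, move):
        # Returns the new GameState after the current player takes `move` stones.
        return GameState(self.stones_left - move, self.next_player.other())

    def legal_moves(self):
        return [n for n in (1, 2, 3) if n <= self.stones_left]

    def is_over(self):
        return self.stones_left == 0

    def winner(self):
        # The player who took the last stone (the previous player) wins.
        if not self.is_over():
            return None
        return self.next_player.other()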

Games for AI experiments

Need some inspiration? Look up the rules for any of the following games:

  • Chess
  • Checkers
  • Reversi
  • Hex
  • Chinese checkers
  • Mancala
  • Nine Men’s Morris
  • Gomoku

4.2. Anticipating your opponent with minimax search

How can you program a computer to decide what move to make next in a game? To start, you can think about how humans would make the same decision. Let’s start with the simplest deterministic, perfect information game there is: tic-tac-toe. The technical name for the strategy we’ll describe is minimaxing. This term is a contraction of minimizing and maximizing: you’re trying to maximize your score, while your opponent is trying to minimize your score. You can sum up the algorithm in one sentence: assume that your opponent is as smart as you are.

Let’s see how minimaxing works in practice. Take a look at figure 4.1. What move should X make next?

Figure 4.1. What move should X make next? This is an easy one: playing in the lower-right corner wins the game.

There’s no trick here; taking the lower-right corner wins the game. You can make that into a general rule: take any move that immediately wins the game. There’s no way this plan can go wrong. You could implement this rule in code with something like the following listing.

Listing 4.1. A function that finds a move that immediately wins the game
def find_winning_move(game_state, next_player):
  for candidate_move in game_state.legal_moves(next_player):       1
    next_state = game_state.apply_move(candidate_move)             2
    if next_state.is_over() and next_state.winner() == next_player:
      return candidate_move                                        3
  return None                                                      4

  • 1 Loops over all legal moves
  • 2 Calculates what the board would look like if you pick this move
  • 3 This is a winning move! No need to continue searching.
  • 4 Can’t win on this turn

Figure 4.2 illustrates the hypothetical board positions this function would examine. This structure, in which a board position points to possible follow-ups, is called a game tree.

Figure 4.2. An illustration of an algorithm to find the winning move. You start with the position at the top. You loop over every possible move and calculate the game state that would result if you played that move. Then you check whether that hypothetical game state is a winning position for X.

Let’s back up a bit. How did you get in this position? Perhaps the previous position looked like figure 4.3. The O player naively hoped to make three in a row across the bottom. But that assumes that X will cooperate with the plan. This gives a corollary to our previous rule: don’t choose any move that gives your opponent a winning move. Listing 4.2 implements this logic.

Figure 4.3. What move should O make next? If O plays in the lower left, you must assume that X will follow up in the lower right to win the game. O must find the only move that prevents this.

Listing 4.2. A function that avoids giving the opponent a winning move
def eliminate_losing_moves(game_state, next_player):
  opponent = next_player.other()
  possible_moves = []                                               1
  for candidate_move in game_state.legal_moves(next_player):        2
    next_state = game_state.apply_move(candidate_move)              3
    opponent_winning_move = find_winning_move(next_state, opponent) 4
    if opponent_winning_move is None:                               4
      possible_moves.append(candidate_move)                         4
  return possible_moves

  • 1 possible_moves will become a list of all moves worth considering.
  • 2 Loops over all legal moves
  • 3 Calculates what the board would look like if you play this move
  • 4 Does this give your opponent a winning move? If not, this move is plausible.

Now, you know that you must block your opponent from getting into a winning position. Therefore, you should assume that your opponent is going to do the same to you. With that in mind, how can you play to win? Take a look at the board in figure 4.4.

Figure 4.4. What move should X make? If X plays in the center, there will be two ways to complete three in a row: (1) top middle and (2) lower right. O can block only one of these options, so X is guaranteed a win.

If you play in the center, you have two ways to complete three in a row: top middle or lower right. The opponent can’t block them both. We can describe this general principle as follows: look for a move that sets up a subsequent winning move that your opponent can’t block. Sounds complicated, but it’s easy to build this logic on top of the functions you’ve already written.

Listing 4.3. A function that finds a two-move sequence that guarantees a win
def find_two_step_win(game_state, next_player):
  opponent = next_player.other()
  for candidate_move in game_state.legal_moves(next_player):        1
    next_state = game_state.apply_move(candidate_move)              2
    good_responses = eliminate_losing_moves(next_state, opponent)   3
    if not good_responses:                                          3
      return candidate_move                                         3
  return None                                                       4

  • 1 Loops over all legal moves
  • 2 Calculates what the board would look like if you play this move
  • 3 Does your opponent have a good defense? If not, pick this move.
  • 4 No matter what move you pick, your opponent can prevent a win.

Your opponent will anticipate that you’ll try to do this, and also try to block such a play. You can start to see a general strategy forming:

  1. See if you can win on the next move. If so, play that move.
  2. If not, see if your opponent can win on the next move. If so, block that.
  3. If not, see if you can force a win in two moves. If so, play to set that up.
  4. If not, see if your opponent could set up a two-move win on their next move. If so, block that. (A sketch combining these checks follows this list.)

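In code, that checklist might look something like the following sketch, which strings together the three helpers you’ve already written (find_winning_move, eliminate_losing_moves, and find_two_step_win). It only looks a couple of moves ahead; the real generalization comes in the next section.

import random

def choose_move(game_state, next_player):
    opponent = next_player.other()
    # 1. Take an immediate win if you have one.
    move = find_winning_move(game_state, next_player)
    if move is not None:
        return move
    # 2. Block your opponent's immediate win by taking that point yourself.
    move = find_winning_move(game_state, opponent)
    if move is not None:
        return move
    # 3. Set up a two-move win if one exists.
    move = find_two_step_win(game_state, next_player)
    if move is not None:
        return move
    # 4. Otherwise, at least avoid handing your opponent a winning move.
    safe_moves = eliminate_losing_moves(game_state, next_player)
    if safe_moves:
        return random.choice(safe_moves)
    return random.choice(game_state.legal_moves(next_player))
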
Notice that all three of your functions have a similar structure. Each function loops over all valid moves and examines the hypothetical board position that you’d get after playing that move. Furthermore, each function builds on the previous function to simulate what your opponent would do in response. If you generalize this concept, you get an algorithm that can always identify the best possible move.

4.3. Solving tic-tac-toe: a minimax example

In the previous section, you examined how to anticipate your opponent’s play one or two moves ahead. In this section, we show how to generalize that strategy to pick perfect moves in tic-tac-toe. The core idea is exactly the same, but you need the flexibility to look an arbitrary number of moves in the future.

First let’s define an enum that represents the three possible outcomes of a game: win, loss, or draw. These possibilities are defined relative to a particular player: a loss for one player is a win for the other.

Listing 4.4. An enum to represent the outcome of a game
import enum

class GameResult(enum.Enum):
    loss = 1
    draw = 2
    win = 3

Imagine you have a function best_result that takes a game state and tells you the best outcome that a player could achieve from that state. If that player could guarantee a win—by any sequence, no matter how complicated—the best_result function would return GameResult.win. If that player could force a draw, the function would return GameResult.draw. Otherwise, it would return GameResult.loss. If you assume that function already exists, it’s easy to write a function to pick a move: you loop over all possible moves, call best_result, and pick the move that leads to the best result for you. Multiple moves might exist that lead to equal results; you can just pick randomly from them in that case. The following listing shows how to implement this.

Listing 4.5. A game-playing agent that implements minimax search
class MinimaxAgent(Agent):
    def select_move(self, game_state):
        winning_moves = []
        draw_moves = []
        losing_moves = []
        for possible_move in game_state.legal_moves():                    1
            next_state = game_state.apply_move(possible_move)             2
            opponent_best_outcome = best_result(next_state)               3
            our_best_outcome = reverse_game_result(opponent_best_outcome) 3
            if our_best_outcome == GameResult.win:                        4
                winning_moves.append(possible_move)                       4
            elif our_best_outcome == GameResult.draw:                     4
                draw_moves.append(possible_move)                          4
            else:                                                         4
                losing_moves.append(possible_move)                        4
        if winning_moves:                                                 5
            return random.choice(winning_moves)                           5
        if draw_moves:                                                    5
            return random.choice(draw_moves)                              5
        return random.choice(losing_moves)                                5

  • 1 Loops over all legal moves
  • 2 Calculates the game state if you select this move
  • 3 Because your opponent plays next, figure out their best possible outcome from there. Your outcome is the opposite of that.
  • 4 Categorizes this move according to its outcome
  • 5 Picks a move that leads to your best outcome

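Listing 4.5 leans on a small helper, reverse_game_result, that isn’t shown in the listing. A minimal version consistent with the GameResult enum from listing 4.4 looks like this:

def reverse_game_result(game_result):
    # A win for your opponent is a loss for you, and vice versa;
    # a draw looks the same from both sides.
    if game_result == GameResult.loss:
        return GameResult.win
    if game_result == GameResult.win:
        return GameResult.loss
    return GameResult.draw
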
Now the question is how to implement best_result. As in the previous section, you can start from the end of the game and work backward. The following listing shows the easy case: if the game is already over, there’s only one possible result. You just return it.

Listing 4.6. First step of the minimax search algorithm
def best_result(game_state):
    if game_state.is_over():
        if game_state.winner() == game_state.next_player:
            return GameResult.win
        elif game_state.winner() is None:
            return GameResult.draw
        else:
            return GameResult.loss

If you’re somewhere in the middle of the game, you need to search ahead. By now, the pattern should be familiar. You start by looping over all possible moves and calculating the next game state. Then you must assume that your opponent will do their best to counter your hypothetical move. To do so, you can call best_result from this new position. That tells you the result that your opponent can get from the new position; you invert it to find out your result. Out of all the moves you consider, you select the one that leads to the best result for you. Listing 4.7 shows how to implement this logic, which makes up the second half of best_result. Figure 4.5 illustrates the board positions this function will consider for a particular tic-tac-toe board.

Figure 4.5. A tic-tac-toe game tree. In the top position, it’s X’s turn. If X plays in the top center, O can guarantee a win. If X plays in the left center, X will win. If X plays in the right center, O can force a draw. Therefore, X will choose to play in the left center.

Listing 4.7. Implementing minimax search
    best_result_so_far = GameResult.loss
    for candidate_move in game_state.legal_moves():
        next_state = game_state.apply_move(candidate_move)       1
        opponent_best_result = best_result(next_state)           2
        our_result = reverse_game_result(opponent_best_result)   3
        if our_result.value > best_result_so_far.value:          4
            best_result_so_far = our_result
    return best_result_so_far

  • 1 See what the board would look like if you play this move.
  • 2 Find out your opponent’s best move.
  • 3 Whatever your opponent wants, you want the opposite.
  • 4 See if this result is better than the best you’ve seen so far.

If you apply this algorithm to a simple game such as tic-tac-toe, you get an unbeatable opponent. You can play against it and see for yourself: try the play_ttt.py example on GitHub (http://mng.bz/gYPe). In theory, this algorithm would also work for chess, Go, or any other deterministic, perfect information game. In reality, this algorithm is far too slow for any of those games.

4.4. Reducing search space with pruning

In our tic-tac-toe game, you calculated every single possible game in order to find the perfect strategy. There are fewer than 300,000 possible tic-tac-toe games, peanuts for a modern computer. Can you apply the same technique to more interesting games? There are around 500 billion billion (that’s a 5 followed by 20 zeros) possible board positions in checkers, for example. It’s technically possible to search them all on a cluster of modern computers, but it takes years. In chess and Go, there are more possible board positions than there are atoms in the universe (as their fans are quick to point out). Searching them all is out of the question.

To use tree search to play a sophisticated game, you need a strategy to eliminate parts of the tree. Identifying which parts of the tree you can skip is called pruning.

Game trees are two-dimensional: they have width and depth. The width is the number of possible moves from a given board position. The depth is the number of turns from a board position to a final game state—a possible end of the game. Within a game, both these quantities vary from turn to turn.

You generally estimate the size of the tree by thinking of the typical width and typical depth for a particular game. The number of board positions in a game tree is roughly given by the formula W^d, where W is the average width and d is the average depth. Figures 4.6 and 4.7 illustrate the width and depth of a tic-tac-toe game tree. For example, in chess, a player normally has about 30 options per move, and a game lasts about 80 moves; the size of the tree is something like 30^80 ≈ 10^118 positions. Go typically has 250 legal moves per turn, and a game might last 150 turns. This gives a game tree size of 250^150 ≈ 10^359 positions.

Figure 4.6. The width of a tic-tac-toe game tree: the maximum width is 9, because you have 9 possible options on the first move. But the number of legal moves decreases on each turn, so the average width is 4 or 5 moves.

Figure 4.7. The depth of a tic-tac-toe game tree: the maximum depth is 9 moves; after that, the board is full.

This formula, W^d, is an example of exponential growth: the number of positions to consider grows quickly as you increase the search depth. Imagine a game with an average width and depth of about 10. The full game tree would contain 10^10, or 10 billion, board positions to search.

Now suppose you come up with modest pruning schemes. First, you figure out how to quickly eliminate two moves from consideration on each turn, reducing the effective width to 8. Second, you decide you can figure out the game result by looking just 9 moves ahead instead of 10. This leaves you 8^9 positions to search, or around 130 million. Compared to a full search, you’ve eliminated more than 98% of the computation! The key takeaway is that even slightly reducing the width or depth of your search can massively reduce the time required to select a move. Figure 4.8 illustrates the impact of pruning on a small tree.
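
If you want to double-check those numbers, the arithmetic takes only a couple of lines:

full_tree = 10 ** 10        # width 10, depth 10: 10 billion positions
pruned_tree = 8 ** 9        # width 8, depth 9: 134,217,728 positions
print(pruned_tree)                        # about 130 million
print(1 - pruned_tree / full_tree)        # about 0.987, so over 98% of the work is gone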

Figure 4.8. Pruning can quickly shrink a game tree. This tree has width 4 and height 3, for a total of 64 leaves to examine. Suppose you find a way to eliminate 1 of the 4 possible options on each turn. Then you end up with just 27 leaves to visit.

In this section, we cover two pruning techniques: position evaluation functions for reducing the search depth, and alpha-beta pruning for reducing the search width. The two techniques work together to form the backbone of classical board-game AI.

4.4.1. Reducing search depth with position evaluation

If you follow a game tree all the way to the end of the game, you can calculate the winner. What about earlier in the game? Human players normally have a sense of who is leading throughout the midgame. Even beginner Go players have an instinctive feel for whether they’re bossing their opponent around or scrambling to survive. If you can capture that sense in a computer program, you can reduce the depth that you need to search. A function that mimics this sense of who is leading, and by how much, is a position evaluation function.

For many games, the position evaluation function can be handcrafted by using knowledge of the game. For example:

  • Checkers: Count one point for each regular piece on the board, plus two points for each king. Take the value of your pieces and subtract the value of your opponent’s.
  • Chess: Count one point for each pawn, three points for each bishop or knight, five points for each rook, and nine points for the queen. Take the value of your pieces and subtract the value of your opponent’s.

These evaluation functions are highly simplified; top checkers and chess engines use much more sophisticated heuristics. But in both cases, the AI will have the incentive to capture the opponent’s pieces and preserve its own. Furthermore, it’ll be willing to trade its weaker pieces to capture a stronger one.
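
As a concrete (if crude) example, the chess heuristic above boils down to a material count like the sketch below. The PIECE_VALUES table follows the point counts in the text; the pieces(player) accessor is a hypothetical stand-in for however your board representation exposes the pieces still in play.

PIECE_VALUES = {'pawn': 1, 'knight': 3, 'bishop': 3, 'rook': 5, 'queen': 9}

def material_score(game_state, player):
    # Sum your material, subtract your opponent's.
    own = sum(PIECE_VALUES.get(piece, 0) for piece in game_state.pieces(player))
    theirs = sum(PIECE_VALUES.get(piece, 0) for piece in game_state.pieces(player.other()))
    return own - theirs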

In Go, the equivalent heuristic is to add up the stones you’ve captured and then subtract the number of stones your opponent has captured. (Equivalently, you can count the difference in the number of stones on the board.) Listing 4.8 calculates that heuristic. It turns out this isn’t an effective evaluation function. In Go, the threat of capturing stones is much more important than actually capturing them. It’s quite common for a game to last for 100 turns or more before any stones are captured. Crafting a board evaluation function that accurately captures the nuances of the game state turns out to be extremely difficult.

That said, you can use this too-simple heuristic for the purpose of illustrating pruning techniques. This isn’t going to produce a strong bot, but it’s better than picking moves completely at random. In chapters 11 and 12, we cover how to use deep learning to generate a better evaluation function.

After you’ve chosen an evaluation function, you can implement depth pruning. Instead of searching all the way to the end of the game and seeing who won, you search a fixed number of moves ahead and use the evaluation function to estimate who is more likely to win.

Listing 4.8. A highly simplified board evaluation heuristic for Go
def capture_diff(game_state):
    black_stones = 0
    white_stones = 0
    for r in range(1, game_state.board.num_rows + 1):
        for c in range(1, game_state.board.num_cols + 1):
            p = gotypes.Point(r, c)
            color = game_state.board.get(p)
            if color == gotypes.Player.black:
                black_stones += 1
            elif color == gotypes.Player.white:
                white_stones += 1
    diff = black_stones - white_stones                   1
    if game_state.next_player == gotypes.Player.black:   2
        return diff                                      2
    return -1 * diff                                     3

  • 1 Calculate the difference between the number of black stones and white stones on the board. This will be the same as the difference in the number of captures, unless one player passes early.
  • 2 If it’s black’s move, return (black stones) – (white stones).
  • 3 If it’s white’s move, return (white stones) – (black stones).

Figure 4.9 shows a partial game tree with depth pruning. (We’ve left most of the branches out of the diagram to save space, but the algorithm would examine those too.)

Figure 4.9. A partial Go game tree. Here you search the tree to a depth of 2 moves ahead. At that point, you look at the number of captures to evaluate the board. If black chooses the rightmost branch, white can capture a stone, yielding an evaluation of –1 for black. If black chooses the center branch, the black stone is safe (for now). That branch is evaluated at a score of 0. Therefore, black chooses the center branch.

In this tree, you’re searching to a depth of 2 moves ahead, and using the number of captured stones as the board evaluation function. The original position shows black to play; black has a stone with just one liberty. What should black do? If black extends straight down, as shown in the middle branch, the stone is safe (for now). If black plays anywhere else, white can capture the stone—the left branch shows one of the many ways that can happen.

After looking two moves ahead, you apply your evaluation function to the position. In this case, any branch where white captures the stone is scored at +1 for white and –1 for black. All other branches have a score of 0 (there’s no other way to capture a stone in two turns). In this case, black picks the only move that protects the stone.

Listing 4.9 shows how to implement depth pruning. The code looks similar to the full minimax code in listing 4.7: it may be helpful to compare them side by side. Note the differences:

  • Instead of returning a win/lose/draw enum, you return a number indicating the value of your board evaluation function. Our convention is that the score is from the perspective of the player who has the next turn: a large score means the player who has the next move expects to win. When you evaluate the board from your opponent’s perspective, you multiply the score by –1 to flip back to your perspective.
  • The max_depth parameter controls the number of moves you want to search ahead. At each turn, you subtract 1 from this value.
  • When max_depth hits 0, you stop searching and call your board evaluation function.
Listing 4.9. Depth-pruned minimax search
def best_result(game_state, max_depth, eval_fn):
    if game_state.is_over():                                 1
        if game_state.winner() == game_state.next_player:    1
            return MAX_SCORE                                 1
        else:                                                1
            return MIN_SCORE                                 1

    if max_depth == 0:                                       2
        return eval_fn(game_state)                           2

    best_so_far = MIN_SCORE
    for candidate_move in game_state.legal_moves():          3
        next_state = game_state.apply_move(candidate_move)   4
        opponent_best_result = best_result(                  5
            next_state, max_depth - 1, eval_fn)              5
        our_result = -1 * opponent_best_result               6
        if our_result > best_so_far:                         7
            best_so_far = our_result                         7

    return best_so_far

  • 1 If the game is already over, you know who the winner is.
  • 2 You’ve reached your maximum search depth. Use your heuristic to decide how good this sequence is.
  • 3 Loop over all possible moves.
  • 4 See what the board would look like if you play this move.
  • 5 Find the opponent’s best result from this position.
  • 6 Whatever your opponent wants, you want the opposite.
  • 7 See if this is better than the best result you’ve seen so far.

Feel free to experiment with your own evaluation functions. It can be fun to see how they affect your bot’s personality, and you can certainly do a bit better than our simple example.
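
For instance, one small variation on capture_diff is to give a slight extra weight to stones away from the edge, nudging the bot toward the center. The 0.1 bonus below is arbitrary, and this sketch reuses only the board accessors already shown in listing 4.8, so you can drop it in as an alternative eval_fn and compare the two bots’ behavior.

def weighted_stone_diff(game_state):
    score = 0.0
    for r in range(1, game_state.board.num_rows + 1):
        for c in range(1, game_state.board.num_cols + 1):
            p = gotypes.Point(r, c)
            color = game_state.board.get(p)
            if color is None:
                continue
            value = 1.0
            on_edge = (r in (1, game_state.board.num_rows) or
                       c in (1, game_state.board.num_cols))
            if not on_edge:
                value += 0.1                  # small bonus for stones off the edge
            if color == game_state.next_player:
                score += value
            else:
                score -= value
    return score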

4.4.2. Reducing search width with alpha-beta pruning

Look at the diagram in figure 4.10. It’s black’s move, and you’re considering playing on the point marked with a square. If you do so, white can then play at A to capture four stones. Clearly that’s a disaster for black! What if white replies at B instead? Well, who cares? White’s response at A is bad enough already. From black’s perspective, you don’t really care if A is the absolute best move white can pick. As soon as you find one powerful response, you can reject playing at the point marked with a square and move on to the next option. That’s the idea behind alpha-beta pruning.

Figure 4.10. The black player is considering playing at the point marked with a square. If black plays there, white can respond at A to capture four stones. That result is so bad for black that you can immediately reject playing on the square; there’s no need for black to consider other white responses, such as B.

Let’s walk through how the alpha-beta algorithm would apply to that position. Alpha-beta pruning starts like a regular depth-pruned tree search. Figure 4.11 shows the first step. You pick the first move to evaluate for black; that move is marked with an A in the diagram. Then you fully evaluate that move up to a depth of 3. You can see that no matter how white responds, black can capture at least two stones. So you evaluate this branch to a score of +2 for black.

Figure 4.11. Step 1 in alpha-beta pruning: you fully evaluate the first possible black move; this move is evaluated at a score of +2 for black. So far, the algorithm is exactly the same as the depth-pruned search covered in the previous section.

Now you consider the next candidate move for black, marked as B in figure 4.12. Just as in the depth-pruned search, you look at all possible white responses and evaluate them one by one. White can play in the upper-left corner to capture four stones; that branch gets evaluated at a score of –4 for black. Now, you already know that if black plays at A, black is guaranteed a score of at least +2. If black plays at B, you just saw how white can hold black to a score of –4; possibly white can do even better. But because –4 is already worse than +2, there’s no need to search further. You can skip evaluating a dozen other white responses—and each of those positions has many more combinations after it. The amount of computation you save adds up quickly, and you still select the exact same move that you would’ve chosen with a full search to depth 3.

Figure 4.12. Step 2 in alpha-beta pruning: you now evaluate the second possible black move. Here, white has a response that captures four stones. That branch evaluates to –4 for black. As soon as you evaluate that white response, you can discard this move for black entirely and skip the other possible white responses. It’s possible that white has an even better reply that you didn’t evaluate, but all you need to know is that playing at B is worse for black than playing at A.

For the purposes of this example, we chose a specific order to evaluate the moves to illustrate how the pruning works. Our actual implementation evaluates moves in the order of their board coordinates. The time savings you get by alpha-beta pruning depends on how quickly you find good branches. If you happen to evaluate the best branches early on, you can eliminate the other branches quickly. In the worst case, you evaluate the best branch last, and then alpha-beta pruning is no faster than a full depth-pruned search.

To implement the algorithm, you must track the best result so far for each player throughout your search. These values are traditionally called alpha and beta, which is where the name of the algorithm comes from. In our implementation, we call those values best_black and best_white.

Listing 4.10. Checking whether you can stop evaluating a branch
def alpha_beta_result(game_state, max_depth,
                      best_black, best_white, eval_fn):
...
        if game_state.next_player == Player.white:
            # Update our benchmark for white.
            if best_so_far > best_white:          1
                best_white = best_so_far          1
            outcome_for_black = -1 * best_so_far  2
            if outcome_for_black < best_black:    2
                return best_so_far                2

  • 1 Updates your benchmark for white
  • 2 You’re picking a move for white; it needs to be only strong enough to eliminate black’s previous move. As soon as you find something that defeats black’s best option, you can stop searching.

You can extend your implementation of depth pruning to include alpha-beta pruning as well. Listing 4.10 shows the key new addition. This block is implemented from white’s perspective; you need a similar block for black.

First, you check whether you need to update the best_white score. Next, you check whether you can stop evaluating moves for white. You do this by comparing the current score to the best score you’ve found for black in any branch. If white can hold black to a lower score, black won’t choose this branch; you don’t need to find the absolute best score.

The full implementation of alpha-beta pruning is shown in the following listing.

Listing 4.11. Full implementation of alpha-beta pruning
def alpha_beta_result(game_state, max_depth,
                      best_black, best_white, eval_fn):
    if game_state.is_over():                                   1
        if game_state.winner() == game_state.next_player:      1
            return MAX_SCORE                                   1
        else:                                                  1
            return MIN_SCORE                                   1

    if max_depth == 0:                                         2
        return eval_fn(game_state)                             2

    best_so_far = MIN_SCORE
    for candidate_move in game_state.legal_moves():            3
        next_state = game_state.apply_move(candidate_move)     4
        opponent_best_result = alpha_beta_result(              5
            next_state, max_depth - 1,                         5
            best_black, best_white,                            5
            eval_fn)                                           5
        our_result = -1 * opponent_best_result                 6

        if our_result > best_so_far:
            best_so_far = our_result                           7
        if game_state.next_player == Player.white:             7
            if best_so_far > best_white:                       8
                best_white = best_so_far                       8
            outcome_for_black = -1 * best_so_far               9
            if outcome_for_black < best_black:                 9
                return best_so_far                             9
        elif game_state.next_player == Player.black:
            if best_so_far > best_black:                       10
                best_black = best_so_far                       10
            outcome_for_white = -1 * best_so_far               11
            if outcome_for_white < best_white:                 11
                return best_so_far                             11

    return best_so_far

  • 1 Check if the game is already over.
  • 2 You’ve reached your maximum search depth. Use your heuristic to decide how good this sequence is.
  • 3 Loop over all valid moves.
  • 4 See what the board would look like if you play this move.
  • 5 Find out your opponent’s best result from that position.
  • 6 Whatever your opponent wants, you want the opposite.
  • 7 See whether this result is better than the best you’ve seen so far.
  • 8 Update your benchmark for white.
  • 9 You’re picking a move for white; it needs to be only strong enough to eliminate black’s previous move.
  • 10 Update your benchmark for black.
  • 11 You’re picking a move for black; it needs to be only strong enough to eliminate white’s previous move.

4.5. Evaluating game states with Monte Carlo tree search

For alpha-beta pruning, you used a position evaluation function to help reduce the number of positions you had to consider. But position evaluation in Go is very, very hard: your simple heuristic based on captures won’t fool many Go players. Monte Carlo tree search (MCTS) provides a way to evaluate a game state without any strategic knowledge about the game. Instead of a game-specific heuristic, the MCTS algorithm simulates random games to estimate how good a position is. One of these random games is called a rollout or a playout. In this book, we use the term rollout.

Monte Carlo tree search is part of the larger family of Monte Carlo algorithms, which use randomness to analyze extremely complex situations. The element of randomness inspired the name, an allusion to the famous casino district in Monaco.

It may seem impossible that you can build a good strategy out of picking random moves. A game AI that chooses completely random moves is going to be extremely weak, of course. But when you pit two random AIs against each other, the opponent is equally clueless. If black consistently wins more often than white, it must be because black started with an advantage. Therefore, you can figure out whether a position gives one player an advantage by starting random games from there. And you don’t need any understanding of why the position is good to do this.

It’s possible to get unbalanced results by chance. If you simulate 10 random games and white wins seven, how confident can you be that white had an advantage? Not very: white has won only two more games than you’d expect by chance. If black and white were perfectly balanced, there’s about a 30% chance of seeing a 7-out-of-10 result. On the other hand, if white wins 70 out of 100 random games, you can be virtually certain that the starting position did favor white. The key idea is that your estimate gets more accurate as you do more rollouts.

Each round of the MCTS algorithm takes three steps:

  1. Add a new board position to the MCTS tree.
  2. Simulate a random game from that position.
  3. Update the tree statistics with the results of that random game.

You repeat this process as many times as you can in the time available. Then the statistics at the top of the tree tell you which move to pick.

Let’s step through a single round of the MCTS algorithm. Figure 4.13 shows an MCTS tree. At this point in the algorithm, you’ve already completed a number of rollouts and built up a partial tree. Each node tracks the counts of who won the rollouts that started from any board position after that node. Every node’s count includes the sum of all its children. (Normally, the tree would have many more nodes at this point; in the diagram, we’ve omitted many of the nodes to save space.)

Figure 4.13. An MCTS game tree. The top of the tree represents the current board position; you’re trying to find the next move for black. At this point in the algorithm, you’ve performed 70 random rollouts from various possible positions. Each node tracks statistics on all the rollouts that started from any of its children.

At each round, you add a new game position to the tree. First, you pick a node at the bottom of the tree (a leaf) where you want to add a new child. This tree has five leaves. To get the best results, you need to be a little careful about the way you pick a leaf; section 4.5.2 covers a good strategy for doing so. For now, just suppose that you walked all the way down the leftmost branches. From that point, you randomly pick the next move, calculate the new board position, and add that node to the tree. Figure 4.14 shows what the tree looks like after that process.

Figure 4.14. Adding a new node to an MCTS tree. Here you select the leftmost branch as the place to insert a new node. You then randomly select the next move from the position to create the new node in the tree.

The new node in the tree is the starting point for a random game. You simulate the rest of the game, literally just selecting any legal play at each turn, until the game is over. Then you count up the score and find the winner. In this case, let’s suppose the winner is white. You record the result of this rollout in the new node. In addition, you walk up to all the node’s ancestors and add the new rollout to their counts as well. Figure 4.15 shows what the tree looks like after this step is complete.

Figure 4.15. Updating MCTS after a new rollout. In this scenario, the rollout is a win for white. You add that win into the new tree node and all its parent nodes.

That whole process is a single round of MCTS. Every time you repeat it, the tree gets bigger, and the estimates at the top get more accurate. Normally, you stop after a fixed number of rounds or after a fixed amount of time passes. At that point, you select the move that has the highest winning percentage.

4.5.1. Implementing Monte Carlo tree search in Python

Now that you’ve walked through the MCTS algorithm, let’s look at the implementation details. First, you’ll design a data structure to represent the MCTS tree. Next, you’ll write a function to execute the MCTS rollouts.

As shown in listing 4.12, you start by defining a new class, MCTSNode, to represent any node in your tree. Each MCTSNode will track the following properties:

  • game_state—The current state of the game (board position and next player) at this node in the tree.
  • parent—The parent MCTSNode that led to this one. You can set parent to None to indicate the root of the tree.
  • move—The last move that directly led to this node.
  • children—A list of all child nodes in the tree.
  • win_counts and num_rollouts—Statistics about the rollouts that started from this node.
  • unvisited_moves—A list of all legal moves from this position that aren’t yet part of the tree. Whenever you add a new node to the tree, you pull one move out of unvisited_moves, generate a new MCTSNode for it, and add it to the children list.
Listing 4.12. A data structure to represent an MCTS tree
class MCTSNode(object):
    def __init__(self, game_state, parent=None, move=None):
        self.game_state = game_state
        self.parent = parent
        self.move = move
        self.win_counts = {
            Player.black: 0,
            Player.white: 0,
        }
        self.num_rollouts = 0
        self.children = []
        self.unvisited_moves = game_state.legal_moves()

An MCTSNode can be modified in two ways. You can add a new child to the tree, and you can update its rollout stats. The following listing shows both functions.

Listing 4.13. Methods to update a node in an MCTS tree
    def add_random_child(self):
        index = random.randint(0, len(self.unvisited_moves) - 1)
        new_move = self.unvisited_moves.pop(index)
        new_game_state = self.game_state.apply_move(new_move)
        new_node = MCTSNode(new_game_state, self, new_move)
        self.children.append(new_node)
        return new_node

    def record_win(self, winner):
        self.win_counts[winner] += 1
        self.num_rollouts += 1

Finally, you add three convenience methods to access useful properties of your tree node:

  • can_add_child reports whether this position has any legal moves that haven’t yet been added to the tree.
  • is_terminal reports whether the game is over at this node; if so, you can’t search any further from here.
  • winning_frac returns the fraction of rollouts that were won by a given player.

These functions are implemented in the following listing.

Listing 4.14. Helper methods to access useful MCTS tree properties
    def can_add_child(self):
        return len(self.unvisited_moves) > 0

    def is_terminal(self):
        return self.game_state.is_over()

    def winning_frac(self, player):
        return float(self.win_counts[player]) / float(self.num_rollouts)

Having defined the data structure for the tree, you can now implement the MCTS algorithm. You start by creating a new tree. The root node is the current game state. Then you repeatedly generate rollouts. In this implementation, you loop for a fixed number of rounds for each turn; other implementations may run for a specific length of time instead.

Each round begins by walking down the tree until you find a node where you can add a child (any board position that has a legal move that isn’t yet in the tree). The select_move function hides the work of choosing the best branch to explore; we cover the details in the next section.

After you find a suitable node, you call add_random_child to pick any follow-up move and bring it into the tree. At this point, node is a newly created MCTSNode that has zero rollouts.

You now start a rollout from this node by calling simulate_random_game. The implementation of simulate_random_game is identical to the bot_v_bot example covered in chapter 3.

Finally, you update the win counts of the newly created node and all its ancestors. This whole process is implemented in the following listing.

Listing 4.15. The MCTS algorithm
class MCTSAgent(agent.Agent):
    def select_move(self, game_state):
        root = MCTSNode(game_state)

        for i in range(self.num_rounds):
            node = root
            while (not node.can_add_child()) and (not node.is_terminal()):
                node = self.select_child(node)

            if node.can_add_child():                             1
                node = node.add_random_child()                   1

            winner = self.simulate_random_game(node.game_state)  2

            while node is not None:                              3
                node.record_win(winner)                          3
                node = node.parent                               3

  • 1 Adds a new child node into the tree
  • 2 Simulates a random game from this node
  • 3 Propagates the score back up the tree
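
Listing 4.15 also relies on simulate_random_game, which isn’t repeated here; as noted above, it’s essentially the bot_v_bot loop from chapter 3. A sketch, assuming the RandomBot agent from chapter 3 is available as agent.naive.RandomBot, looks like this:

    @staticmethod
    def simulate_random_game(game_state):
        bots = {
            Player.black: agent.naive.RandomBot(),
            Player.white: agent.naive.RandomBot(),
        }
        while not game_state.is_over():               # play random moves until the game ends
            bot_move = bots[game_state.next_player].select_move(game_state)
            game_state = game_state.apply_move(bot_move)
        return game_state.winner()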

After you’ve completed the allotted rounds, you need to pick a move. To do so, you just loop over all the top-level branches and pick the one with the best winning percentage. The following listing shows how to implement this.

Listing 4.16. Selecting a move after completing your MCTS rollouts
class MCTSAgent:
...
    def select_move(self, game_state):
...
        best_move = None
        best_pct = -1.0
        for child in root.children:
            child_pct = child.winning_frac(game_state.next_player)
            if child_pct > best_pct:
                best_pct = child_pct
                best_move = child.move
        return best_move

4.5.2. How to select which branch to explore

Your game AI has a limited amount of time to spend on each turn, which means you can perform only a fixed number of rollouts. Each rollout improves your evaluation of a single possible move. Think of your rollouts as a limited resource: if you spend an extra rollout on move A, you have to spend one fewer rollout on move B. You need a strategy to decide how to allocate your limited budget. The standard strategy is called the upper confidence bound for trees, or UCT formula. The UCT formula strikes a balance between two conflicting goals.

The first goal is to spend your time looking at the best moves. This goal is called exploitation (you want to exploit any advantage that you’ve discovered so far). You’d spend more rollouts on the moves with the highest estimated winning percentage. Now, some of those moves have a high winning percentage just by chance. But as you complete more rollouts through those branches, your estimates get more accurate. The false positives will drop lower down the list.

On the other hand, if you’ve visited a node only a few times, your estimate may be way off. By sheer chance, you may have a low estimate for a move that’s really good. Spending a few more rollouts there may reveal its true quality. So your second goal is to get more accurate evaluations for the branches you’ve visited the least. This goal is called exploration.

Figure 4.16 compares a search tree biased toward exploitation against a tree biased toward exploration. The exploitation-exploration trade-off is a common feature of trial-and-error algorithms. It’ll come up again when we look at reinforcement learning later in the book.

Figure 4.16. The exploitation-exploration trade-off. In both game trees, you’ve visited seven board positions. On top, the search is biased more toward exploitation: the tree is deeper for the most promising moves. On the bottom, the search is biased more toward exploration: it tries more moves, but to less depth.

For each node you’re considering, you compute the winning percentage w to represent the exploitation goal. To represent exploration, you compute, where N is the total number of rollouts, and n is the number of rollouts that started with the node under consideration. This specific formula has a theoretical basis; for our purposes, just note that its value will be largest for the nodes you’ve visited the least.

You combine these two components to get the UCT formula:

UCT score = w + c × √(log N / n)

Here, c is a parameter that represents your preferred balance between exploitation and exploration. The UCT formula gives you a score for each node, and the node with the highest UCT score is the start of the next rollout.

With a larger value of c, you’ll spend more time visiting the least-explored nodes. With a smaller value of c, you’ll spend more time gathering a better evaluation of the most promising node. The choice of c that makes the most effective game player is usually found via trial and error. We suggest starting somewhere around 1.5 and experimenting from there. The parameter c is sometimes called the temperature. When the temperature is “hotter,” your search will be more volatile, and when the temperature is “cooler,” your search will be more focused.

Listing 4.17 shows how to implement this policy. After you’ve identified the metric you want to use, selecting a child is a simple matter of calculating the formula for each node and choosing the node with the largest value. Just as in minimax search, you need to switch your perspective on each turn. You calculate the winning percentage from the point of view of the player who’d pick the next move, so that perspective alternates between black and white as you walk down the tree.

Listing 4.17. Selecting a branch to explore with the UCT formula
import math

def uct_score(parent_rollouts, child_rollouts, win_pct, temperature):
    exploration = math.sqrt(math.log(parent_rollouts) / child_rollouts)
    return win_pct + temperature * exploration

class MCTSAgent:
...
    def select_child(self, node):
        total_rollouts = sum(child.num_rollouts for child in node.children)

        best_score = -1
        best_child = None
        for child in node.children:
            score = uct_score(
                total_rollouts,
                child.num_rollouts,
                child.winning_frac(node.game_state.next_player),
                self.temperature)
            if score > best_score:
                best_score = score
                best_child = child
        return best_child

4.5.3. Applying Monte Carlo tree search to Go

In the previous section, you implemented the general form of the MCTS algorithm. Straightforward MCTS implementations can reach the amateur 1 dan level for Go, the level of a strong amateur player. Combining MCTS with other techniques can produce a bot that’s a fair bit stronger than that; many of the top Go AIs today use both MCTS and deep learning. If you’re interested in reaching a competitive level with your MCTS bot, this section covers some of the practical details to consider.

Fast code makes a strong bot

MCTS starts to be a viable strategy for full-size (19 × 19) Go at around 10,000 rollouts per turn. The implementation in this chapter isn’t fast enough to do that: you’ll be waiting several minutes for it to choose each move. You’ll need to optimize your implementation a bit in order to complete that many rollouts in a reasonable amount of time. On small boards, on the other hand, even your reference implementation makes a fun opponent.

All else being equal, more rollouts means a better decision. You can always make your bot stronger just by speeding up the code so as to squeeze more rollouts in the same amount of time. It’s not just the MCTS-specific code that’s relevant. The code that calculates captures, for example, is called hundreds of times per rollout. All the basic game logic is fair game for optimization.

Better rollout policies make better evaluations

The algorithm for selecting moves during random rollouts is called the rollout policy. The more realistic your rollout policy is, the more accurate your evaluations will be. In chapter 3, you implemented a RandomAgent that plays Go; in this chapter, you used your RandomAgent as your rollout policy. But it’s not quite true that the RandomAgent chooses moves completely randomly with no Go knowledge at all. First, you programmed it not to pass or resign before the board is full. Second, you programmed it not to fill its own eyes, so it wouldn’t kill its own stones at the end of the game. Without this logic, the rollouts would be less accurate.

Some MCTS implementations go further and implement more Go-specific logic in their rollout policy. Rollouts with game-specific logic are sometimes called heavy rollouts; by contrast, rollouts that are close to purely random are sometimes called light rollouts.

One way to implement heavy rollouts is to build up a list of basic tactical shapes that are common in Go, along with a known response. Anywhere you find a known shape on the board, you look up the known response and boost its probability of being selected. You don’t want to always pick the known response as a hard-and-fast rule; that will remove the vital element of randomness from the algorithm.
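
One simple way to express “boost its probability without making it a hard rule” is to sample the rollout move with weights, as in the sketch below. The match_patterns helper is hypothetical; it stands in for whatever pattern matcher your rollout policy uses to suggest moves for the current position.

import random

def weighted_rollout_move(game_state, match_patterns, boost=10.0):
    # Every legal move gets weight 1; pattern-suggested moves get a larger weight,
    # so they're more likely (but never guaranteed) to be chosen.
    candidates = game_state.legal_moves()
    suggested = match_patterns(game_state)
    weights = [boost if move in suggested else 1.0 for move in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]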

One example is in figure 4.17. This is a 3 × 3 local pattern in which a black stone is in danger of getting captured on white’s next turn. Black can save it, at least temporarily, by extending. This isn’t always the best move; it’s not even always a good move. But it’s more likely to be a good move than any random point on the board.

Figure 4.17. An example of a local tactical pattern. When you see the shape on the left, you should consider the response on the right. A policy that follows tactical patterns like this one won’t be especially strong, but will be much stronger than choosing moves completely at random.

Building up a good list of these patterns requires some knowledge of Go tactics. If you’re curious about other tactical patterns that you can use in heavy rollouts, we suggest looking at the source code for Fuego (http://fuego.sourceforge.net/) or Pachi (https://github.com/pasky/pachi), two open source MCTS Go engines.

Be careful when implementing heavy rollouts. If the logic in your rollout policy is slow to compute, you can’t execute as many rollouts. You may end up wiping out the gains of the more sophisticated policy.

A polite bot knows when to resign

Making a game AI isn’t just an exercise in developing the best algorithm. It’s also about creating a fun experience for the human opponent. Part of that fun comes from giving the human player the satisfaction of winning. The first Go bot you implemented in this book, the RandomAgent, is maddening to play against. After the human player inevitably pulls ahead, the random bot insists on continuing until the entire board is full. Nothing is stopping the human player from walking away and mentally scoring the game as a win. But this somehow feels unsporting. It’s a much better experience if your bot can gracefully resign instead.

You can easily add human-friendly resignation logic on top of a basic MCTS implementation. The MCTS algorithm computes an estimated winning percentage in the process of selecting a move. Within a single turn, you compare these numbers to decide what move to pick. But you can also compare the estimated winning percentage at different points in the same game. If these numbers are dropping, the game is tilting in the human’s favor. When the best option carries a sufficiently low winning percentage, say 10%, you can make your bot resign.
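
The check itself is only a few lines tacked onto the end of select_move (compare listing 4.16). The 10% threshold is a reasonable starting point rather than a magic number, and this sketch assumes the Move.resign() helper from chapter 3’s goboard module.

        resign_threshold = 0.1
        if best_move is None or best_pct < resign_threshold:
            return goboard.Move.resign()      # the game looks hopeless; resign gracefully
        return best_move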

4.6. Summary

  • Tree-search algorithms evaluate many possible sequences of decisions to find the best one. Tree search comes up in games as well as general optimization problems.
  • The variant of tree search that applies to games is minimax tree search. In minimax search, you alternate between two players with opposing goals.
  • Full minimax tree search is practical in only extremely simple games (for example, tic-tac-toe). To apply it to sophisticated games (such as chess or Go), you need to reduce the size of the tree that you search.
  • A position evaluation function estimates which player is more likely to win from a given board position. With a good position evaluation function, you don’t have to search all the way to the end of a game in order to make a decision. This strategy is called depth pruning.
  • Alpha-beta pruning reduces the number of moves you need to consider at each turn, making it practical for games as complex as chess. The idea of alpha-beta pruning is intuitive: when evaluating a possible move, if you find a single strong countermove from your opponent, you can immediately drop that move from consideration entirely.
  • When you don’t have a good position evaluation heuristic, you can sometimes use Monte Carlo tree search. This algorithm simulates random games from a particular position and tracks which player wins more often.