Chapter 14. AlphaGo Zero: Integrating tree search with reinforcement learning

This chapter covers

  • Playing games with a variation on Monte Carlo tree search
  • Integrating tree search into self-play for reinforcement learning
  • Training a neural network to enhance a tree-search algorithm

After DeepMind revealed the second edition of AlphaGo, code-named Master, Go fans all over the world scrutinized its shocking style of play. Master’s games were full of surprising new moves. Although Master was bootstrapped from human games, it was continuously enhanced with reinforcement learning, and that enabled it to discover new moves that humans didn’t play.

This led to an obvious question: what if AlphaGo didn’t rely on human games at all, but instead learned entirely using reinforcement learning? Could it still reach a superhuman level, or would it get stuck playing with beginners? Would it rediscover patterns played by human masters, or would it play in an incomprehensible new alien style? All these questions were answered when AlphaGo Zero (AGZ) was announced in 2017.

AlphaGo Zero was built on an improved reinforcement-learning system, and it trained itself from scratch without any input from human games. Although its first games were worse than any human beginner’s, AGZ improved steadily and quickly surpassed every previous edition of AlphaGo.

To us, the most astonishing thing about AlphaGo Zero is how it does more with less. In many ways, AGZ is much simpler than the original AlphaGo. No more handcrafted feature planes. No more human game records. No more Monte Carlo rollouts. Instead of two neural networks and three training processes, AlphaGo Zero used one neural network and one training process.

And yet AlphaGo Zero was stronger than the original AlphaGo! How is that possible?

First, AGZ used a truly massive neural network. The strongest version ran on a network with capacity roughly equivalent to 80 convolution layers—over four times the size of the original AlphaGo network.

Second, AGZ used an innovative new reinforcement-learning technique. The original AlphaGo trained its policy network alone, in a manner similar to what we described in chapter 10; later that policy network was used to improve tree search. In contrast, AGZ integrated tree search with reinforcement learning from the beginning. This algorithm is the focus of this chapter.

To start, we go over the structure of the neural network that AGZ trains. Next, we describe the tree-search algorithm in depth. AGZ uses the same tree search in both self-play and competitive games. After that, we cover how AGZ trains its network from its experience data. To wrap up, we briefly cover a few practical tricks that AGZ uses to make the training process stable and efficient.

14.1. Building a neural network for tree search

AlphaGo Zero uses a single neural network with one input and two outputs: one output produces a probability distribution over moves, and the other output produces a single value representing whether the game favors white or black. This is the same structure we used for actor-critic learning in chapter 12.

There’s one small difference between the output of the AGZ network and the network we used in chapter 12, and it concerns how passing is handled. In previous cases where we implemented self-play, we hardcoded logic around passing to end the game. For example, the PolicyAgent self-play bot from chapter 9 included custom logic so that it wouldn’t fill in its own eyes, thereby killing its own stones. If the only legal moves were self-destructive, the PolicyAgent would pass. This ensured that self-play games ended in a sensible place.

Because AGZ uses a tree search during self-play, you don’t need that custom logic. You can treat a pass the same as any other move and expect the bot to learn when passing is appropriate. If the tree search reveals that playing a stone will lose the game, it’ll pick a pass instead. This means your action output needs to return a probability for passing along with every point on the board. Instead of returning a vector of size 19 × 19 = 361, with one element per point on the board, your network will produce a vector of size 19 × 19 + 1 = 362: one element per point, plus one for the pass move. Figure 14.1 illustrates this new move encoding.

Figure 14.1. Encoding possible moves as a vector. As in previous chapters, AGZ uses a vector where each element maps to a point on the game board. AGZ adds one last element that maps to the pass move. This example is on a 5 × 5 board, so the vector has dimension 26: 25 for the points on the board and 1 for passing.

This means you have to modify the board encoder slightly. In previous board encoders, you implemented encode_point and decode_point_index, which translated between the elements of the vector and points on the board. For an AGZ-style bot, you’ll replace these with new functions, encode_move and decode_move_index. The encoding for playing a stone remains the same; you use the next index to represent a pass.

Listing 14.1. Modifying the board encoder to include passing
class ZeroEncoder(Encoder):
...
    def encode_move(self, move):
        if move.is_play:
            return (self.board_size * (move.point.row - 1) +  1
                (move.point.col - 1))                         1
        elif move.is_pass:
            return self.board_size * self.board_size          2
        raise ValueError('Cannot encode resign move')         3

    def decode_move_index(self, index):
        if index == self.board_size * self.board_size:
            return Move.pass_turn()
        row = index // self.board_size
        col = index % self.board_size
        return Move.play(Point(row=row + 1, col=col + 1))

    def num_moves(self):
        return self.board_size * self.board_size + 1

  • 1 Same point encoding as used in previous encoders
  • 2 Uses the next index to represent a pass
  • 3 The neural network doesn’t learn resignation.
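
If you’d like to convince yourself that the indexing works out, here’s a quick check on a 5 × 5 board. It assumes the Move and Point classes from earlier chapters and that ZeroEncoder takes the board size as its constructor argument.

encoder = ZeroEncoder(5)

encoder.encode_move(Move.play(Point(row=1, col=1)))   # index 0
encoder.encode_move(Move.play(Point(row=3, col=4)))   # 5 * 2 + 3 = 13
encoder.encode_move(Move.pass_turn())                 # 5 * 5 = 25, the pass index

encoder.decode_move_index(25).is_pass                 # True
encoder.decode_move_index(13)                         # Move.play(Point(row=3, col=4))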

Apart from the treatment of passing, the input and output of the AGZ network are identical to what we covered in chapter 12. For the inner layers of the network, AGZ used an extremely deep stack of convolution layers, with a few modern enhancements to make training smoother (we cover those briefly at the end of this chapter). A large network is powerful, but it also requires more computation, both for training and self-play. If you don’t have access to the same kind of hardware as DeepMind, you may have better luck with a smaller network. Feel free to experiment to find the right balance of power and speed for your needs.

As for board encoding, you could use any encoding scheme we covered in this book, from the basic encoder in chapter 6 up to the 48-plane encoder in chapter 13. AlphaGo Zero used the simplest possible encoder: just the location of the black and white stones on the board, plus a plane indicating whose turn it is. (To handle ko, AGZ also included planes for the previous seven board positions.) But there’s no technical reason why you can’t use game-specific feature planes, and it’s possible they’d make learning faster. In part, the researchers wanted to remove as much human knowledge as they could just to prove it was possible. In your own experiments, you should feel free to try different combinations of feature planes while using the AGZ reinforcement-learning algorithm.
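
For reference, here’s a minimal sketch of what a bare-bones encoder along those lines might look like: one plane for the player to move, one for the opponent, and one marking whose turn it is. This isn’t the exact AGZ encoder (which also stacks previous board positions to handle ko); it’s a hypothetical SimpleZeroEncoder that assumes the Board and GameState APIs from earlier chapters.

import numpy as np

from dlgo.gotypes import Player, Point

class SimpleZeroEncoder(Encoder):
    def __init__(self, board_size):
        self.board_size = board_size
        self.num_planes = 3

    def encode(self, game_state):
        board_tensor = np.zeros(self.shape())
        next_player = game_state.next_player
        for r in range(self.board_size):
            for c in range(self.board_size):
                point = Point(row=r + 1, col=c + 1)
                stone = game_state.board.get(point)
                if stone == next_player:
                    board_tensor[0][r][c] = 1        # the player to move
                elif stone is not None:
                    board_tensor[1][r][c] = 1        # the opponent
        if next_player == Player.black:
            board_tensor[2] = 1                      # whose turn it is
        return board_tensor

    def shape(self):
        return self.num_planes, self.board_size, self.board_size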

14.2. Guiding tree search with a neural network

In reinforcement learning, a policy is an algorithm that tells an agent how to make decisions. In previous examples of reinforcement learning, the policy was relatively simple. In policy gradient learning (chapter 10) and actor-critic learning (chapter 12), a neural network directly told you which move to pick: that’s the policy. In Q-learning (chapter 11), the policy involved computing the Q-value for each possible move; then you picked the move with the highest Q.

AGZ’s policy includes a form of tree search. You’ll still use a neural network, but the purpose of the neural network is to guide the tree search, rather than to choose or evaluate moves directly. Including tree search during self-play means the self-play games are more realistic. In turn, that means the training process is more stable.

The tree-search algorithm builds on ideas you’ve already studied. If you’ve studied the Monte Carlo tree-search algorithm (chapter 4) and the original AlphaGo (chapter 13), the AlphaGo Zero tree-search algorithm will seem familiar; table 14.1 compares the three algorithms. First, we’ll describe the data structure that AGZ uses to represent a game tree. Next, we’ll walk through the algorithm that AGZ uses to add a new position to the game tree.

Table 14.1. Comparing tree-search algorithms

                     MCTS                   AlphaGo                                    AlphaGo Zero
Branch selection     UCT score              UCT score + prior from policy network      UCT score + prior from combined network
Branch evaluation    Randomized playouts    Value network + randomized playouts        Value from combined network

The general idea of tree-search algorithms, as they apply to board games, is to find the move that leads to the best outcome. You determine that by examining possible sequences of moves that might follow. But the number of possible sequences is beyond enormous, so you need to make a decision while examining only a tiny fraction of the possible sequences. The art and science of tree-search algorithms is in how to pick the branches to explore in order to get the best result in the least time.

Just as in MCTS, the AGZ tree-search algorithm runs for a certain number of rounds, and in each round it adds another board position to the tree. As you execute more and more rounds, the tree continues to grow larger, and the algorithm’s estimates get more accurate. For the purposes of illustration, imagine that you’re already in the middle of the algorithm: you’ve already built up a partial tree, and you want to expand the tree with a new board position. Figure 14.2 shows such an example game tree.

Figure 14.2. A partial AGZ-style search tree. In this case, it’s black’s turn to move, and the search has explored three possible game states so far. The tree also contains branches that represent moves the search hasn’t yet visited; most of those have been omitted for space.

Each node in the game tree represents a possible board position. From that position, you also know which follow-up moves are legal. The algorithm has already visited some of those follow-up moves, but not all of them. You create a branch for each follow-up move, whether you’ve visited it or not. Each branch tracks the following:

  • A prior probability of the move, indicating how good you expect this move to be, before you try visiting it.
  • The number of times you’ve visited the branch during the tree search. This may be zero.
  • The expected value of all visits that passed through this branch. To make updating this average easier, you store the sum of the values; you can then divide by the visit count to get the average.

For each branch that you have visited, the node also contains a pointer to a child node. In the following listing, you define a minimal Branch structure to contain branch statistics.

Listing 14.2. A structure to track branch statistics
class Branch:
    def __init__(self, prior):
        self.prior = prior
        self.visit_count = 0
        self.total_value = 0.0

Now you’re ready to build the structure that represents the search tree. The following listing defines a ZeroTreeNode class.

Listing 14.3. A node in an AGZ-style search tree
class ZeroTreeNode:
    def __init__(self, state, value, priors, parent, last_move):
        self.state = state
        self.value = value
        self.parent = parent                      1
        self.last_move = last_move                1
        self.total_visit_count = 1
        self.branches = {}
        for move, p in priors.items():
            if state.is_valid_move(move):
                self.branches[move] = Branch(p)
        self.children = {}                        2

    def moves(self):                              3
        return self.branches.keys()               3

    def add_child(self, move, child_node):        4
        self.children[move] = child_node          4

    def has_child(self, move):                    5
        return move in self.children              5

    def get_child(self, move):                    6
        return self.children[move]                6

  • 1 In the root of the tree, parent and last_move will be None.
  • 2 Later, children will map from a Move to another ZeroTreeNode.
  • 3 Returns a list of all possible moves from this node
  • 4 Allows adding new nodes into the tree
  • 5 Checks whether there’s a child node for a particular move
  • 6 Returns a particular child node

The ZeroTreeNode class also includes some helpers for reading the statistics off its children.

Listing 14.4. Helpers to read branch information from a tree node
class ZeroTreeNode:
...
    def expected_value(self, move):
        branch = self.branches[move]
        if branch.visit_count == 0:
            return 0.0
        return branch.total_value / branch.visit_count

    def prior(self, move):
        return self.branches[move].prior

    def visit_count(self, move):
        if move in self.branches:
            return self.branches[move].visit_count
        return 0

14.2.1. Walking down the tree

In each round of the search, you start by walking down the tree. The point is to see what a possible future board position is, so you can evaluate whether it’s good. To get an accurate evaluation, you should assume that your opponent will respond to your moves in the strongest possible way. Of course, you don’t know what the strongest response is yet; you must try out a variety of moves to find out which are good. This section describes an algorithm for selecting strong moves in the face of uncertainty.

The expected value provides an estimate of how good each possible move is. But the estimates aren’t equally accurate. If you’ve spent more time reading out a particular branch, its estimate will be better.

You can continue to read out one of the best variations in more detail, which will further improve its estimate. Alternatively, you can read out a branch that you’ve explored less, in order to improve your estimate of it. Maybe that move is better than you initially thought; the only way to find out is to explore it further. Once again, you see the opposing goals of exploitation and exploration.

The original MCTS algorithm used the UCT (upper confidence bound for trees; see chapter 4) formula to balance these goals. The UCT formula balanced two priorities:

  • If you’ve visited a branch many times, you trust its expected value. In that case, you prefer the branches with higher estimated values.
  • If you’ve visited a branch only a few times, its expected value may be way off. Whether its expected value is good or bad, you want to visit it a few times to improve its estimate.

AGZ adds a third factor:

  • Among branches with few visits, you prefer the ones with a high prior. Those are the moves that intuitively look good, before considering the exact details of this game.

Mathematically, AGZ’s scoring function for each branch looks like this:

Q + c × P√N / (n + 1)

The parts of the equation are as follows:

  • Q is the expected value averaged over all visits through a branch. (It’s zero if you haven’t visited the branch yet.)
  • P is the prior for the move under consideration.
  • N is the number of visits to the parent node.
  • n is the number of visits to the child branch.
  • c is a factor that balances exploration against exploitation (generally, you have to set this by trial and error).

Look at the example in figure 14.3. Branch A has been visited twice and has a slightly good evaluation of Q = 0.1. Branch B has been visited once and has a bad evaluation: Q = –0.5. Branch C has no visits yet but has a prior probability of P = 0.038.

Figure 14.3. Choosing which branch to follow in the AGZ tree search. In this example, you’re considering three branches from the starting position. (In reality, there would be many more possible moves, which we omitted for space.) To choose a branch, you consider the number of times you’ve already visited the branch, your estimated value for the branch, and the prior probability of the move.

Table 14.2 shows how to calculate each component of the score. Branch A has the highest Q component, indicating that you’ve seen some good board positions underneath it. Branch C has the highest uncertainty component: you’ve never visited it, so you have the greatest uncertainty about that branch. Branch B has a lower evaluation than A and more visits than C, so it’s not likely to be a good choice at this point.

Table 14.2. Choosing a branch to follow

             Q       n    N    P        P√N / (n + 1)
Branch A     0.1     2    3    0.068    0.039
Branch B     –0.5    1    3    0.042    0.036
Branch C     0       0    3    0.038    0.065

Assuming you’ve eliminated branch B, how do you choose between branch A and branch C? It depends on the value of the parameter c. A small value of c favors the high-value branch (in this case, A). A large value of c favors the branch with most uncertainty (in this case, C). For example, at c = 1.0, you’d choose A (at a score of 0.139 to 0.065). At c = 4.0, you’d choose C (0.260 to 0.256). Neither is objectively right; it’s just a trade-off. The following listing shows how to calculate this score in Python.
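
If you want to check those numbers yourself, the following snippet recomputes the scores from table 14.2 for both settings of c (the branch statistics are just the ones from this example):

import numpy as np

total_n = 3                                    # N: visits to the parent node
branches = {
    'A': {'q': 0.1,  'n': 2, 'p': 0.068},
    'B': {'q': -0.5, 'n': 1, 'p': 0.042},
    'C': {'q': 0.0,  'n': 0, 'p': 0.038},
}
for c in (1.0, 4.0):
    scores = {
        name: b['q'] + c * b['p'] * np.sqrt(total_n) / (b['n'] + 1)
        for name, b in branches.items()
    }
    best = max(scores, key=scores.get)
    print(c, best, scores)                     # c = 1.0 picks A; c = 4.0 picks C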

Listing 14.5. Choosing a child branch
class ZeroAgent(Agent):
...
    def select_branch(self, node):
        total_n = node.total_visit_count

        def score_branch(move):
            q = node.expected_value(move)
            p = node.prior(move)
            n = node.visit_count(move)
            return q + self.c * p * np.sqrt(total_n) / (n + 1)

        return max(node.moves(), key=score_branch)             1

  • 1 node.moves() is a list of moves. When you pass in key=score_branch, then max will return the move with the highest value of the score_branch function.

After you’ve chosen a branch, you repeat the same calculation on its children to choose the next branch. You continue the same process until you reach a branch with no children.

Listing 14.6. Walking down the search tree
class ZeroAgent(Agent):
...
    def select_move(self, game_state):
        root = self.create_node(game_state)            1

        for i in range(self.num_rounds):               2
            node = root
            next_move = self.select_branch(node)
            while node.has_child(next_move):           3
                node = node.get_child(next_move)
                next_move = self.select_branch(node)

  • 1 The next section shows the implementation of create_node.
  • 2 This is the first step in a process that repeats many times per move. self.num_rounds controls the number of times you repeat the search process.
  • 3 When has_child returns False, you’ve reached the bottom of the tree.

14.2.2. Expanding the tree

At this point, you’ve reached an unexpanded branch of the tree. You can’t search any further, because there’s no node in the tree corresponding to the current move. The next step is to create a new node and add it to the tree.

To create a new node, you take the previous game state and apply the current move to get a new game state. You can then feed the new game state to your neural network, which gives you two valuable things. First, you get the prior estimates for all possible follow-up moves from the new game state. Second, you get the estimated value of the new game state. You use this information to initialize the statistics for the branches from this new node.

Listing 14.7. Creating a new node in the search tree
class ZeroAgent(Agent):
...
    def create_node(self, game_state, move=None, parent=None):
        state_tensor = self.encoder.encode(game_state)
        model_input = np.array([state_tensor])            1
        priors, values = self.model.predict(model_input)
        priors = priors[0]                                2
        value = values[0][0]                              2
        move_priors = {                                   3
            self.encoder.decode_move_index(idx): p        3
            for idx, p in enumerate(priors)               3
        }                                                 3
        new_node = ZeroTreeNode(
            game_state, value,
            move_priors,
            parent, move)
        if parent is not None:
            parent.add_child(move, new_node)
        return new_node

  • 1 The Keras predict function is a batch function that takes an array of examples. You must wrap your state_tensor in an array.
  • 2 Likewise, predict returns arrays with multiple results, so you must pull out the first item.
  • 3 Unpack the priors vector into a dictionary mapping from Move objects to their corresponding prior probabilities.

Finally, you walk back up the tree and update the statistics for every node on the path that led to this new node, as shown in figure 14.4. For each node on this path, you increment the visit count and add the new value to the branch’s running total. At each node, the perspective switches between the black player and the white player, so you need to flip the sign of the value at each step.

Listing 14.8. Expanding the search tree and updating all node statistics
class ZeroTreeNode:
...
    def record_visit(self, move, value):
        self.total_visit_count += 1
        self.branches[move].visit_count += 1
        self.branches[move].total_value += value

class ZeroAgent(Agent):
...
    def select_move(self, game_state):
...
            new_state = node.state.apply_move(next_move)
            child_node = self.create_node(
                new_state, move=next_move, parent=node)

            move = next_move
            value = -1 * child_node.value          1
            while node is not None:
                node.record_visit(move, value)
                move = node.last_move
                node = node.parent
                value = -1 * value

  • 1 At each level in the tree, you switch perspective between the two players. Therefore, you must multiply the value by –1: what’s good for black is bad for white, and vice versa.
Figure 14.4. Expanding an AGZ-style search tree. First, you calculate a new game state. You create a new node from that game state and add it to the tree. The neural network then gives you a value estimate for that game state. Finally, you update all parents of the new node. You increment the visit count N by one and update the average value V. Here, T represents the total value across all visits through a node; that’s just bookkeeping to make it easy to recalculate the average.

This whole process repeats over and over, and the tree expands each time. AGZ used 1,600 rounds per move during the self-play process. In competitive games, you should run the algorithm for as many rounds as you have time for. The bot will choose better and better moves as it performs more rounds.

14.2.3. Selecting a move

After you’ve built up the search tree as deeply as you can, it’s time to select a move. The simplest rule for move selection is to pick the move with the highest visit count.

Why use the visit counts and not the expected value? You can assume the branch with the most visits has a high expected value. Here’s why. Refer to the preceding branch-selection formula. As the number of visits to a branch grows, the factor of 1 / (n + 1) gets smaller and smaller. Therefore, the branch selection function will just pick based on Q. The branches with higher Q values get the most visits.

Now, if a branch has only a few visits, anything’s possible. It may have a small Q or a huge Q. But you also can’t trust its estimate based on a small number of visits. If you just picked the branch with the highest Q, you might get a branch with just one visit, and its true value may be much smaller. That’s why you select based on visit counts instead. It guarantees that you pick a branch with a high estimated value and a reliable estimate.

Listing 14.9. Selecting the move with the highest visit count
class ZeroAgent(Agent):
...
    def select_move(self, game_state):
...
        return max(root.moves(), key=root.visit_count)

In contrast to other self-play agents in this book, the ZeroAgent has no special logic around when to pass. That’s because you include passing in the search tree: you can treat it just like any other move.
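
For reference, here’s how the fragments from listings 14.6, 14.8, and 14.9 fit together into one complete select_move. This is just the pieces you’ve already seen stitched together; section 14.3 adds one more hook to it for recording experience.

class ZeroAgent(Agent):
...
    def select_move(self, game_state):
        root = self.create_node(game_state)

        for i in range(self.num_rounds):
            # Walk down the tree (listing 14.6).
            node = root
            next_move = self.select_branch(node)
            while node.has_child(next_move):
                node = node.get_child(next_move)
                next_move = self.select_branch(node)

            # Expand the tree and walk back up, updating statistics (listing 14.8).
            new_state = node.state.apply_move(next_move)
            child_node = self.create_node(
                new_state, move=next_move, parent=node)

            move = next_move
            value = -1 * child_node.value
            while node is not None:
                node.record_visit(move, value)
                move = node.last_move
                node = node.parent
                value = -1 * value

        # Pick the most-visited move (listing 14.9).
        return max(root.moves(), key=root.visit_count)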

With our ZeroAgent implementation complete, you can now implement your simulate_game function for self-play.

Listing 14.10. Simulating a self-play game
def simulate_game(
        board_size,
        black_agent, black_collector,
        white_agent, white_collector):
    print('Starting the game!')
    game = GameState.new_game(board_size)
    agents = {
        Player.black: black_agent,
        Player.white: white_agent,
    }

    black_collector.begin_episode()
    white_collector.begin_episode()
    while not game.is_over():
        next_move = agents[game.next_player].select_move(game)
        game = game.apply_move(next_move)

    game_result = scoring.compute_game_result(game)
    if game_result.winner == Player.black:
        black_collector.complete_episode(1)
        white_collector.complete_episode(-1)
    else:
        black_collector.complete_episode(-1)
        white_collector.complete_episode(1)

14.3. Training

The training target for the value output is a 1 if the agent won the game, and a –1 if it lost. By averaging over many games, you learn a value between those extremes that indicates your bot’s chance of winning. It’s the exact same setup you used for Q-learning (in chapter 11) and for actor-critic learning (chapter 12).

The action output is a little different. Just as in policy learning (chapter 10) and actor-critic learning (chapter 12), the neural network has an output that produces a probability distribution over legal moves. In policy learning, you trained a network to match the exact move that the agent chose (in cases where the agent won the game). AGZ works differently in a subtle way. It trains its network to match the number of times it visited each move during the tree search.

To illustrate how that can improve its playing strength, think about how the MCTS-style search algorithms work. Assume, for the moment, that you have a value function that’s at least vaguely correct; it doesn’t have to be super precise as long as it roughly separates winning positions from losing positions. Then imagine that you throw out the prior probabilities entirely and run the search algorithm. By design, the search will spend more time in the most promising branches. The branch selection logic makes this happen: the Q component in the UCT formula means high-value branches are selected more often. If you had unlimited time to run the search, it’d eventually converge on the best moves.

After a sufficient number of rounds in the tree search, you can think of the visit counts as the source of truth. You know whether these moves are good or bad because you checked what happens if you play them. So the search counts become your target values for training the prior function.

The prior function tries to predict where the tree search would spend its time, if you gave it plenty of time to run. Armed with a function trained on previous runs, your tree search can save time and go straight to the most important branches. With an accurate prior function, your search can spend just a small number of rounds and still get results similar to a slower search that needed many more. In a sense, you can think of the network as “remembering” what happened in previous searches and using that knowledge to skip ahead.

To set up training in this way, you need to store the search counts after each move. In previous chapters, you used a generic ExperienceCollector that could apply to many RL implementations. In this case, however, the search counts are specific to AGZ, so you’ll make a custom collector. The structure is mostly the same.

Listing 14.11. A specialized experience collector for AGZ-style learning
class ZeroExperienceCollector:
    def __init__(self):
        self.states = []
        self.visit_counts = []
        self.rewards = []
        self._current_episode_states = []
        self._current_episode_visit_counts = []

    def begin_episode(self):
        self._current_episode_states = []
        self._current_episode_visit_counts = []

    def record_decision(self, state, visit_counts):
        self._current_episode_states.append(state)
        self._current_episode_visit_counts.append(visit_counts)

    def complete_episode(self, reward):
        num_states = len(self._current_episode_states)
        self.states += self._current_episode_states
        self.visit_counts += self._current_episode_visit_counts
        self.rewards += [reward for _ in range(num_states)]

        self._current_episode_states = []
        self._current_episode_visit_counts = []
Listing 14.12. Passing along the decision to the experience collector
class ZeroAgent(Agent):
...
    def select_move(self, game_state):
...
        if self.collector is not None:
            root_state_tensor = self.encoder.encode(game_state)
            visit_counts = np.array([
                root.visit_count(
                    self.encoder.decode_move_index(idx))
                for idx in range(self.encoder.num_moves())
            ])
            self.collector.record_decision(
                root_state_tensor, visit_counts)

The action output of your neural network uses a softmax activation. Recall that the softmax activation ensures that its values sum to 1. For training, you should also make sure that the training target sums to 1. To do this, divide each visit count by the sum of all the visit counts; this operation is called normalizing. Figure 14.5 shows an example.

Figure 14.5. Normalizing a vector. During self-play, you track the number of times you visit each move. For training, you must normalize the vector so it sums to 1.
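
In NumPy, this normalization is a one-liner; the counts below are made up for illustration:

import numpy as np

visit_counts = np.array([3, 0, 1, 12, 4])             # raw search counts for one position
search_probs = visit_counts / np.sum(visit_counts)    # normalized target; sums to 1
# array([0.15, 0.  , 0.05, 0.6 , 0.2 ])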

Other than that, the training process looks similar to training an actor-critic network in chapter 12. The following listing shows the implementation.

Listing 14.13. Training the combined network
class ZeroAgent(Agent):
...
    def train(self, experience, learning_rate, batch_size):     1
        num_examples = experience.states.shape[0]

        model_input = experience.states

        visit_sums = np.sum(                                    2
            experience.visit_counts, axis=1).reshape(           2
            (num_examples, 1))                                  2
        action_target = experience.visit_counts / visit_sums    2

        value_target = experience.rewards

        self.model.compile(
            SGD(lr=learning_rate),
            loss=['categorical_crossentropy', 'mse'])
        self.model.fit(
            model_input, [action_target, value_target],
            batch_size=batch_size)

  • 1 See chapter 10 for a discussion of learning_rate and batch_size.
  • 2 Normalizes the visit counts. When you call np.sum with axis=1, it sums along each row of the matrix. The reshape call reorganizes those sums into matching rows. Then you can divide the original counts by their sums.

The overall reinforcement-learning cycle is the same as what you studied in chapters 9 through 12:

  1. Generate a huge batch of self-play games.
  2. Train the model on the experience data.
  3. Test the updated model against the previous version.
  4. If the new version is measurably stronger, switch to the new version.
  5. If not, generate more self-play games and try again.
  6. Repeat as many times as needed.

Listing 14.14 shows an example that runs a single cycle of this process. Fair warning: you’ll need a lot of self-play games to build a strong Go AI from nothing. AlphaGo Zero reached a superhuman level of play, but it took nearly 5 million self-play games to get there.

Listing 14.14. A single cycle of the reinforcement-learning process
board_size = 9
encoder = zero.ZeroEncoder(board_size)

board_input = Input(shape=encoder.shape(), name='board_input')
pb = board_input
for i in range(4):                                          1
    pb = Conv2D(64, (3, 3),                                 1
        padding='same',                                     1
        data_format='channels_first',                       1
        activation='relu')(pb)                              1

policy_conv = \                                             2
    Conv2D(2, (1, 1),                                       2
        data_format='channels_first',                       2
        activation='relu')(pb)                              2
policy_flat = Flatten()(policy_conv)                        2
policy_output = \                                           2
    Dense(encoder.num_moves(), activation='softmax')(       2
        policy_flat)                                        2

value_conv = \                                              3
    Conv2D(1, (1, 1),                                       3
        data_format='channels_first',                       3
        activation='relu')(pb)                              3
value_flat = Flatten()(value_conv)                          3
value_hidden = Dense(256, activation='relu')(value_flat)    3
value_output = Dense(1, activation='tanh')(value_hidden)    3

model = Model(
    inputs=[board_input],
    outputs=[policy_output, value_output])

black_agent = zero.ZeroAgent(
    model, encoder, rounds_per_move=10, c=2.0)              4
white_agent = zero.ZeroAgent(
    model, encoder, rounds_per_move=10, c=2.0)
c1 = zero.ZeroExperienceCollector()
c2 = zero.ZeroExperienceCollector()
black_agent.set_collector(c1)
white_agent.set_collector(c2)

for i in range(5):                                          5
    simulate_game(board_size, black_agent, c1, white_agent, c2)

exp = zero.combine_experience([c1, c2])
black_agent.train(exp, 0.01, 2048)

  • 1 Build a network with four convolutional layers. To build a strong bot, you can add many more layers.
  • 2 Add the action output to the network.
  • 3 Add the value output to the network.
  • 4 We use 10 rounds per move here just so the demo runs quickly. For real training, you’ll need much more; AGZ used 1,600.
  • 5 Simulate five games before training. For real training, you’ll want to train on much larger batches (thousands of games).
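
Listing 14.14 covers only steps 1 and 2 of the cycle. The evaluation step isn’t shown in this chapter, but a minimal sketch might look like the following. It assumes a hypothetical play_game helper that works like simulate_game but returns the winner instead of filling collectors; the 55% threshold matches what the AGZ paper used for promoting a new network.

def evaluate(new_agent, old_agent, board_size,
             num_games=100, required_win_rate=0.55):
    new_wins = 0
    for i in range(num_games):
        if i % 2 == 0:                                 # alternate colors between games
            black, white = new_agent, old_agent
            new_player = Player.black
        else:
            black, white = old_agent, new_agent
            new_player = Player.white
        winner = play_game(board_size, black, white)   # hypothetical helper
        if winner == new_player:
            new_wins += 1
    return new_wins >= required_win_rate * num_games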

14.4. Improving exploration with Dirichlet noise

Self-play reinforcement learning is an inherently random process. Your bot can easily drift in a weird direction, especially early in the training process. To prevent the bot from getting stuck, it’s important to provide a little randomness. That way, if the bot gets fixated on a really terrible move, it’ll have a small chance to learn a better move. In this section, we describe one of the tricks AGZ used to ensure good exploration.

In previous chapters, you used a few different techniques to add variety to a bot’s selections. For example, in chapter 9, you randomly sampled from your bot’s policy output; in chapter 11, you used the ϵ-greedy algorithm: some fraction ϵ of the time, the bot would ignore its model entirely and choose a totally random move instead. In both cases, you added randomness at the time the bot made a decision. AGZ uses a different method to introduce randomness earlier, during the search process.

Imagine that on each turn, you artificially boosted the prior of one or two randomly chosen moves. Early in the search process, the prior controls which branches get explored, so those moves will get extra visits. If they turn out to be bad moves, the search will quickly move on to other branches, so no harm done. But this would ensure that every move gets a few visits occasionally, so the search doesn’t develop blind spots.

AGZ achieves a similar effect by adding noise—small random numbers—to the priors at the root of each search tree. By drawing the noise from a Dirichlet distribution, you get the exact effect described previously: a few moves get an artificial boost, while the others stay untouched. In this section, we explain the properties of the Dirichlet distribution and show how to generate Dirichlet noise by using NumPy.

Throughout this book, you’ve used probability distributions over game moves. When you sample from such a distribution, you get a particular move. The Dirichlet distribution is a probability distribution over probability distributions: when you sample from the Dirichlet distribution, you get another probability distribution. The NumPy function np.random.dirichlet generates samples from a Dirichlet distribution. It takes a vector argument and returns a vector of the same dimension. The following listing shows a few example draws: the result is a vector, and it always sums to 1—meaning the result is itself a valid probability distribution.

Listing 14.15. Using np.random.dirichlet to sample from a Dirichlet distribution
>>> import numpy as np
>>> np.random.dirichlet([1, 1, 1])
array([0.1146, 0.2526, 0.6328])
>>> np.random.dirichlet([1, 1, 1])
array([0.1671, 0.5378, 0.2951])
>>> np.random.dirichlet([1, 1, 1])
array([0.4098, 0.1587, 0.4315])

You can control the output of the Dirichlet distribution with a concentration parameter, usually denoted as α. When α is close to 0, the Dirichlet distribution will generate “lumpy” vectors: most of the values will be close to 0, and just a few values will be larger. When α is large, the samples will be “smooth”: the values will be closer to each other. The following listing shows the effect of changing the concentration parameter.

Listing 14.16. Drawing from a Dirichlet distribution when α is close to zero
>>> import numpy as np

>>> np.random.dirichlet([0.1, 0.1, 0.1, 0.1])   1
array([0.    , 0.044 , 0.7196, 0.2364])
>>> np.random.dirichlet([0.1, 0.1, 0.1, 0.1])
array([0.0015, 0.0028, 0.9957, 0.    ])
>>> np.random.dirichlet([0.1, 0.1, 0.1, 0.1])
array([0.    , 0.9236, 0.0002, 0.0763])

>>> np.random.dirichlet([10, 10, 10, 10])       2
array([0.3479, 0.1569, 0.3109, 0.1842])
>>> np.random.dirichlet([10, 10, 10, 10])
array([0.3731, 0.2048, 0.0715, 0.3507])
>>> np.random.dirichlet([10, 10, 10, 10])
array([0.2119, 0.2174, 0.3042, 0.2665])

  • 1 Drawing from a Dirichlet distribution with a small concentration parameter. The results are “lumpy”: most of the mass is concentrated in one or two elements.
  • 2 Drawing from a Dirichlet distribution with a large concentration parameter. In each result, the mass is spread evenly over all elements of the vector.

This gives you a recipe for modifying your priors. By choosing a small α, you get a distribution in which a few moves have high probabilities and the rest are close to zero. You can then take a weighted average of the true priors with the Dirichlet noise. AGZ used a concentration parameter of 0.03.
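
Here’s a sketch of that recipe in NumPy. The function name is just for illustration; the 25% noise weight matches the value reported in the AGZ paper, and you’d apply this only to the priors at the root of the search tree.

import numpy as np

def add_dirichlet_noise(priors, concentration=0.03, noise_weight=0.25):
    # priors is the vector of root-node move priors, summing to 1.
    noise = np.random.dirichlet([concentration] * priors.shape[0])
    return (1 - noise_weight) * priors + noise_weight * noise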

14.5. Modern techniques for deeper neural networks

Neural network design is a hot research topic. One never-ending problem is how to make training stable on deeper and deeper networks. AlphaGo Zero applied a couple of cutting-edge techniques that are quickly becoming standards. The details are beyond the scope of this book, but we introduce them at a high level here.

14.5.1. Batch normalization

The idea of deep neural networks is that each layer can learn an increasingly high-level representation of the original data. But what exactly are these representations? What we mean is that some meaningful property of the original data will show up as a particular numeric value in the activation of a particular neuron in the layer. But the mapping between those properties and the actual numbers is completely arbitrary. If you multiply every activation in a layer by 2, for example, you haven’t lost any information: you’ve just changed the scale. In principle, such a transformation doesn’t affect the network’s capacity to learn.

But the absolute value of the activations can affect practical training performance. The idea of batch normalization is to shift each layer’s activations so they’re centered around 0, and scale them so the variance is 1. At the start of training, you don’t know what the activations will look like. Batch normalization provides a scheme for learning the correct shift and scale on the fly during the training process; the normalizing transform will adjust as its inputs change throughout training.

How does batch normalization improve training? That’s still an open area of research. The original researchers developed batch normalization in order to reduce internal covariate shift: the activations in any layer tend to drift during training, and that drift increases the learning burden on the later layers. Batch normalization corrects the drift. But more recent research suggests that covariate shift may not be as important as initially thought; instead, the value may lie in the way batch normalization makes the loss landscape smoother.

Although researchers are still studying why batch normalization works, it’s well established that it does work. Keras provides a BatchNormalization layer that you can add to your networks. The following listing shows an example of adding batch normalization to a convolution layer in Keras.

Listing 14.17. Adding batch normalization to a Keras network
from keras.models import Sequential
from keras.layers import Activation, BatchNormalization, Conv2D

model = Sequential()
model.add(Conv2D(64, (3, 3), data_format='channels_first'))
model.add(BatchNormalization(axis=1))                        1
model.add(Activation('relu'))                                2

  • 1 The axis should match the convolution data_format. For channels_first, use axis=1 (the first axis). For channels_last, use axis=-1 (the last axis).
  • 2 The normalization happens between the convolution and the relu activation.

14.5.2. Residual networks

Imagine you’ve successfully trained a neural network that has three hidden layers in the middle. What happens if you add a fourth layer? In theory, this should strictly increase the capacity of your network. In the worst case, when you train the four-layer network, the first three layers could learn the same things they did in the three-layer network, while the fourth layer just passes its input through untouched. You’d hope it could also learn more, but you wouldn’t expect it to learn less. At the very least, you’d expect the deeper network to be able to overfit (memorize the training set in a way that doesn’t necessarily translate to new examples).

In reality, this doesn’t always happen. A four-layer network has more possible ways to organize what it learns than a three-layer network, and because of the quirks of stochastic gradient descent on complicated loss surfaces, adding more layers sometimes leaves you with a network that can’t even overfit. The idea of a residual network is to simplify what the extra layer is trying to learn. If three layers can do an OK job of learning a problem, you can force the fourth layer to focus on learning the gap between whatever the first three layers learned and the objective. (That gap is the residual, hence the name.)

To implement this, you sum the input to your extra layers with the output from your extra layers, as illustrated in figure 14.6. The connection from the previous layer to the summation layer is called a skip connection. Normally, residual networks are organized into small blocks; there are around two or three layers per block with a skip connection beside them. Then you can stack as many blocks together as needed.

Figure 14.6. A residual block. The output of the two inner layers is added to the output of the previous layer. The effect is that the inner layers learn the difference, or residual, between the objective and what the previous layer learned.
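
In Keras, a residual block is a small extension of the batch-normalized convolution stack from listing 14.17. The following is a sketch rather than AGZ’s exact architecture; it assumes the input tensor already has 64 channels so that the sum is well defined.

from keras.layers import Activation, Add, BatchNormalization, Conv2D

def residual_block(x):
    y = Conv2D(64, (3, 3), padding='same', data_format='channels_first')(x)
    y = BatchNormalization(axis=1)(y)
    y = Activation('relu')(y)
    y = Conv2D(64, (3, 3), padding='same', data_format='channels_first')(y)
    y = BatchNormalization(axis=1)(y)
    y = Add()([x, y])                    # the skip connection
    return Activation('relu')(y)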

14.6. Exploring additional resources

If you’re interested in experimenting with AlphaGo Zero-style bots, there are many open source projects inspired by the original AGZ paper. If you want a superhuman Go AI, either to play against or to study the source code, you have an embarrassment of riches.

  • Leela Zero is an open source implementation of an AGZ-style bot. The self-play process is distributed: if you have CPU cycles to spare, you can generate self-play games and upload them for training. At this writing, the community has contributed over 8 million games, and Leela Zero is already strong enough to beat professional Go players. http://zero.sjeng.org/
  • Minigo is another open source implementation, written in Python with TensorFlow. It’s fully integrated with Google Cloud Platform so you can use Google’s public cloud to run experiments. https://github.com/tensorflow/minigo
  • Facebook AI Research implemented the AGZ algorithm on top of its ELF reinforcement learning platform. The result, ELF OpenGo, is now freely available, and it’s among the strongest Go AIs today. https://facebook.ai/developers/tools/elf
  • Tencent has also implemented and trained an AGZ-style bot, which they have released as PhoenixGo. The bot is also known as BensonDarr on the Fox Go server, where it has beaten many of the world’s top players. https://github.com/Tencent/PhoenixGo
  • If Go isn’t your thing, Leela Chess Zero is a fork of Leela Zero that has been adapted to learn chess instead. It’s already at least as strong as human grandmasters, and chess fans have praised its exciting and creative play. https://github.com/LeelaChessZero/lczero

14.7. Wrapping up

That wraps up our introduction to the cutting-edge AI techniques that power modern Go AIs. We encourage you to take matters into your own hands from here: either experiment with your own Go bot, or try applying these modern techniques to other games.

But also think beyond games. When you read about the latest application of machine learning, you now have a framework for understanding what’s happening. Think about the following:

  • What is the model or neural network structure?
  • What is the loss function or training objective?
  • What is the training process?
  • How are the inputs and outputs encoded?
  • How can the model fit in with traditional algorithms or practical software applications?

We hope that we’ve inspired you to try your own experiments with deep learning, whether they’re in games or another field.

14.8. Summary

  • AlphaGo Zero uses a single neural network with two outputs. One output indicates which moves are important, and the other output indicates which player is ahead.
  • The AlphaGo Zero tree-search algorithm is similar to Monte Carlo tree search, with two major differences. Instead of using random games to evaluate a position, it relies solely on a neural network. In addition, it uses a neural network to guide the search toward new branches.
  • AlphaGo Zero’s neural network is trained against the number of times it visited particular moves in the search process. In that way, it’s specifically trained to enhance tree search, rather than to select moves directly.
  • A Dirichlet distribution is a probability distribution over probability distributions. The concentration parameter controls how clumpy the resulting probability distributions are. AlphaGo Zero uses Dirichlet noise to add controlled randomness to its search process, to make sure all moves get explored occasionally.
  • Batch normalization and residual networks are two modern techniques that help you train very deep neural networks.