This chapter covers
After DeepMind revealed the second edition of AlphaGo, code-named Master, Go fans all over the world scrutinized its shocking style of play. Master’s games were full of surprising new moves. Although Master was bootstrapped from human games, it was continuously enhanced with reinforcement learning, and that enabled it to discover new moves that humans didn’t play.
This led to an obvious question: what if AlphaGo didn’t rely on human games at all, but instead learned entirely using reinforcement learning? Could it still reach a superhuman level, or would it get stuck playing with beginners? Would it rediscover patterns played by human masters, or would it play in an incomprehensible new alien style? All these questions were answered when AlphaGo Zero (AGZ) was announced in 2017.
AlphaGo Zero was built on an improved reinforcement-learning system, and it trained itself from scratch without any input from human games. Although its first games were worse than any human beginner’s, AGZ improved steadily and quickly surpassed every previous edition of AlphaGo.
To us, the most astonishing thing about AlphaGo Zero is how it does more with less. In many ways, AGZ is much simpler than the original AlphaGo. No more handcrafted feature planes. No more human game records. No more Monte Carlo rollouts. Instead of two neural networks and three training processes, AlphaGo Zero used one neural network and one training process.
And yet AlphaGo Zero was stronger than the original AlphaGo! How is that possible?
First, AGZ used a truly massive neural network. The strongest version ran on a network with capacity roughly equivalent to 80 convolution layers—over four times the size of the original AlphaGo network.
Second, AGZ used an innovative new reinforcement-learning technique. The original AlphaGo trained its policy network alone, in a manner similar to what we described in chapter 10; later that policy network was used to improve tree search. In contrast, AGZ integrated tree search with reinforcement learning from the beginning. This algorithm is the focus of this chapter.
To start, we go over the structure of the neural network that AGZ trains. Next, we describe the tree-search algorithm in depth. AGZ uses the same tree search in both self-play and competitive games. After that, we cover how AGZ trains its network from its experience data. To wrap up, we briefly cover a few practical tricks that AGZ uses to make the training process stable and efficient.
AlphaGo Zero uses a single neural network with one input and two outputs: one output produces a probability distribution over moves, and the other output produces a single value representing whether the game favors white or black. This is the same structure we used for actor-critic learning in chapter 12.
There’s one small difference between the output of the AGZ network and the network we used in chapter 12, and the difference is around passing in the game. In previous cases where we implemented self-play, we hardcoded logic around passing to end the game. For example, the PolicyAgent self-play bot from chapter 9 included custom logic so that it wouldn’t fill in its own eyes, thereby killing its own stones. If the only legal moves were self-destructive, the PolicyAgent would pass. This ensured that self-play games ended in a sensible place.
Because AGZ uses a tree search during self-play, you don’t need that custom logic. You can treat a pass the same as any other move, and expect the bot to learn when passing is appropriate. If the tree search reveals that playing a stone will lose the game, it’ll pick a pass instead. This means your action output needs to return a probability for passing along with every point on the board. Instead of returning a vector of size 19 × 19 = 361 to represent each point on the board, your network will produce a vector of size 19 × 19 + 1 = 362 to represent each point on the board and the pass move. Figure 14.1 illustrates this new move encoding.
This means you have to modify the board encoder slightly. In previous board encoders, you implemented encode_point and decode_point_index, which translated between the elements of the vector and points on the board. For an AGZ-style bot, you’ll replace these with new functions, encode_move and decode_move_index. The encoding for playing a stone remains the same; you use the next index to represent a pass.
class ZeroEncoder(Encoder): ... def encode_move(self, move): if move.is_play: return (self.board_size * (move.point.row - 1) + 1 (move.point.col - 1)) 1 elif move.is_pass: return self.board_size * self.board_size 2 raise ValueError('Cannot encode resign move') 3 def decode_move_index(self, index): if index == self.board_size * self.board_size: return Move.pass_turn() row = index // self.board_size col = index % self.board_size return Move.play(Point(row=row + 1, col=col + 1)) def num_moves(self): return self.board_size * self.board_size + 1
Apart from the treatment of passing, the input and output of the AGZ network are identical to what we covered in chapter 12. For the inner layers of the network, AGZ used an extremely deep stack of convolution layers, with a few modern enhancements to make training smoother (we cover those briefly at the end of this chapter). A large network is powerful, but it also requires more computation, both for training and self-play. If you don’t have access to the same kind of hardware as DeepMind, you may have better luck with a smaller network. Feel free to experiment to find the right balance of power and speed for your needs.
As for board encoding, you could use any encoding scheme we covered in this book, from the basic encoder in chapter 6 up to the 48-plane encoder in chapter 13. AlphaGo Zero used the simplest possible encoder: just the location of the black and white stones on the board, plus a plane indicating whose turn it is. (To handle ko, AGZ also included planes for the previous seven board positions.) But there’s no technical reason why you can’t use game-specific feature planes, and it’s possible they’d make learning faster. In part, the researchers wanted to remove as much human knowledge as they could just to prove it was possible. In your own experiments, you should feel free to try different combinations of feature planes while using the AGZ reinforcement-learning algorithm.
In reinforcement learning, a policy is an algorithm that tells an agent how to make decisions. In previous examples of reinforcement learning, the policy was relatively simple. In policy gradient learning (chapter 10) and actor-critic learning (chapter 12), a neural network directly told you which move to pick: that’s the policy. In Q-learning (chapter 11), the policy involved computing the Q-value for each possible move; then you picked the move with the highest Q.
AGZ’s policy includes a form of tree search. You’ll still use a neural network, but the purpose of the neural network is to guide the tree search, rather than to choose or evaluate moves directly. Including tree search during self-play means the self-play games are more realistic. In turn, that means the training process is more stable.
The tree-search algorithm builds on ideas you’ve already studied. If you’ve studied the Monte Carlo tree-search algorithm (chapter 4) and the original AlphaGo (chapter 13), the AlphaGo Zero tree-search algorithm will seem familiar; table 14.1 compares the three algorithms. First, we’ll describe the data structure that AGZ uses to represent a game tree. Next, we’ll walk through the algorithm that AGZ uses to add a new position to the game tree.
MCTS |
AlphaGo |
AlphaGo Zero |
|
---|---|---|---|
Branch selection | UCT score | UCT score + prior from policy network | UCT score + prior from combined network |
Branch evaluation | Randomized playouts | Value network + randomized playouts | Value from combined network |
The general idea of tree-search algorithms, as they apply to board games, is to find the move that leads to the best outcome. You determine that by examining possible sequences of moves that might follow. But the number of possible sequences is beyond enormous, so you need to make a decision while examining only a tiny fraction of the possible sequences. The art and science of tree-search algorithms is in how to pick the branches to explore in order to get the best result in the least time.
Just as in MCTS, the AGZ tree-search algorithm runs for a certain number of rounds, and in each round it adds another board position to the tree. As you execute more and more rounds, the tree continues to grow larger, and the algorithm’s estimates get more accurate. For the purposes of illustration, imagine that you’re already in the middle of the algorithm: you’ve already built up a partial tree, and you want to expand the tree with a new board position. Figure 14.2 shows such an example game tree.
Each node in the game tree represents a possible board position. From that position, you also know which follow-up moves are legal. The algorithm has already visited some of those follow-up moves, but not all of them. You create a branch for each follow-up move, whether you’ve visited or not. Each branch tracks the following:
For each branch that you have visited, the node also contains a pointer to a child node. In the following listing, you define a minimal Branch structure to contain branch statistics.
class Branch: def __init__(self, prior): self.prior = prior self.visit_count = 0 self.total_value = 0.0
Now you’re ready to build the structure that represents the search tree. The following listing defines a ZeroTreeNode class.
class ZeroTreeNode: def __init__(self, state, value, priors, parent, last_move): self.state = state self.value = value self.parent = parent 1 self.last_move = last_move 1 self.total_visit_count = 1 self.branches = {} for move, p in priors.items(): if state.is_valid_move(move): self.branches[move] = Branch(p) self.children = {} 2 def moves(self): 3 return self.branches.keys() 3 def add_child(self, move, child_node): 4 self.children[move] = child_node 4 def has_child(self, move): 5 return move in self.children 6
The ZeroTreeNode class also includes some helpers for reading the statistics off its children.
class ZeroTreeNode: ... def expected_value(self, move): branch = self.branches[move] if branch.visit_count == 0: return 0.0 return branch.total_value / branch.visit_count def prior(self, move): return self.branches[move].prior def visit_count(self, move): if move in self.branches: return self.branches[move].visit_count return 0
In each round of the search, you start by walking down the tree. The point is to see what a possible future board position is, so you can evaluate whether it’s good. To get an accurate evaluation, you should assume that your opponent will respond to your moves in the strongest possible way. Of course, you don’t know what the strongest response is yet; you must try out a variety of moves to find out which are good. This section describes an algorithm for selecting strong moves in the face of uncertainty.
The expected value provides an estimate of how good each possible move is. But the estimates aren’t equally accurate. If you’ve spent more time reading out a particular branch, its estimate will be better.
You can continue to read out one of the best variations in more detail, which will further improve its estimate. Alternately, you can read out a branch that you’ve explored less, in order to improve your estimate. Maybe that move is better than you initially thought; the only way to find out is expanding it further. Once again, you see the opposing goals of exploitation and exploration.
The original MCTS algorithm used the UCT (upper confidence bound for trees; see chapter 4) formula to balance these goals. The UCT formula balanced two priorities:
AGZ adds a third factor:
Mathematically, AGZ’s scoring function looks like this:
The parts of the equation are as follows:
Look at the example in figure 14.3. Branch A has been visited twice and has a slightly good evaluation of Q = 0.1. Branch B has been visited once and has a bad evaluation: Q = –0.5. Branch C has no visits yet but has a prior probability of P = 0.038.
Table 14.2 shows how to calculate the uncertainty component. Branch A has the highest Q component, indicating that you’ve seen some good board positions underneath it. Branch C has the highest UCT component: we’ve never visited, so we have the highest uncertainty around that branch. Branch B has a lower evaluation than A, and more visits than C, so it’s not likely to be a good choice at this point.
Q |
n |
N |
P |
P√N / (n + 1) |
|
---|---|---|---|---|---|
Branch A | 0.1 | 2 | 3 | 0.068 | 0.039 |
Branch B | –0.5 | 1 | 3 | 0.042 | 0.036 |
Branch C | 0 | 0 | 3 | 0.038 | 0.065 |
Assuming you’ve eliminated branch B, how do you choose between branch A and branch C? It depends on the value of the parameter c. A small value of c favors the high-value branch (in this case, A). A large value of c favors the branch with most uncertainty (in this case, C). For example, at c = 1.0, you’d choose A (at a score of 0.139 to 0.065). At c = 4.0, you’d choose C (0.260 to 0.256). Neither is objectively right; it’s just a trade-off. The following listing shows how to calculate this score in Python.
class ZeroAgent(Agent): ... def select_branch(self, node): total_n = node.total_visit_count def score_branch(move): q = node.expected_value(move) p = node.prior(move) n = node.visit_count(move) return q + self.c * p * np.sqrt(total_n) / (n + 1) return max(node.moves(), key=score_branch) 1
After you’ve chosen a branch, you repeat the same calculation on its children to choose the next branch. You continue the same process until you reach a branch with no children.
class ZeroAgent(Agent): ... def select_move(self, game_state): root = self.create_node(game_state) 1 for i in range(self.num_rounds): 2 node = root next_move = self.select_branch(node) while node.has_child(next_move): 3 node = node.get_child(next_move) next_move = self.select_branch(node)
At this point, you’ve reached an unexpanded branch of the tree. You can’t search any further, because there’s no node in the tree corresponding to the current move. The next step is to create a new node and add it to the tree.
To create a new node, you take the previous game state and apply the current move to get a new game state. You can then feed the new game state to your neural network, which gives you two valuable things. First, you get the prior estimates for all possible follow-up moves from the new game state. Second, you get the estimated value of the new game state. You use this information to initialize the statistics for the branches from this new node.
class ZeroAgent(Agent): ... def create_node(self, game_state, move=None, parent=None): state_tensor = self.encoder.encode(game_state) model_input = np.array([state_tensor]) 1 priors, values = self.model.predict(model_input) priors = priors[0] 2 value = values[0][0] 2 move_priors = { 3 self.encoder.decode_move_index(idx): p 3 for idx, p in enumerate(priors) 3 } 3 new_node = ZeroTreeNode( game_state, value, move_priors, parent, move) if parent is not None: parent.add_child(move, new_node) return new_node
Finally, you walk back up the tree and update the statistics for each parent that lead to this node, as shown in figure 14.4. For each node in this path, you increment the visit count and update the total expected value. At each node, the perspective switches between the black player and the white player, so you need to flip the sign of the value at each step.
class ZeroTreeNode: ... def record_visit(self, move, value): self.total_visit_count += 1 self.branches[move].visit_count += 1 self.branches[move].total_value += value class ZeroAgent(Agent): ... def select_move(self, game_state): ... new_state = node.state.apply_move(next_move) child_node = self.create_node( new_state, parent=node) move = next_move value = -1 * child_node.value 1 while node is not None: node.record_visit(move, value) move = node.last_move node = node.parent value = -1 * value
This whole process repeats over and over, and the tree expands each time. AGZ used 1,600 rounds per move during the self-play process. In competitive games, you should run the algorithm for as many rounds as you have time for. The bot will choose better and better moves as it performs more rounds.
After you’ve built up the search tree as deeply as you can, it’s time to select a move. The simplest rule for move selection is to pick the move with the highest visit count.
Why use the visit counts and not the expected value? You can assume the branch with the most visits has a high expected value. Here’s why. Refer to the preceding branch-selection formula. As the number of visits to a branch grows, the factor of 1 / (n + 1) gets smaller and smaller. Therefore, the branch selection function will just pick based on Q. The branches with higher Q values get the most visits.
Now, if a branch has only a few visits, anything’s possible. It may have a small Q or a huge Q. But you also can’t trust its estimate based on a small number of visits. If you just picked the branch with the highest Q, you might get a branch with just one visit, and its true value may be much smaller. That’s why you select based on visit counts instead. It guarantees that you pick a branch with a high estimated value and a reliable estimate.
class ZeroAgent(Agent): ... def select_move(self, game_state): ... return max(root.moves(), key=root.visit_count)
In contrast to other self-play agents in this book, the ZeroAgent has no special logic around when to pass. That’s because you include passing in the search tree: you can treat it just like any other move.
With our ZeroAgent implementation complete, you can now implement your simulate_game function for self-play.
def simulate_game( board_size, black_agent, black_collector, white_agent, white_collector): print('Starting the game!') game = GameState.new_game(board_size) agents = { Player.black: black_agent, Player.white: white_agent, } black_collector.begin_episode() white_collector.begin_episode() while not game.is_over(): next_move = agents[game.next_player].select_move(game) game = game.apply_move(next_move) game_result = scoring.compute_game_result(game) if game_result.winner == Player.black: black_collector.complete_episode(1) white_collector.complete_episode(-1) else: black_collector.complete_episode(-1) white_collector.complete_episode(1)
The training target for the value output is a 1 if the agent won the game, and a –1 if it lost. By averaging over many games, you learn a value between those extremes that indicates your bot’s chance of winning. It’s the exact same setup you used for Q-learning (in chapter 11) and for actor-critic learning (chapter 12).
The action output is a little different. Just as in policy learning (chapter 10) and actor-critic learning (chapter 12), the neural network has an output that produces a probability distribution over legal moves. In policy learning, you trained a network to match the exact move that the agent chose (in cases where the agent won the game). AGZ works differently in a subtle way. It trains its network to match the number of times it visited each move during the tree search.
To illustrate how that can improve its playing strength, think about how the MCTS-style search algorithms work. Assume, for the moment, that you have a value function that’s at least vaguely correct; it doesn’t have to be super precise as long as it roughly separates winning positions from losing positions. Then imagine that you throw out the prior probabilities entirely and run the search algorithm. By design, the search will spend more time in the most promising branches. The branch selection logic makes this happen: the Q component in the UCT formula means high-value branches are selected more often. If you had unlimited time to run the search, it’d eventually converge on the best moves.
After a sufficient number of rounds in the tree search, you can think of the visit counts as the source of truth. You know whether these moves are good or bad because you checked what happens if you play them. So the search counts become your target values for training the prior function.
The prior function tries to predict where the tree search would spend its time, if you gave it plenty of time to run. Armed with a function trained on previous runs, your tree search can save time and go straight into searching out the more important branches. With an accurate prior function, your search algorithm can spend just a small number of rollouts, but get similar results to a slower search that required a much larger number of rollouts. In a sense, you can think that the network is “remembering” what happened in previous searches and using that knowledge to skip ahead.
To set up training in this way, you need to store the search counts after each move. In previous chapters, you used a generic ExperienceCollector that could apply to many RL implementations. In this case, however, the search counts are specific to AGZ, so you’ll make a custom collector. The structure is mostly the same.
class ZeroExperienceCollector: def __init__(self): self.states = [] self.visit_counts = [] self.rewards = [] self._current_episode_states = [] self._current_episode_visit_counts = [] def begin_episode(self): self._current_episode_states = [] self._current_episode_visit_counts = [] def record_decision(self, state, visit_counts): self._current_episode_states.append(state) self._current_episode_visit_counts.append(visit_counts) def complete_episode(self, reward): num_states = len(self._current_episode_states) self.states += self._current_episode_states self.visit_counts += self._current_episode_visit_counts self.rewards += [reward for _ in range(num_states)] self._current_episode_states = [] self._current_episode_visit_counts = []
class ZeroAgent(Agent): ... def select_move(self, game_state): ... if self.collector is not None: root_state_tensor = self.encoder.encode(game_state) visit_counts = np.array([ root.visit_count( self.encoder.decode_move_index(idx)) for idx in range(self.encoder.num_moves()) ]) self.collector.record_decision( root_state_tensor, visit_counts)
The action output of your neural network uses a softmax activation. Recall that the softmax activation ensures that its values sum to 1. For training, you should also make sure that the training target sums to 1. To do this, divide the total visit counts by its sum; this operation is called normalizing. Figure 14.5 shows an example.
Other than that, the training process looks similar to training an actor-critic network in chapter 12. The following listing shows the implementation.
class ZeroAgent(Agent): ... def train(self, experience, learning_rate, batch_size): 1 num_examples = experience.states.shape[0] model_input = experience.states visit_sums = np.sum( 2 experience.visit_counts, axis=1).reshape( 2 (num_examples, 1)) 2 action_target = experience.visit_counts / visit_sums 2 value_target = experience.rewards self.model.compile( SGD(lr=learning_rate), loss=['categorical_crossentropy', 'mse']) self.model.fit( model_input, [action_target, value_target], batch_size=batch_size)
The overall reinforcement-learning cycle is the same as what you studied in chapters 9 through 12:
Listing 14.14 shows an example that runs a single cycle of this process. Fair warning: you’ll need a lot of self-play games to build a strong Go AI from nothing. AlphaGo Zero reached a superhuman level of play, but it took nearly 5 million self-play games to get there.
board_size = 9 encoder = zero.ZeroEncoder(board_size) board_input = Input(shape=encoder.shape(), name='board_input') pb = board_input for i in range(4): 1 pb = Conv2D(64, (3, 3), 1 padding='same', 1 data_format='channels_first', 1 activation='relu')(pb) 1 policy_conv = 2 Conv2D(2, (1, 1), 2 data_format='channels_first', 2 activation='relu')(pb) 2 policy_flat = Flatten()(policy_conv) 2 policy_output = 2 Dense(encoder.num_moves(), activation='softmax')( 2 policy_flat) 2 value_conv = 3 Conv2D(1, (1, 1), 3 data_format='channels_first', 3 activation='relu')(pb) 3 value_flat = Flatten()(value_conv) 3 value_hidden = Dense(256, activation='relu')(value_flat) 3 value_output = Dense(1, activation='tanh')(value_hidden) 3 model = Model( inputs=[board_input], outputs=[policy_output, value_output]) black_agent = zero.ZeroAgent( model, encoder, rounds_per_move=10, c=2.0) 4 white_agent = zero.ZeroAgent( model, encoder, rounds_per_move=10, c=2.0) c1 = zero.ZeroExperienceCollector() c2 = zero.ZeroExperienceCollector() black_agent.set_collector(c1) white_agent.set_collector(c2) for i in range(5): 5 simulate_game(board_size, black_agent, c1, white_agent, c2) exp = zero.combine_experience([c1, c2]) black_agent.train(exp, 0.01, 2048)
Self-play reinforcement learning is an inherently random process. Your bot can easily drift in a weird direction, especially early in the training process. To prevent the bot from getting stuck, it’s important to provide a little randomness. That way, if the bot gets fixated on a really terrible move, it’ll have a small chance to learn a better move. In this section, we describe one of the tricks AGZ used to ensure good exploration.
In previous chapters, you used a few different techniques to add variety to a bot’s selections. For example, in chapter 9, you randomly sampled from your bot’s policy output; in chapter 11, you used the ϵ-greedy algorithm: some fraction ϵ of the time, the bot would ignore its model entirely and choose a totally random move instead. In both cases, you added randomness at the time the bot made a decision. AGZ uses a different method to introduce randomness earlier, during the search process.
Imagine that on each turn, you artificially boosted the prior of one or two randomly chosen moves. Early in the search process, the prior controls which branches get explored, so those moves will get extra visits. If they turn out to be bad moves, the search will quickly move on to other branches, so no harm done. But this would ensure that every move gets a few visits occasionally, so the search doesn’t develop blind spots.
AGZ achieves a similar effect by adding noise—small random numbers—to the priors at the root of each search tree. By drawing the noise from a Dirichlet distribution, you get the exact effect described previously: a few moves get an artificial boost, while the others stay untouched. In this section, we explain the properties of the Dirichlet distribution and show how to generate Dirichlet noise by using NumPy.
Throughout this book, you’ve used probability distributions over game moves. When you sample from such a distribution, you get a particular move. The Dirichlet distribution is a probability distribution over probability distributions: when you sample from the Dirichlet distribution, you get another probability distribution. The NumPy function np.random.dirichlet generates samples from a Dirichlet distribution. It takes a vector argument and returns a vector of the same dimension. The following listing shows a few example draws: the result is a vector, and it always sums to 1—meaning the result is itself a valid probability distribution.
>>> import numpy as np >>> np.random.dirichlet([1, 1, 1]) array([0.1146, 0.2526, 0.6328]) >>> np.random.dirichlet([1, 1, 1]) array([0.1671, 0.5378, 0.2951]) >>> np.random.dirichlet([1, 1, 1]) array([0.4098, 0.1587, 0.4315])
You can control the output of the Dirichlet distribution with a concentration parameter, usually denoted as α. When α is close to 0, the Dirichlet distribution will generate “lumpy” vectors: most of the values with be close to 0, and just a few values will be larger. When α is large, the samples will be “smooth”: the values will be closer to each other. The following listing shows the effect of changing the concentration parameter.
>>> import numpy as np >>> np.random.dirichlet([0.1, 0.1, 0.1, 0.1]) 1 array([0. , 0.044 , 0.7196, 0.2364]) >>> np.random.dirichlet([0.1, 0.1, 0.1, 0.1]) array([0.0015, 0.0028, 0.9957, 0. ]) >>> np.random.dirichlet([0.1, 0.1, 0.1, 0.1]) array([0. , 0.9236, 0.0002, 0.0763]) >>> np.random.dirichlet([10, 10, 10, 10]) 2 array([0.3479, 0.1569, 0.3109, 0.1842]) >>> np.random.dirichlet([10, 10, 10, 10]) array([0.3731, 0.2048, 0.0715, 0.3507]) >>> np.random.dirichlet([10, 10, 10, 10]) array([0.2119, 0.2174, 0.3042, 0.2665])
This shows a recipe for modifying your priors. By choosing a small α, you get a distribution in which a few moves have high probabilities, and the rest are close to zero. Then you can take a weighted average of the true priors with the Dirichlet noise. AGZ used a concentration parameter of 0.03.
Neural network design is a hot research topic. One never-ending problem is how to make training stable on deeper and deeper networks. AlphaGo Zero applied a couple of cutting-edge techniques that are quickly becoming standards. The details are beyond the scope of this book, but we introduce them at a high level here.
The idea of deep neural networks is that each layer can learn an increasingly high-level representation of the original data. But what exactly are these representations? What we mean is that some meaningful property of the original data will show up as a particular numeric value in the activation of a particular neuron in the layer. But the mapping between actual numbers is completely arbitrary. If you multiply every activation in a layer by 2, for example, you haven’t lost any information: you’ve just changed the scale. In principle, such a transformation doesn’t affect the network’s capacity to learn.
But the absolute value of the activations can affect practical training performance. The idea of batch normalization is to shift each layer’s activations so they’re centered around 0, and scale them so the variance is 1. At the start of training, you don’t know what the activations will look like. Batch normalization provides a scheme for learning the correct shift and scale on the fly during the training process; the normalizing transform will adjust as its inputs change throughout training.
How does batch normalization improve training? That’s still an open area of research. The original researchers developed batch normalization in order to reduce covariate shift. The activations in any layer tend to drift during training. Batch normalization corrects that drift, reducing the learning burden on the later layers. But the latest research suggests that the covariate shift may not be as important as initially thought. Instead, the value may be in the way batch normalization makes the loss function smoother.
Although researchers are still studying why batch normalization works, it’s well established that it does work. Keras provides a BatchNormalization layer that you can add to your networks. The following listing shows an example of adding batch normalization to a convolution layer in Keras.
from keras.models import Sequential from keras.layers import Activation, BatchNormalization, Conv2D model = Sequential() model.add(Conv2D(64, (3, 3), data_format='channels_first')) model.add(BatchNormalization(axis=1)) 1 model.add(Activation('relu')) 2
Imagine you’ve successfully trained a neural network that has three hidden layers in the middle. What happens if you add a fourth layer? In theory, this should strictly increase the capacity of your network. In the worst case, when you train the four-layer network, the first three layers could learn the same things they did in the three-layer network, while the fourth layer just passes its input untouched. You’d hope it could also learn more, but you wouldn’t expect it to learn less. At least, you’d expect the deeper network should be able to overfit (memorize the training set in a way that doesn’t necessarily translate to new examples).
In reality, this doesn’t always happen. When you try to train a four-layer network, there are more possible ways to organize the data than on a three-layer network. Sometimes, because of the quirks of stochastic gradient descent on complicated loss surfaces, you can add more layers and find you can’t even overfit. The idea of a residual network is to simplify what the extra layer is trying to learn. If three layers can do an OK job learning a problem, you can force the fourth layer to focus on learning the gap between whatever the first three layers learned and the objective. (That gap is the residual, hence the name.)
To implement this, you sum the input to your extra layers with the output from your extra layers, as illustrated in figure 14.6. The connection from the previous layer to the summation layer is called a skip connection. Normally, residual networks are organized into small blocks; there are around two or three layers per block with a skip connection beside them. Then you can stack as many blocks together as needed.
If you’re interested in experimenting with AlphaGo Zero-style bots, there are many open source projects inspired by the original AGZ paper. If you want a superhuman Go AI, either to play against or to study the source code, you have an embarrassment of riches.
That wraps up our introduction to the cutting-edge AI techniques that power modern Go AIs. We encourage you to take matters into your own hands from here: either experiment with your own Go bot, or try applying these modern techniques to other games.
But also think beyond games. When you read about the latest application of machine learning, you now have a framework for understanding what’s happening. Think about the following:
We hope that we’ve inspired you to try your own experiments with deep learning, whether they’re in games or another field.