This chapter covers
If you’re learning to play Go, one of the best ways to improve is to get a stronger player to review your games. Sometimes the most useful feedback just points out where you won or lost the game. The reviewer might give comments like, “You were already far behind by move 30” or “At move 110, you had a winning position, but your opponent turned it around by move 130.”
Why is this feedback helpful? You may not have time to scrutinize all 300 moves in a game, but you can focus your full attention on a 10- or 20-move sequence. The reviewer lets you know which parts of the game are important.
Reinforcement-learning researchers apply this principle in actor-critic learning, which is a combination of policy learning (as covered in chapter 10) and value learning (as covered in chapter 11). The policy function plays the role of the actor: it picks what moves to play. The value function is the critic: it tracks whether the agent is ahead or behind in the course of the game. That feedback guides the training process, in the same way that a game review can guide your own study.
This chapter describes how to make a self-improving game AI with actor-critic learning. The key concept that makes it all work is advantage, the difference between the actual game outcome and the expected outcome. We start by illustrating how advantage can improve the training process. After that, we’re ready to build an actor-critic game agent. First we show how to implement move selection; then we implement the new training process. In both functions, we borrow heavily from the code examples in chapters 10 and 11. The end result is the best of both worlds: it combines the benefits of policy learning and Q-learning into one agent.
In chapter 10, we briefly mentioned the credit-assignment problem. Suppose your learning agent played a game with 200 moves and ultimately won the game. Because it won, you can assume it chose at least a few good moves, but it probably chose a couple of bad moves as well. Credit assignment is the problem of separating the good moves, which you want to reinforce, from the bad moves, which you should ignore. This section introduces the concept of advantage, a formula for estimating how much a particular decision contributed to the final result. First we describe how advantage helps with credit assignment; then we provide code samples showing how to calculate it.
Imagine you’re watching a basketball game; while the fourth quarter ticks down, your favorite player nails a three-pointer. How excited do you get? It depends on the game state. If the score is 80 to 78, you’re probably jumping out of your seat. If the score is 110 to 80, you’re indifferent. What’s the difference? In a close game, a three-point swing creates a huge change in the expected outcome of the game. On the other hand, if the game is a blowout, a single play won’t affect the result. The most important plays happen while the outcome is still in doubt. In reinforcement learning, advantage is a formula that quantifies this concept.
To calculate advantage, you first need an estimate of the value of a state, which we denote as V(s). This is the expected return the agent will see, given that it has already arrived at a particular state s. In games, you can think of V(s) as indicating whether the board position is good for black or white. If V(s) is close to 1, your agent is in a favorable position; if V(s) is close to –1, your agent is losing.
If you recall the action-value function Q(s,a) from the previous chapter, the concept is similar. The difference is that V(s) represents how favorable the board is before you choose a move; Q(s,a) represents how favorable the board is after you choose a move.
The definition of advantage is usually specified as follows:
A = Q(s, a) – V(s)
One way to think of this is that if you’re in a good state (that is, V(s) is high), but you make a terrible move (Q(s,a) is low), you give away your advantage: hence the calculation is negative. One problem with this formula, however, is that you don’t know how to calculate Q(s,a). But you can consider the reward you get at the end of the game as an unbiased estimate of the true Q. So you can wait until you get your reward R, and then estimate the advantage as follows:
A = R – V(s)
That’s the calculation you’ll use to estimate advantage throughout this chapter. Let’s see how this value is useful.
For the purposes of illustration, you’ll pretend that you already have an accurate way to estimate V(s). In reality, your agent learns its value-estimating function and its policy function simultaneously. The next section covers how that works.
Let’s work through a few examples:
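As a minimal sketch of the estimate A = R – V(s), the following snippet evaluates it for three hypothetical situations. The function name estimate_advantage and the specific value estimates are illustrative assumptions, not figures from a real game.

```python
def estimate_advantage(reward, estimated_value):
    """Estimate advantage as A = R - V(s)."""
    return reward - estimated_value


# Hypothetical (final reward, estimated value) pairs, chosen for illustration.
cases = {
    "won from far behind": (1, -0.8),
    "won when already ahead": (1, 0.9),
    "lost from a winning position": (-1, 0.9),
}

advantages = {
    name: estimate_advantage(reward, value)
    for name, (reward, value) in cases.items()
}

for name, advantage in advantages.items():
    print(f"{name}: advantage = {advantage:+.1f}")
```

The decision made from far behind in a game that was eventually won gets a large positive advantage (about 1.8), the decision made when the game was already nearly won gets almost none (about 0.1), and the decision made from a winning position in a game that was eventually lost gets a large negative advantage (about –1.9).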
Figures 12.1 and 12.2 illustrate the advantage calculation for a hypothetical game. In this game, your learning agent slowly pulled ahead over the first few moves; then it made some big mistakes and fell all the way to a lost position. Somewhere before move 150, it suddenly managed to reverse the game and finally cruised to a win. Under the policy gradient technique from chapter 10, you’d weight each move equally in this game. With actor-critic learning, you want to find the most important moves and give them greater weight. The advantage calculation shows you how.
Because the learning agent won, the advantage is given by A(s) = 1 – V(s). In figure 12.2, you can see that the advantage curve has the same shape as the estimated value curve, but flipped upside down. The largest advantage comes while the agent was far behind. Because most players would lose in such a bad situation, the agent must have made a great move somewhere.
After the agent had pulled back ahead, around move 160 or so, its decisions were no longer interesting: the game was already decided. The advantage in that section is close to 0.
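The relationship between the value curve and the advantage curve can be sketched with a few lines of NumPy. The value estimates below are invented to resemble the game described above (a slow lead, a collapse, then a reversal before move 150); they are not the data behind figures 12.1 and 12.2.

```python
import numpy as np

# Hypothetical value estimates V(s), sampled every 25 moves of a
# 200-move game that the learning agent eventually won.
moves = np.arange(0, 201, 25)
values = np.array([0.0, 0.3, 0.5, -0.2, -0.8, -0.9, 0.7, 0.95, 1.0])

reward = 1  # the agent won
advantages = reward - values  # A = R - V(s), the value curve flipped

# The largest advantage sits where the agent was furthest behind.
most_important = int(moves[np.argmax(advantages)])

for move, value, advantage in zip(moves, values, advantages):
    print(f"move {move:3d}: V = {value:+.2f}, A = {advantage:.2f}")
print("largest advantage at move", most_important)
```

With these made-up numbers, the largest advantage lands at move 125, the low point of the value curve, and the advantage shrinks toward 0 once the agent is clearly ahead.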
Later in this chapter, we show how to adjust the training process based on the advantages. Before that, you need to calculate and store advantage through your self-play process.
To calculate advantage, you’ll update your ExperienceCollector that you defined in chapter 9. Originally, an experience buffer tracked three parallel arrays: states, actions, and rewards. You can add a fourth parallel array to track advantages. To fill this array, you need both the estimated value for each state and the final game outcome. You won’t have the latter until the end; so in the middle of the episode, you can accumulate estimated values, and when the game is complete, you can translate those into advantages.
class ExperienceCollector:
    def __init__(self):
        # These lists accumulate experience across many episodes.
        self.states = []
        self.actions = []
        self.rewards = []
        self.advantages = []
        # These lists are reset at the end of every episode.
        self._current_episode_states = []
        self._current_episode_actions = []
        self._current_episode_estimated_values = []
Similarly, you need to update the record_decision method to accept an estimated value along with a state and an action.
class ExperienceCollector:
    ...
    def record_decision(self, state, action, estimated_value=0):
        self._current_episode_states.append(state)
        self._current_episode_actions.append(action)
        self._current_episode_estimated_values.append(estimated_value)
Then, in the complete_episode method, you can calculate the advantage of each decision the agent made.
class ExperienceCollector:
    ...
    def complete_episode(self, reward):
        num_states = len(self._current_episode_states)
        self.states += self._current_episode_states
        self.actions += self._current_episode_actions
        self.rewards += [reward for _ in range(num_states)]

        # Advantage of each decision: final reward minus estimated value.
        for i in range(num_states):
            advantage = reward - self._current_episode_estimated_values[i]
            self.advantages.append(advantage)

        # Reset the per-episode buffers.
        self._current_episode_states = []
        self._current_episode_actions = []
        self._current_episode_estimated_values = []
You also need to update the ExperienceBuffer class and combine_experience helper to handle the advantages.
import numpy as np


class ExperienceBuffer:
    def __init__(self, states, actions, rewards, advantages):
        self.states = states
        self.actions = actions
        self.rewards = rewards
        self.advantages = advantages

    def serialize(self, h5file):
        h5file.create_group('experience')
        h5file['experience'].create_dataset('states', data=self.states)
        h5file['experience'].create_dataset('actions', data=self.actions)
        h5file['experience'].create_dataset('rewards', data=self.rewards)
        h5file['experience'].create_dataset(
            'advantages', data=self.advantages)


def combine_experience(collectors):
    # Concatenate each parallel array across all collectors.
    combined_states = np.concatenate(
        [np.array(c.states) for c in collectors])
    combined_actions = np.concatenate(
        [np.array(c.actions) for c in collectors])
    combined_rewards = np.concatenate(
        [np.array(c.rewards) for c in collectors])
    combined_advantages = np.concatenate(
        [np.array(c.advantages) for c in collectors])

    return ExperienceBuffer(
        combined_states,
        combined_actions,
        combined_rewards,
        combined_advantages)
Your experience classes are now ready to track advantage. You can still use these classes with techniques that don’t rely on advantage; just ignore the contents of the advantages buffer while training.
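The property that matters when combining experience is that the arrays stay parallel: index i in every array must describe the same decision. The sketch below checks that with two stand-in collectors. The SimpleNamespace objects and the combine_parallel helper are hypothetical simplifications for illustration; they are not the classes defined above.

```python
from types import SimpleNamespace

import numpy as np


def combine_parallel(collectors, field):
    """Concatenate one named list across all collectors."""
    return np.concatenate([np.array(getattr(c, field)) for c in collectors])


# Stand-ins for two collectors: a two-move win and a one-move loss.
game_a = SimpleNamespace(actions=[3, 7], rewards=[1, 1], advantages=[0.4, 1.2])
game_b = SimpleNamespace(actions=[5], rewards=[-1], advantages=[-1.6])
collectors = [game_a, game_b]

actions = combine_parallel(collectors, 'actions')
rewards = combine_parallel(collectors, 'rewards')
advantages = combine_parallel(collectors, 'advantages')

# Index 2 in every array refers to the single decision from game_b.
print(actions[2], rewards[2], advantages[2])
```

A technique that doesn't use advantage can read actions and rewards from the combined data and never touch the advantages array.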
Chapter 11 showed how to define a neural network with two inputs in Keras. The Q-learning network had one input for the board and one input for the proposed move. For actor-critic learning, you want a network with one input and two outputs. The input is a representation of the board state. One output is a probability distribution over moves—the actor. The other output represents the expected return from the current position—the critic.
Building a network with two outputs brings a surprising bonus: each output serves as a sort of regularizer on the other. (Recall from chapter 6 that regularization is any technique to prevent your model from overfitting to the exact data set it was trained on.) Imagine that a group of stones on the board is in danger of getting captured. This fact is relevant for the value output, because the player with the weak stones is probably behind. It’s also relevant to the action output, because you probably want to either attack or defend the weak stones. If your network learns a “weak stone” detector in the early layers, that’s relevant to both outputs. Training on both outputs forces the network to learn a representation that’s useful for both goals. This can often improve generalization and sometimes even speed up training.
Chapter 11 introduced the Keras functional API, which gives you full freedom to connect layers in your network however you like. You’ll use it again here to build the network described in figure 12.3; this code goes in the init_ac_agent.py script.
from keras.models import Model
from keras.layers import Conv2D, Dense, Flatten, Input

# encoder is assumed to have been created earlier in the script.
board_input = Input(shape=encoder.shape(), name='board_input')

# Three convolutional layers with 64 filters each.
conv1 = Conv2D(64, (3, 3),
               padding='same',
               activation='relu')(board_input)
conv2 = Conv2D(64, (3, 3),
               padding='same',
               activation='relu')(conv1)
conv3 = Conv2D(64, (3, 3),
               padding='same',
               activation='relu')(conv2)

flat = Flatten()(conv3)
# Dense layer shared by both outputs.
processed_board = Dense(512)(flat)

# Policy output (the actor): a probability distribution over moves.
policy_hidden_layer = Dense(
    512, activation='relu')(processed_board)
policy_output = Dense(
    encoder.num_points(), activation='softmax')(
    policy_hidden_layer)

# Value output (the critic): a single number between -1 and 1.
value_hidden_layer = Dense(
    512, activation='relu')(
    processed_board)
value_output = Dense(1, activation='tanh')(
    value_hidden_layer)

model = Model(inputs=board_input,
              outputs=[policy_output, value_output])
This network has three convolutional layers with 64 filters each. That’s on the smaller side for a Go-playing network, but it has the advantage of faster training. As always, we encourage you to experiment with different network structures here.
The policy output represents a probability distribution over possible moves. The dimension is equal to the number of points on the board, and you use the softmax activation to ensure that the policy sums to 1.
The value output is a single number in the range of –1 to 1. This output has dimension 1, and you use a tanh activation to clamp the value.
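To see what the two output activations do, here is a small NumPy sketch, independent of Keras. The raw scores are made-up numbers standing in for the values a final layer might compute before its activation is applied, and the softmax helper is written out by hand for illustration.

```python
import numpy as np


def softmax(logits):
    """Turn arbitrary real-valued scores into a probability distribution."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)


# Hypothetical raw scores for four candidate moves.
move_scores = np.array([2.0, -1.0, 0.5, 0.0])
policy = softmax(move_scores)

# Hypothetical raw score for the position; tanh squashes it into (-1, 1).
value = np.tanh(3.2)

print("policy:", policy, "sum:", policy.sum())
print("value:", value)
```

Softmax keeps the ranking of the raw scores but guarantees positive entries that sum to 1, which is what a distribution over moves requires. Tanh maps any real number into the interval between –1 and 1, matching the range of the game outcome.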
Selecting moves is almost exactly the same as in the policy agent from chapter 10. You make two changes. First, because the model now produces two outputs, you need a little extra code to unpack the results. Second, you need to pass the estimated value to the experience collector, along with the state and action. The process of picking a move from the probability distribution is identical. The following listing shows the updated select_move implementation. We’ve called out places where it differs from the implementation of chapter 10’s policy agent.
class ACAgent(Agent):
    ...
    def select_move(self, game_state):
        num_moves = self.encoder.board_width * self.encoder.board_height

        board_tensor = self.encoder.encode(game_state)
        X = np.array([board_tensor])

        # The two-output model returns two arrays from predict.
        actions, values = self.model.predict(X)
        # predict is a batch call; take the first (and only) result.
        move_probs = actions[0]
        # The value output has dimension 1, so pull out its single element.
        estimated_value = values[0][0]

        eps = 1e-6
        move_probs = np.clip(move_probs, eps, 1 - eps)
        move_probs = move_probs / np.sum(move_probs)

        candidates = np.arange(num_moves)
        ranked_moves = np.random.choice(
            candidates, num_moves, replace=False, p=move_probs)
        for point_idx in ranked_moves:
            point = self.encoder.decode_point_index(point_idx)
            move = goboard.Move.play(point)
            move_is_valid = game_state.is_valid_move(move)
            fills_own_eye = is_point_an_eye(
                game_state.board, point,
                game_state.next_player)
            if move_is_valid and (not fills_own_eye):
                if self.collector is not None:
                    # Record the estimated value along with state and action.
                    self.collector.record_decision(
                        state=board_tensor,
                        action=point_idx,
                        estimated_value=estimated_value
                    )
                return goboard.Move.play(point)
        return goboard.Move.pass_turn()
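The clip-and-renormalize step in select_move is easy to overlook, so here is an isolated sketch of it on a hypothetical five-point "board." It uses NumPy's Generator API with a fixed seed for reproducibility, which is a substitution for the np.random.choice call in the listing; the probabilities are made up.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical policy output in which two moves have probability exactly 0.
move_probs = np.array([0.0, 0.7, 0.2, 0.1, 0.0])
num_moves = len(move_probs)

# Sampling a full ranking without replacement needs a nonzero probability
# for every move; NumPy is expected to raise ValueError otherwise.
try:
    rng.choice(num_moves, size=num_moves, replace=False, p=move_probs)
    raw_sampling_failed = False
except ValueError:
    raw_sampling_failed = True

# Clip away exact zeros and ones, then renormalize so the sum is 1 again.
eps = 1e-6
clipped = np.clip(move_probs, eps, 1 - eps)
clipped = clipped / np.sum(clipped)

# Every move appears exactly once; likelier moves tend to come first.
ranked_moves = rng.choice(num_moves, size=num_moves, replace=False, p=clipped)
print("raw sampling failed:", raw_sampling_failed)
print("ranked moves:", ranked_moves)
```

Because the result is a complete ordering of the board, the agent can walk down the list until it finds a move that is legal and doesn't fill its own eye, and pass only when no such move exists.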
Training your actor-critic network looks like a combination of training the policy network in chapter 10 and the action-value network in chapter 11. To train a two-output network, you construct a separate training target for each output, and choose a separate loss function for each output. This section describes how to convert the experience data to training targets, and how to use the Keras fit function with multiple outputs.
Recall how you encoded training data for policy gradient learning. For any game position, the training target was a vector the same size as the board, with a 1 or –1 in the slot corresponding to the chosen move; the 1 indicated a win, and the –1 indicated a loss. In your actor-critic learning, you use the same encoding scheme for the training data, but you replace the 1 or –1 with the advantage of the move. The advantage will have the same sign as the final reward, so the probability of the game decision will move in the same direction as in simple policy learning. But it’ll move further for actions that were deemed important, and move just a little for actions with an advantage that’s close to zero.
For the value output, the training target is the total reward. This looks exactly like the training target for Q-learning. Figure 12.4 illustrates the training setup.
When you have multiple outputs in a network, you can pick a different loss function for each output. You’ll use categorical cross-entropy for the policy output, and mean squared error for the value output. (Refer to chapters 10 and 11 for an explanation of why those loss functions make sense for those purposes.)
One new Keras feature you’ll use is loss weights. By default, Keras will sum the loss function for each output to get the overall loss function. If you specify loss weights, Keras will scale each individual loss function before summing. This allows you to adjust the relative importance of each output. In our experiments, we found the value loss was large compared to the policy loss, so we scaled down the value loss by half. Depending on your exact network and training data, you may need to adjust the loss weights somewhat.
Keras will print out the computed loss values every time you call fit. For a two-output network, it’ll print out the two losses separately. You can check there to see whether the magnitudes are comparable. If one loss is far larger than the other, consider adjusting the weights. Don’t worry about getting too precise.
The following listing shows how to encode the experience data as training data, and then call fit on the training targets. The structure is similar to the train implementations from chapters 10 and 11.
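from keras.optimizers import SGD


class ACAgent(Agent):
    ...
    # lr (learning rate) and batch_size are tuning parameters for the optimizer.
    def train(self, experience, lr=0.1, batch_size=128):
        opt = SGD(lr=lr)
        self.model.compile(
            optimizer=opt,
            # Cross-entropy for the policy output, MSE for the value output.
            loss=['categorical_crossentropy', 'mse'],
            # Weight 1.0 for the policy loss, 0.5 for the value loss.
            loss_weights=[1.0, 0.5])

        n = experience.states.shape[0]
        num_moves = self.encoder.num_points()
        policy_target = np.zeros((n, num_moves))
        value_target = np.zeros((n,))
        for i in range(n):
            # Same encoding as in chapter 10, but weighted by the advantage.
            action = experience.actions[i]
            policy_target[i][action] = experience.advantages[i]
            # Same value target as in chapter 11: the final reward.
            reward = experience.rewards[i]
            value_target[i] = reward

        self.model.fit(
            experience.states,
            [policy_target, value_target],
            batch_size=batch_size,
            epochs=1)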
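The target-encoding loop in train can also be written without an explicit Python loop. The following sketch is one possible alternative using NumPy integer-array indexing, shown on a hypothetical 3 × 3 board with three made-up decisions; it builds the same arrays as the loop in the listing.

```python
import numpy as np

num_moves = 9  # hypothetical 3 x 3 board

# Made-up experience: three decisions with their advantages and rewards.
actions = np.array([4, 0, 8])
advantages = np.array([1.5, 0.1, -1.2])
rewards = np.array([1, 1, -1])

n = actions.shape[0]

# Policy target: zeros everywhere except the chosen move's slot,
# which holds the advantage of that decision.
policy_target = np.zeros((n, num_moves))
policy_target[np.arange(n), actions] = advantages

# Value target: the final reward of the game each decision came from.
value_target = rewards.astype(float)

print(policy_target)
print(value_target)
```

On large experience buffers, the indexed assignment avoids per-row Python overhead, though the encoding step is unlikely to dominate training time either way.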
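Now that you have all the pieces, let’s try actor-critic learning end-to-end. You’ll start with a 9 × 9 bot so you can see results faster. The cycle will go like this: generate self-play games in chunks of 5,000; after each chunk, train the agent and compare it against the previous version of your bot over 100 games; if the new bot wins at least 60 of them, switch to the new version and start over; otherwise, generate another chunk of games and train again.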
The benchmark of 60 wins out of 100 is somewhat arbitrary; it’s a nice round number that gives you reasonable confidence that your bot is truly stronger, and not just lucky.
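One way to put a number on "not just lucky" is to ask how often a bot of exactly equal strength would reach a given win count by chance. The sketch below assumes every game is an independent coin flip with a 50% win probability, which is a simplification; prob_at_least is a hypothetical helper name.

```python
from math import comb


def prob_at_least(wins, games=100, p=0.5):
    """Probability of winning at least `wins` of `games` independent games,
    each won with probability p (a binomial tail sum)."""
    return sum(
        comb(games, k) * p ** k * (1 - p) ** (games - k)
        for k in range(wins, games + 1)
    )


p_60 = prob_at_least(60)
print(f"P(at least 60 wins out of 100 at equal strength) = {p_60:.4f}")
```

Under that assumption, an equally strong bot reaches 60 or more wins roughly 3% of the time, so a 60-win result is reasonable evidence of real improvement, although not proof. The same function can be applied to any other record from a 100-game match.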
Start by initializing a bot with the init_ac_agent script (as shown in listing 12.5):
python init_ac_agent.py --board-size 9 ac_v1.hdf5
After this, you should have a new file, ac_v1.hdf5, that contains the weights for your new bot. At this point, both the bot’s play and its value estimates are essentially random. You can now start generating self-play games:
python self_play_ac.py --board-size 9 --learning-agent ac_v1.hdf5 --num-games 5000 --experience-out exp_0001.hdf5
If you’re not fortunate enough to have access to a fast GPU, this is a good time to go out for a coffee or take the dog for a walk. When the self_play script is done, the output will look something like this:
Simulating game 1/5000...
9 ooxxxxxxx
8 ooox.xx.x
7 oxxxxooxx
6 oxxxxxox.
5 oooooxoxx
4 ooo.oooxo
3 ooooooooo
2 .oo.ooo.o
1 ooooooooo
  ABCDEFGHJ
W+28.5
...
Simulating game 5000/5000...
9 x.x.xxxxx
8 xxxxx.xxx
7 .x.xxxxoo
6 xxxx.xo.o
5 xxxxxxooo
4 xooooooxo
3 xoooxxxxo
2 o.o.oxxxx
1 ooooox.x.
  ABCDEFGHJ
B+15.5
After this, you should have an exp_0001.hdf5 file containing a big chunk of game records. The next step is to train:
python train_ac.py --learning-agent bots/ac_v1.hdf5 --agent-out bots/ac_v2.hdf5 --lr 0.01 --bs 1024 exp_0001.hdf5
This will take the neural network currently stored in ac_v1.hdf5, run a single epoch of training against the data in exp_0001.hdf5, and save the updated agent to ac_v2.hdf5. The optimizer will use a learning rate of 0.01 and a batch size of 1,024. You should see output something like this:
Epoch 1/1 574234/574234 [==============================] - 15s 26us/step - loss: 1.0277 - dense_3_loss: 0.6403 - dense_5_loss: 0.7750
Notice that the loss is now broken into two values: dense_3_loss and dense_5_loss, corresponding to the policy output and the value output, respectively.
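As a quick sanity check, the overall loss in the sample output should equal the weighted sum of the two component losses. The sketch below plugs in the printed values and assumes the loss weights of 1.0 and 0.5 passed to compile in the train listing.

```python
# Values copied from the sample training output.
policy_loss = 0.6403   # dense_3_loss
value_loss = 0.7750    # dense_5_loss
reported_total = 1.0277

# Loss weights assumed from the train listing: [policy, value].
loss_weights = [1.0, 0.5]

total = loss_weights[0] * policy_loss + loss_weights[1] * value_loss
print(f"weighted sum = {total:.4f}, reported = {reported_total:.4f}")
```

The weighted sum comes to about 1.0278, which agrees with the reported 1.0277 up to the rounding of the printed component values.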
After this, you can compare the updated bot against its predecessor with the eval_ac_bot.py script:
python eval_ac_bot.py --agent1 bots/ac_v2.hdf5 --agent2 bots/ac_v1.hdf5 --num-games 100
The output should look something like this:
...
Simulating game 100/100...
9 oooxxxxx.
8 .oox.xxxx
7 ooxxxxxxx
6 .oxx.xxxx
5 oooxxx.xx
4 o.ox.xx.x
3 ooxxxxxxx
2 ooxx.xxxx
1 oxxxxxxx.
  ABCDEFGHJ
B+31.5
Agent 1 record: 60/100
In this case, the output shows you exactly hit the threshold of 60 wins out of 100: you can have reasonable confidence that your bot has learned something useful. (This is just example output, of course; your actual results will look a little different, and that’s fine.) Because the ac_v2 bot is measurably stronger than ac_v1, you can switch to generating games with ac_v2:
python self_play_ac.py --board-size 9 --learning-agent ac_v2.hdf5 --num-games 5000 --experience-out exp_0002.hdf5
When that’s done, you can train and evaluate again:
python train_ac.py --learning-agent bots/ac_v2.hdf5 --agent-out bots/ac_v3.hdf5 --lr 0.01 --bs 1024 exp_0002.hdf5
python eval_ac_bot.py --agent1 bots/ac_v3.hdf5 --agent2 bots/ac_v2.hdf5 --num-games 100
This case wasn’t quite as successful as the last time:
Agent 1 record: 51/100
The ac_v3 bot beat the ac_v2 bot only 51 times out of 100. With those results, it’s hard to say whether ac_v3 is a tiny bit stronger or not; the safest conclusion is that it’s basically the same strength as ac_v2. But don’t despair. You can generate more training data and try again:
python self_play_ac.py --board-size 9 --learning-agent ac_v2.hdf5 --num-games 5000 --experience-out exp_0002a.hdf5
The train_ac script will accept multiple training data files on the command line:
python train_ac.py --learning-agent ac_v2.hdf5 --agent-out ac_v3.hdf5 --lr 0.01 --bs 1024 exp_0002.hdf5 exp_0002a.hdf5
After each additional batch of games, you can compare against ac_v2 again. In our experiments, we needed three batches of 5,000 games—a total of 15,000 games—before we got a satisfactory result:
Agent 1 record: 62/100
Success! With 62 wins against ac_v2, you can feel confident that ac_v3 is stronger than ac_v2. Now you can switch over to generating self-play games with ac_v3, and repeat the cycle again.
It’s unclear exactly how strong a Go bot can get with this actor-critic implementation alone. We’ve shown that you can train a bot to learn basic tactics, but its strength is bound to top out at some point. By deeply integrating reinforcement learning with a kind of tree search, you can train a bot that’s stronger than any human player; chapter 14 covers that technique.