Chapter 12. Reinforcement learning with actor-critic methods

This chapter covers

  • Using advantage to make reinforcement learning more efficient
  • Making a self-improving game AI with the actor-critic method
  • Designing and training multi-output neural networks in Keras

If you’re learning to play Go, one of the best ways to improve is to get a stronger player to review your games. Sometimes the most useful feedback just points out where you won or lost the game. The reviewer might give comments like, “You were already far behind by move 30” or “At move 110, you had a winning position, but your opponent turned it around by move 130.”

Why is this feedback helpful? You may not have time to scrutinize all 300 moves in a game, but you can focus your full attention on a 10- or 20-move sequence. The reviewer lets you know which parts of the game are important.

Reinforcement-learning researchers apply this principle in actor-critic learning, which is a combination of policy learning (as covered in chapter 10) and value learning (as covered in chapter 11). The policy function plays the role of the actor: it picks what moves to play. The value function is the critic: it tracks whether the agent is ahead or behind in the course of the game. That feedback guides the training process, in the same way that a game review can guide your own study.

This chapter describes how to make a self-improving game AI with actor-critic learning. The key concept that makes it all work is advantage, the difference between the actual game outcome and the expected outcome. We start by illustrating how advantage can improve the training process. After that, we’re ready to build an actor-critic game agent. First we show how to implement move selection; then we implement the new training process. In both functions, we borrow heavily from the code examples in chapters 10 and 11. The end result is the best of both worlds: it combines the benefits of policy learning and Q-learning into one agent.

12.1. Advantage tells you which decisions are important

In chapter 10, we briefly mentioned the credit-assignment problem. Suppose your learning agent played a game with 200 moves and ultimately won the game. Because it won, you can assume it chose at least a few good moves, but it probably chose a couple of bad moves as well. Credit assignment is the problem of separating the good moves, which you want to reinforce, from the bad moves, which you should ignore. This section introduces the concept of advantage, a formula for estimating how much a particular decision contributed to the final result. First we describe how advantage helps with credit assignment; then we provide code samples showing how to calculate it.

12.1.1. What is advantage?

Imagine you’re watching a basketball game; while the fourth quarter ticks down, your favorite player nails a three-pointer. How excited do you get? It depends on the game state. If the score is 80 to 78, you’re probably jumping out of your seat. If the score is 110 to 80, you’re indifferent. What’s the difference? In a close game, a three-point swing creates a huge change in the expected outcome of the game. On the other hand, if the game is a blowout, a single play won’t affect the result. The most important plays happen while the outcome is still in doubt. In reinforcement learning, advantage is a formula that quantifies this concept.

To calculate advantage, you first need an estimate of the value of a state, which we denote as V(s). This is the expected return the agent will see, given that it has already arrived at a particular state s. In games, you can think of V(s) as indicating whether the board position is good for black or white. If V(s) is close to 1, your agent is in a favorable position; if V(s) is close to –1, your agent is losing.

If you recall the action-value function Q(s,a) from the previous chapter, the concept is similar. The difference is that V(s) represents how favorable the board is before you choose a move; Q(s,a) represents how favorable the board is after you choose a move.

The definition of advantage is usually specified as follows:

A = Q(s, a) – V(s)

One way to think of this is that if you’re in a good state (that is, V(s) is high), but you make a terrible move (Q(s,a) is low), you give away your advantage, so the difference comes out negative. One problem with this formula, however, is that you don’t know how to calculate Q(s,a) directly. But the reward you get at the end of the game is an unbiased estimate of the true Q. So you can wait until you receive your reward R and then estimate the advantage as follows:

A = R – V(s)

That’s the calculation you’ll use to estimate advantage throughout this chapter. Let’s see how this value is useful.

For the purposes of illustration, you’ll pretend that you already have an accurate way to estimate V(s). In reality, your agent learns its value-estimating function and its policy function simultaneously. The next section covers how that works.

Let’s work through a few examples; the code sketch after this list reproduces these numbers:

  • At the beginning of a game, V(s) = 0: both players have a roughly equal chance. Suppose your agent wins the game; then its reward will be 1, so the advantage of its first move is 1 – 0 = 1.
  • Imagine that the game is almost over and your agent has practically locked the game up, so V(s) = 0.95. If your agent does indeed win the game, the advantage from that state is 1 – 0.95 = 0.05.
  • Now imagine your agent has another winning position, where once again V(s) = 0.95. But in this game, your bot somehow blunders away the end game and loses, giving it a reward of –1. Its advantage from that state is –1 – 0.95 = –1.95.
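
To make these numbers concrete, here’s a minimal sketch of the same arithmetic in NumPy. The values and rewards are the hypothetical numbers from the three examples above, not output from a real agent:

import numpy as np

estimated_values = np.array([0.0, 0.95, 0.95])   # V(s) in the three examples
rewards = np.array([1, 1, -1])                   # final reward in each case
advantages = rewards - estimated_values          # A = R - V(s)
print(advantages)                                # [ 1.    0.05 -1.95]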

Figures 12.1 and 12.2 illustrate the advantage calculation for a hypothetical game. In this game, your learning agent slowly pulled ahead over the first few moves; then it made some big mistakes and fell all the way to a lost position. Somewhere before move 150, it suddenly managed to reverse the game and finally cruised to a win. Under the policy gradient technique from chapter 10, you’d weight each move equally in this game. With actor-critic learning, you want to find the most important moves and give them greater weight. The advantage calculation shows you how.

Figure 12.1. Estimated values over the course of a hypothetical game. This game lasted 200 moves. In the beginning, the learning agent pulled slightly ahead; then it fell far behind; then it suddenly reversed the game and came out with a win.

Because the learning agent won, the advantage is given by A(s) = 1 – V(s). In figure 12.2, you can see that the advantage curve has the same shape as the estimated value curve, but flipped upside down. The largest advantage comes while the agent was far behind. Because most players would lose in such a bad situation, the agent must have made a great move somewhere.

Figure 12.2. The advantages for each move in a hypothetical game. The learning agent won the game, so its final reward was 1. The moves that led to the comeback have an advantage close to 2, so they’ll be strongly reinforced during training. The moves near the end of the game, when the outcome was already decided, have an advantage close to 0, so they’ll be nearly ignored during training.

After the agent pulled back ahead, around move 160 or so, its decisions are no longer interesting: the game was already wrapped up. The advantage in that section is close to 0.

Later in this chapter, we show how to adjust the training process based on the advantages. Before that, you need to calculate and store advantage through your self-play process.

12.1.2. Calculating advantage during self-play

To calculate advantage, you’ll update the ExperienceCollector class you defined in chapter 9. Originally, an experience buffer tracked three parallel arrays: states, actions, and rewards. You can add a fourth parallel array to track advantages. To fill this array, you need both the estimated value of each state and the final game outcome. You won’t have the latter until the game is over, so during the episode you accumulate the estimated values; when the game is complete, you translate them into advantages.

Listing 12.1. Updating ExperienceCollector to track advantages
class ExperienceCollector:
    def __init__(self):
        self.states = []                            1
        self.actions = []                           1
        self.rewards = []                           1
        self.advantages = []                        1
        self._current_episode_states = []           2
        self._current_episode_actions = []          2
        self._current_episode_estimated_values = [] 2

  • 1 These can span many episodes.
  • 2 These are reset at the end of every episode.

Similarly, you need to update the record_decision method to accept an estimated value along with a state and an action.

Listing 12.2. Updating ExperienceCollector to store estimated values
class ExperienceCollector:
...
    def record_decision(self, state, action,
            estimated_value=0):
        self._current_episode_states.append(state)
        self._current_episode_actions.append(action)
        self._current_episode_estimated_values.append(
            estimated_value)

Then, in the complete_episode method, you can calculate the advantage of each decision the agent made.

Listing 12.3. Calculating advantage at the end of an episode
class ExperienceCollector:
...
    def complete_episode(self, reward):
        num_states = len(self._current_episode_states)
        self.states += self._current_episode_states
        self.actions += self._current_episode_actions
        self.rewards += [reward for _ in range(num_states)]

        for i in range(num_states):
            advantage = (reward -                           1
                self._current_episode_estimated_values[i])   1
            self.advantages.append(advantage)               1

        self._current_episode_states = []                   2
        self._current_episode_actions = []                  2
        self._current_episode_estimated_values = []         2

  • 1 Calculates the advantage of each decision
  • 2 Resets the per-episode buffers.

You also need to update the ExperienceBuffer class and combine_experience helper to handle the advantages.

Listing 12.4. Adding advantage to the ExperienceBuffer structure
class ExperienceBuffer:
    def __init__(self, states, actions, rewards, advantages):
        self.states = states
        self.actions = actions
        self.rewards = rewards
        self.advantages = advantages

    def serialize(self, h5file):
        h5file.create_group('experience')
        h5file['experience'].create_dataset('states',
            data=self.states)
        h5file['experience'].create_dataset('actions',
            data=self.actions)
        h5file['experience'].create_dataset('rewards',
            data=self.rewards)
        h5file['experience'].create_dataset('advantages',
            data=self.advantages)


def combine_experience(collectors):
    combined_states = np.concatenate(
       [np.array(c.states) for c in collectors])
    combined_actions = np.concatenate(
       [np.array(c.actions) for c in collectors])
    combined_rewards = np.concatenate(
       [np.array(c.rewards) for c in collectors])
    combined_advantages = np.concatenate([
        np.array(c.advantages) for c in collectors])

    return ExperienceBuffer(
        combined_states,
        combined_actions,
        combined_rewards,
        combined_advantages)

Your experience classes are now ready to track advantage. You can still use these classes with techniques that don’t rely on advantage; just ignore the contents of the advantages buffer while training.
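
As a quick sanity check, here’s a minimal sketch of how the pieces fit together. The state, action, and value numbers are placeholders; during real self-play, select_move records them for you (as you’ll see in listing 12.6):

import numpy as np

collector1 = ExperienceCollector()
collector2 = ExperienceCollector()

collector1.record_decision(state=np.zeros((1, 9, 9)), action=40,
                           estimated_value=0.6)
collector2.record_decision(state=np.zeros((1, 9, 9)), action=41,
                           estimated_value=-0.6)

collector1.complete_episode(reward=1)     # this collector's agent won
collector2.complete_episode(reward=-1)    # this collector's agent lost

buffer = combine_experience([collector1, collector2])
print(buffer.advantages)                  # [ 0.4 -0.4]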

12.2. Designing a neural network for actor-critic learning

Chapter 11 showed how to define a neural network with two inputs in Keras. The Q-learning network had one input for the board and one input for the proposed move. For actor-critic learning, you want a network with one input and two outputs. The input is a representation of the board state. One output is a probability distribution over moves—the actor. The other output represents the expected return from the current position—the critic.

Building a network with two outputs brings a surprising bonus: each output serves as a sort of regularizer on the other. (Recall from chapter 6 that regularization is any technique to prevent your model from overfitting to the exact data set it was trained on.) Imagine that a group of stones on the board is in danger of getting captured. This fact is relevant for the value output, because the player with the weak stones is probably behind. It’s also relevant to the action output, because you probably want to either attack or defend the weak stones. If your network learns a “weak stone” detector in the early layers, that’s relevant to both outputs. Training on both outputs forces the network to learn a representation that’s useful for both goals. This can often improve generalization and sometimes even speed up training.

Chapter 11 introduced the Keras functional API, which gives you full freedom to connect layers in your network however you like. You’ll use it again here to build the network described in figure 12.3; this code goes in the init_ac_agent.py script.

Listing 12.5. A two-output network with a policy output and a value output
from keras.models import Model
from keras.layers import Conv2D, Dense, Flatten, Input

board_input = Input(shape=encoder.shape(), name='board_input')

conv1 = Conv2D(64, (3, 3),                        1
               padding='same',                    1
               activation='relu')(board_input)    1
conv2 = Conv2D(64, (3, 3),                        1
               padding='same',                    1
               activation='relu')(conv1)          1
conv3 = Conv2D(64, (3, 3),                        1
               padding='same',                    1
               activation='relu')(conv2)          1

flat = Flatten()(conv3)
processed_board = Dense(512)(flat)                2

policy_hidden_layer = Dense(                      3
    512, activation='relu')(processed_board)      3
policy_output = Dense(                            3
    encoder.num_points(), activation='softmax')(  3
    policy_hidden_layer)                          3

value_hidden_layer = Dense(                       4
    512, activation='relu')(                      4
    processed_board)                              4
value_output = Dense(1, activation='tanh')(       4
    value_hidden_layer)                           4

model = Model(inputs=board_input,
  outputs=[policy_output, value_output])

  • 1 Add as many convolutional layers as you like.
  • 2 This example uses hidden layers of size 512. Experiment to find the best size. The three hidden layers don’t need to be the same size.
  • 3 This output yields the policy function.
  • 4 This output yields the value function.
Figure 12.3. A neural network suitable for actor-critic learning for Go. This network has a single input, which takes a representation of the current board position. The network produces two outputs. One output indicates which moves the agent should play; this is the policy output, or the actor. The other output indicates which player is ahead in the game; this is the value output, or the critic. The critic isn’t used while playing a game, but it helps the training process.

This network has three convolutional layers with 64 filters each. That’s on the smaller side for a Go-playing network, but it has the advantage of faster training. As always, we encourage you to experiment with different network structures here.

The policy output represents a probability distribution over possible moves. The dimension is equal to the number of points on the board, and you use the softmax activation to ensure that the policy sums to 1.

The value output is a single number in the range of –1 to 1. This output has dimension 1, and you use a tanh activation to clamp the value.
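
Before wiring this model into an agent, you can sanity-check the two outputs with a random input. This is just a quick sketch; it assumes the encoder and model from listing 12.5 on a 9 × 9 board:

import numpy as np

X = np.random.random((1,) + encoder.shape())   # a batch containing one fake board
policy, value = model.predict(X)               # a two-output model returns two arrays
print(policy.shape)                            # (1, 81) on a 9 x 9 board
print(np.sum(policy[0]))                       # softmax probabilities sum to ~1.0
print(value.shape)                             # (1, 1): one value in the range -1 to 1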

12.3. Playing games with an actor-critic agent

Selecting moves is almost exactly the same as in the policy agent from chapter 10. You make two changes. First, because the model now produces two outputs, you need a little extra code to unpack the results. Second, you need to pass the estimated value to the experience collector, along with the state and action. The process of picking a move from the probability distribution is identical. The following listing shows the updated select_move implementation. We’ve called out the places where it differs from the implementation of chapter 10’s policy agent.

Listing 12.6. Selecting a move for an actor-critic agent
class ACAgent(Agent):
...
    def select_move(self, game_state):
        num_moves = (self.encoder.board_width *
                     self.encoder.board_height)

        board_tensor = self.encoder.encode(game_state)
        X = np.array([board_tensor])

        actions, values = self.model.predict(X)            1
        move_probs = actions[0]                            2
        estimated_value = values[0][0]                     3

        eps = 1e-6
        move_probs = np.clip(move_probs, eps, 1 - eps)
        move_probs = move_probs / np.sum(move_probs)

        candidates = np.arange(num_moves)
        ranked_moves = np.random.choice(
            candidates, num_moves, replace=False, p=move_probs)
        for point_idx in ranked_moves:
            point = self.encoder.decode_point_index(point_idx)
            move = goboard.Move.play(point)
            move_is_valid = game_state.is_valid_move(move)
            fills_own_eye = is_point_an_eye(
                game_state.board, point,
                game_state.next_player)
            if move_is_valid and (not fills_own_eye):
                if self.collector is not None:
                    self.collector.record_decision(        4
                        state=board_tensor,                4
                        action=point_idx,                  4
                        estimated_value=estimated_value    4
                    )
                return goboard.Move.play(point)
        return goboard.Move.pass_turn()

  • 1 Because this is a two-output model, predict returns a tuple containing two NumPy arrays.
  • 2 predict is a batch call that can process several boards at once, so you must select the first element of the array to get the probability distribution you want.
  • 3 The values come back as a two-dimensional array of shape (batch size, 1), so you pull out the single element to get the estimated value as a plain float.
  • 4 Include the estimated value in the experience buffer.

12.4. Training an actor-critic agent from experience data

Training your actor-critic network looks like a combination of training the policy network in chapter 10 and the action-value network in chapter 11. To train a two-output network, you construct a separate training target for each output, and choose a separate loss function for each output. This section describes how to convert the experience data to training targets, and how to use the Keras fit function with multiple outputs.

Recall how you encoded training data for policy gradient learning. For any game position, the training target was a vector the same size as the board, with a 1 or –1 in the slot corresponding to the chosen move; the 1 indicated a win, and the –1 indicated a loss. In your actor-critic learning, you use the same encoding scheme for the training data, but you replace the 1 or –1 with the advantage of the move. The advantage will have the same sign as the final reward, so the probability of the game decision will move in the same direction as in simple policy learning. But it’ll move further for actions that were deemed important, and move just a little for actions with an advantage that’s close to zero.

For the value output, the training target is the total reward. This looks exactly like the training target for Q-learning. Figure 12.4 illustrates the training setup.

Figure 12.4. Training setup for actor-critic learning. The neural network has two outputs: one for the policy and one for the value. Each gets its own training target. The policy output is trained against a vector the same size as the board. The cell in the vector corresponding to the chosen move is filled in with the advantage calculated for that move; the rest are zero. The value output is trained against the final outcome of the game.

When you have multiple outputs in a network, you can pick a different loss function for each output. You’ll use categorical cross-entropy for the policy output, and mean squared error for the value output. (Refer to chapters 10 and 11 for an explanation of why those loss functions make sense for those purposes.)

One new Keras feature you’ll use is loss weights. By default, Keras will sum the loss function for each output to get the overall loss function. If you specify loss weights, Keras will scale each individual loss function before summing. This allows you to adjust the relative importance of each output. In our experiments, we found the value loss was large compared to the policy loss, so we scaled down the value loss by half. Depending on your exact network and training data, you may need to adjust the loss weights somewhat.

Tip

Keras will print out the computed loss values every time you call fit. For a two-output network, it’ll print out the two losses separately. You can check there to see whether the magnitudes are comparable. If one loss is far larger than the other, consider adjusting the weights. Don’t worry about getting too precise.

The following listing shows how to encode the experience data as training data, and then call fit on the training targets. The structure is similar to the train implementations from chapters 10 and 11.

Listing 12.7. Training an actor-critic agent from experience data
class ACAgent(Agent):
...
    def train(self, experience, lr=0.1, batch_size=128):         1
        opt = SGD(lr=lr)
        self.model.compile(
            optimizer=opt,
            loss=['categorical_crossentropy', 'mse'],            2
            loss_weights=[1.0, 0.5])                                 3

        n = experience.states.shape[0]
        num_moves = self.encoder.num_points()
        policy_target = np.zeros((n, num_moves))
        value_target = np.zeros((n,))
        for i in range(n):
            action = experience.actions[i]                       4
            policy_target[i][action] = experience.advantages[i]  4
            reward = experience.rewards[i]                       5
            value_target[i] = reward                             5

        self.model.fit(
            experience.states,
            [policy_target, value_target],
            batch_size=batch_size,
            epochs=1)

  • 1 lr (learning rate) and batch_size are tuning parameters for the optimizer; refer to chapter 10 for more discussion.
  • 2 categorical_crossentropy is for the policy output, just as in chapter 10. mse (mean squared error) is for the value output, just as in chapter 11. The order here matches the order in the Model constructor in listing 12.5.
  • 3 The weight 1.0 applies to the policy output; the weight 0.5 applies to the value output.
  • 4 This is the same as the encoding scheme used in chapter 10, but weighted by the advantage.
  • 5 This is the same as the encoding scheme used in chapter 11.

Now that you have all the pieces, let’s try actor-critic learning end-to-end. You’ll start with a 9 × 9 bot so you can see results faster. The cycle will go like this:

  1. Generate self-play games in chunks of 5,000.
  2. After each chunk, train the agent and compare it to the previous version of your bot.
  3. If the new bot beats the previous bot in at least 60 out of 100 games, you’ve successfully improved your agent! Start the process over with the new bot.
  4. If the updated bot wins fewer than 60 out of 100 games, generate another chunk of self-play games and retrain. Continue training until the new bot is strong enough.

The benchmark of 60 wins out of 100 is somewhat arbitrary; it’s a nice round number that gives you reasonable confidence that your bot is truly stronger, and not just lucky.
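
If you want to automate this cycle, the control logic looks roughly like the following sketch. The helper functions here (generate_self_play_games, train_agent, and evaluate_agents) are hypothetical stand-ins for the self_play_ac.py, train_ac.py, and eval_ac_bot.py scripts you’re about to run by hand; only the promotion logic matters:

def improve(current_bot, num_generations=10):
    for generation in range(num_generations):
        experience_files = []
        while True:
            # Hypothetical wrappers around the command-line scripts shown below
            experience_files.append(
                generate_self_play_games(current_bot, num_games=5000))
            candidate = train_agent(current_bot, experience_files)
            wins = evaluate_agents(candidate, current_bot, num_games=100)
            if wins >= 60:               # the candidate is measurably stronger
                current_bot = candidate  # promote it and start a new generation
                break
    return current_bot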

Start by initializing a bot with the init_ac_agent script (as shown in listing 12.5):

python init_ac_agent.py --board-size 9 ac_v1.hdf5

After this, you should have a new file, ac_v1.hdf5, that contains the weights for your new bot. At this point, both the bot’s play and its value estimates are essentially random. You can now start generating self-play games:

python self_play_ac.py \
  --board-size 9 \
  --learning-agent ac_v1.hdf5 \
  --num-games 5000 \
  --experience-out exp_0001.hdf5

If you’re not fortunate enough to have access to a fast GPU, this is a good time to go out for a coffee or take the dog for a walk. When the self_play script is done, the output will look something like this:

Simulating game 1/5000...
 9 ooxxxxxxx
 8 ooox.xx.x
 7 oxxxxooxx
 6 oxxxxxox.
 5 oooooxoxx
 4 ooo.oooxo
 3 ooooooooo
 2 .oo.ooo.o
 1 ooooooooo
   ABCDEFGHJ
W+28.5
...
Simulating game 5000/5000...
 9 x.x.xxxxx
 8 xxxxx.xxx
 7 .x.xxxxoo
 6 xxxx.xo.o
 5 xxxxxxooo
 4 xooooooxo
 3 xoooxxxxo
 2 o.o.oxxxx
 1 ooooox.x.
   ABCDEFGHJ
B+15.5

After this, you should have an exp_0001.hdf5 file containing a big chunk of game records. The next step is to train:

python train_ac.py \
  --learning-agent bots/ac_v1.hdf5 \
  --agent-out bots/ac_v2.hdf5 \
  --lr 0.01 --bs 1024 \
  exp_0001.hdf5

This will take the neural network currently stored in ac_v1.hdf5, run a single epoch of training against the data in exp_0001.hdf5, and save the updated agent to ac_v2.hdf5. The optimizer will use a learning rate of 0.01 and a batch size of 1,024. You should see output something like this:

Epoch 1/1
574234/574234 [==============================] - 15s 26us/step - loss:
 1.0277 - dense_3_loss: 0.6403 - dense_5_loss: 0.7750

Notice that the loss is now broken into two values: dense_3_loss and dense_5_loss, corresponding to the policy output and the value output, respectively.
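
The dense_3 and dense_5 names are automatically generated layer names, so they may differ from run to run. If you’d like more readable logs, one optional tweak to listing 12.5 is to name the two output layers; Keras then uses those names when it reports the per-output losses:

policy_output = Dense(encoder.num_points(), activation='softmax',
                      name='policy_output')(policy_hidden_layer)
value_output = Dense(1, activation='tanh',
                     name='value_output')(value_hidden_layer)

With these names, fit reports policy_output_loss and value_output_loss instead of the generated names.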

After this, you can compare the updated bot against its predecessor with the eval_ac_bot.py script:

python eval_ac_bot.py \
  --agent1 bots/ac_v2.hdf5 \
  --agent2 bots/ac_v1.hdf5 \
  --num-games 100

The output should look something like this:

...
Simulating game 100/100...
 9 oooxxxxx.
 8 .oox.xxxx
 7 ooxxxxxxx
 6 .oxx.xxxx
 5 oooxxx.xx
 4 o.ox.xx.x
 3 ooxxxxxxx
 2 ooxx.xxxx
 1 oxxxxxxx.
   ABCDEFGHJ
B+31.5
Agent 1 record: 60/100

In this case, the output shows that you hit the threshold of 60 wins out of 100 exactly: you can have reasonable confidence that your bot has learned something useful. (This is just example output, of course; your actual results will look a little different, and that’s fine.) Because the ac_v2 bot is measurably stronger than ac_v1, you can switch to generating games with ac_v2:

python self_play_ac.py \
  --board-size 9 \
  --learning-agent ac_v2.hdf5 \
  --num-games 5000 \
  --experience-out exp_0002.hdf5

When that’s done, you can train and evaluate again:

python train_ac.py \
  --learning-agent bots/ac_v2.hdf5 \
  --agent-out bots/ac_v3.hdf5 \
  --lr 0.01 --bs 1024 \
  exp_0002.hdf5
python eval_ac_bot.py \
  --agent1 bots/ac_v3.hdf5 \
  --agent2 bots/ac_v2.hdf5 \
  --num-games 100

This case wasn’t quite as successful as the last time:

Agent 1 record: 51/100

The ac_v3 bot beat the ac_v2 bot only 51 times out of 100. With those results, it’s hard to say whether ac_v3 is a tiny bit stronger or not; the safest conclusion is that it’s basically the same strength as ac_v2. But don’t despair. You can generate more training data and try again:

python self_play_ac.py \
  --board-size 9 \
  --learning-agent ac_v2.hdf5 \
  --num-games 5000 \
  --experience-out exp_0002a.hdf5

The train_ac script will accept multiple training data files on the command line:

python train_ac.py \
  --learning-agent ac_v2.hdf5 \
  --agent-out ac_v3.hdf5 \
  --lr 0.01 --bs 1024 \
  exp_0002.hdf5 exp_0002a.hdf5

After each additional batch of games, you can compare against ac_v2 again. In our experiments, we needed three batches of 5,000 games—a total of 15,000 games—before we got a satisfactory result:

Agent 1 record: 62/100

Success! With 62 wins against ac_v2, you can feel confident that ac_v3 is stronger than ac_v2. Now you can switch over to generating self-play games with ac_v3, and repeat the cycle again.

It’s unclear exactly how strong a Go bot can get with this actor-critic implementation alone. We’ve shown that you can train a bot to learn basic tactics, but its strength is bound to top out at some point. By deeply integrating reinforcement learning with a kind of tree search, you can train a bot that’s stronger than any human player; chapter 14 covers that technique.

12.5. Summary

  • Actor-critic learning is a reinforcement-learning technique in which you simultaneously learn a policy function and a value function. The policy function tells you how to make decisions, and the value function helps you improve the training of the policy function. You can apply actor-critic learning to the same kinds of problems where you’d apply policy gradient learning, but actor-critic learning is often more stable.
  • Advantage is the difference between the actual reward an agent sees and the expected reward at some point in the episode. For games, this is the difference between the actual game result (win or loss) and the expected value (as estimated by the agent’s value model).
  • Advantage helps identify the important decisions in a game. If a learning agent wins a game, the advantage will be largest for moves it made from an even or losing position. The advantage will be close to zero for moves it made after the game was already decided.
  • A Keras model built with the functional API can have multiple outputs. In actor-critic learning, this lets you create a single network that models both the policy function and the value function.