Chapter 11. Reinforcement learning with value methods

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 11. Reinforcement learning with value methods

This chapter covers

Making a self-improving game AI with the Q-learning algorithm
Defining and training multi-input neural networks in Keras
Building and training a Q-learning agent by using Keras

Have you ever read an expert commentary on a high-level chess or Go tournament game? You’ll often see comments like, “Black is far behind at this point” or “The result up to here is slightly better for white.” What does it mean to be “ahead” or “behind” in the middle of such a strategy game? This isn’t basketball, with a running score to refer to. Instead, the commentator means that the board position is favorable to one player or the other. If you want to be precise, you could define it with a thought experiment. Find a hundred evenly matched pairs of players. Give each pair the board position from the middle of the game, and tell them to start playing from there. If the player taking black wins a small majority of the games—say, 55 out of 100—you can say the position was slightly good for black.

Of course, the commentators are doing no such thing. Instead, they’re relying on their own intuition, built up over thousands of games, to make a judgment on what might happen. In this chapter, we show how to train a computer game player to make similar judgments. And the computer will learn to do it in much the same way a human will: by playing many, many games.

This chapter introduces the Q-learning algorithm. Q-learning is a method for training a reinforcement-learning agent to anticipate how much reward it can expect in the future. (In the context of games, reward means winning games.) First, we describe how a Q-learning agent makes decisions and improves over time. After that, we show how to implement Q-learning within the Keras framework. Then you’ll be ready to train another self-improving game AI, with a different personality from the policy learning agent from chapter 10.

11.1. Playing games with Q-learning

Suppose you have a function that tells you your chances of winning after playing a particular move. That’s called an action-value function—it tells you how valuable a particular action is. Then playing the game would be easy: you’d just pick the highest-value move on each turn. The question, then, is how you come up with an action-value function.

This section describes Q-learning, a technique for training an action-value function through reinforcement learning. Of course, you can never learn the true action-value function for moves in Go: that would require reading out the entire game tree, with all its trillions of trillions of trillions of possibilities. But you can iteratively improve an estimate of the action-value function through self-play. As the estimate gets more accurate, a bot that relies on the estimate will get stronger.

The name Q-learning comes from standard mathematical notation. Traditionally, Q(s,a) is used to represent the action-value function. This is a function of two variables: s represents the state the agent is faced with (for example, a board position); a represents an action the agent is considering (a possible move to play next). Figure 11.1 illustrates the inputs to an action-value function. This chapter focuses on deep Q-learning, in which you use a neural network to estimate the Q function. But most of the principles also apply to classical Q-learning, where you approximate the Q function with a simple table that has a row for each possible state and a column for each possible action.

Figure 11.1. An action-value function takes two inputs: a state (board position) and an action (proposed move). It returns an estimate of the expected return (chance of winning the game) if the agent chooses this action. The action-value function is traditionally called Q in mathematical notation.

In the previous section, you studied reinforcement learning by directly learning a policy—a rule for selecting moves. The structure of Q-learning will seem familiar. First you make an agent play against itself, recording all the decisions and game results; the game results tell you something about whether the decisions were good; then you update the agent’s behavior accordingly. Q-learning differs from policy learning in the way the agent makes decisions in a game, and the way it updates its behavior based on the results.

To build a game-playing agent out of a Q function, you need to turn the Q function into a policy. One option is to plug in every possible move to the Q function and pick the move with the highest expected return, as shown in figure 11.2. This policy is called a greedy policy.

Figure 11.2. In a greedy action-value policy, you loop over every possible move and estimate the action value. Then you choose the action with the highest estimated value. (Many legal moves have been omitted to save space.)

If you have confidence in your estimates of the action values, a greedy policy is the best choice. But in order to improve your estimates, you need your bot to occasionally explore the unknown. This is called an ϵ-greedy policy: some fraction ϵ of the time, the policy chooses completely randomly; the rest of the time, it’s a normal greedy policy. Figure 11.3 shows this procedure as a flowchart.

Figure 11.3. Flowchart for the ϵ-greedy action-value policy. This policy tries to balance playing the best move with exploring unknown moves. The value of ϵ controls that balance.

Note

ϵ is the Greek letter epsilon, often used to represent a small fraction.

Listing 11.1. Pseudocode for an ϵ-greedy policy

def select_action(state, epsilon):
    possible_actions = get_possible_actions(state)
    if random.random() < epsilon:                                 1
        return random.choice(possible_actions)                    1
    best_action = None                                            2
    best_value = MIN_VALUE                                        2
    for action in get_possible_actions(state):                    2
        action_value = self.estimate_action_value(state, action)  2
        if action_value > best_value:                             2
            best_action = action                                  2
            best_value = action_value                             2
    return best_action

1 Random exploration case
2 Picking the best known move

The choice of ϵ represents a trade-off. When it’s close to 0, the agent chooses whatever the best moves are, according to its current estimate of the action-value; but the agent won’t have any opportunity to try new moves and thereby improve its estimates. When ϵ is higher, the agent will lose more games, but in exchange it’ll learn about many unknown moves.

You can make an analogy with the way humans learn a skill, whether it’s playing Go or playing the piano. It’s common for human learners to hit a plateau—a point where they’re comfortable with a certain range of skills but have stopped improving. To get over the hump, you need to force yourself out of your comfort zone and experiment with new things. Maybe that’s new fingerings and rhythms on a piano, or new openings and tactics in Go. Your performance may get worse while you find yourself in unfamiliar situations, but after you learn how the new techniques work, you come out stronger than before.

In Q-learning, you generally start with a fairly high value of ϵ, perhaps 0.5. As the agent improves, you gradually decrease ϵ. Note that if ϵ drops all the way to 0, your agent will stop learning: it’ll just play out the same game over and over again.

After you’ve generated a large set of games, the training process for Q-learning is similar to supervised learning. The actions that the agent took provide your training set, and you can treat the game outcomes as known good labels for the data. Of course, there’ll be some lucky wins in the training set; but over the course of thousands of games, you can rely on an equal number of losses to cancel them out.

In the same way that the move-prediction model from chapter 7 learned to predict human moves from games it had never seen before, the action-value model can learn to predict the value of moves it has never played before. You can use the game outcomes as the target for the training process, as shown in figure 11.4. To get it to generalize in this way, you need an appropriate neural network design and plenty of training data.

Figure 11.4. Setting up training data for deep Q-learning. At the top, we show how we created training data for the move-prediction networks you used in chapters 6 and 7. The board position was the input, and the actual move was the output. The bottom shows the structure for training data for Q-learning. Both the board position and chosen move are inputs; the output is the game outcome: a 1 for a win and a –1 for a loss.

11.2. Q-learning with Keras

This section shows how to implement the Q-learning algorithm in the Keras framework. So far, you’ve used Keras to train functions that have one input and one output. Because an action-value function has two inputs, you’ll need to use new Keras features to design an appropriate network. After introducing two-input networks in Keras, we show how to evaluate moves, assemble training data, and train your agent.

11.2.1. Building two-input networks in Keras

In previous chapters, you used the Keras Sequential model to define your neural networks. The following listing shows an example model defined with the sequential API.

Listing 11.2. Defining a model with the Keras sequential API

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(32, input_shape=(19, 19)))
model.add(Dense(24))

Keras provides a second API for defining a neural network: the functional API. The functional API provides a superset of the functionality of the sequential API. You can rewrite any sequential network in the functional style, and you can also create complex networks that can’t be described in the sequential style.

The main difference is in the way you specify the connections between layers. To connect layers in a sequential model, you repeatedly call add on the model object; that automatically connects the last layer’s output to the input of the new layer. To connect layers in a functional model, you pass the input layer to the next layer with syntax that looks like a function call. Because you’re explicitly creating each connection, you can describe more-complex networks. The following listing shows how to create an identical network to listing 11.2 by using the functional style.

Listing 11.3. Defining an identical model with the Keras functional API

from keras.models import Model
from keras.layers import Dense, Input

model_input = Input(shape=(19, 19))
hidden_layer = Dense(32)(model_input)       1
output_layer = Dense(24)(hidden_layer)      2

model = Model(inputs=[model_input], outputs=[output_layer])

1 Connects model_input to the input of a Dense layer, and names that layer hidden_layer
2 Connects hidden_layer to the input of a new Dense layer, and names that layer output_layer

These two models are identical. The sequential API is a convenient way to describe the most common neural networks, and the functional API provides the flexibility to specify multiple inputs and outputs, or complex connections.

Because your action-value network has two inputs and one output, at some point you need to merge the two input chains together. The Keras Concatenate layer lets you accomplish this. A Concatenate layer doesn’t do any computation; it just glues together two vectors or tensors into one, as shown in figure 11.5. It takes an optional axis argument that specifies which dimension to concatenate; it defaults to the last dimension, which is what you want in this case. All other dimensions must be the same size.

Figure 11.5. The Keras `Concatenate` layer appends two tensors into one.

Now you can design a network for learning an action-value function. Recall the convolutional networks you used for move prediction in chapters 6 and 7. You can conceptually chunk the network into two steps: first the convolutional layers identify the important shapes of stones on the board; then a dense layer makes a decision based on those shapes. Figure 11.6 shows how the layers in the move-prediction network perform two separate roles.

Figure 11.6. A move-prediction network, as covered in chapters 6 and 7. Although there are many layers, you can conceptually think of it as two steps. The convolutional layers process the raw stones and organize them into logical groups and tactical shapes. From that representation, the dense layers can choose an action.

For the action-value network, you still want to process the board into important shapes and groups of stones. Any shape that’s relevant for move prediction is likely to be relevant for estimating the action-value as well, so this part of the network can borrow the same structure. The difference comes in the decision-making step. Instead of making a decision based only on the identified groups of stones, you want to estimate a value based on the processed board and the proposed action. So you can bring in the proposed move vector after the convolutional layers. Figure 11.8 illustrates such a network.

Because you use –1 to represent a loss, and 1 to represent a win, the action-value should be a single value in the range of –1 to 1. To achieve this, you add a Dense layer of size 1 with a tanh activation. You may know tanh as the hyperbolic tangent function from trigonometry. In deep learning, you don’t care about the trigonometric properties of tanh at all. Instead, you use it because it’s a smooth function that’s bounded below by –1 and above by 1. No matter what the early layers of the network compute, the output will be in the desired range. Figure 11.7 shows a plot of the tanh function.

Figure 11.7. The tanh (hyperbolic tangent) function, which clamps its value between –1 and 1

The full specification for your action-value network looks like the following listing.

Listing 11.4. A two-input action-value network

from keras.models import Model
from keras.layers import Conv2D, Dense, Flatten, Input
from keras.layers import ZeroPadding2D, concatenate

board_input = Input(shape=encoder.shape(), name='board_input')
action_input = Input(shape=(encoder.num_points(),),
                          name='action_input')

conv1a = ZeroPadding2D((2, 2))(board_input)                     1
conv1b = Conv2D(64, (5, 5), activation='relu')(conv1a)          1
                                                                1
conv2a = ZeroPadding2D((1, 1))(conv1b)                          1
conv2b = Conv2D(64, (3, 3), actionvation='relu')(conv2a)        1

flat = Flatten()(conv2b)
processed_board = Dense(512)(flat)

board_and_action = concatenate([action_input, processed_board])
hidden_layer = Dense(256, activation='relu')(board_and_action)  2
value_output = Dense(1, activation='tanh')(hidden_layer)        3

model = Model(inputs=[board_input, action_input],
                outputs=value_output)

1 Add as many convolutional layers as you like. Anything that worked well for move prediction ought to work well here.
2 You may want to experiment with the size of this hidden layer.
3 The tanh activation layer clamps the output between –1 and 1.

Figure 11.8. The two-input neural network described in listing 11.4. The game board goes through several convolutional layers, just like the move-prediction network from chapter 7. The proposed move goes into a separate input. The proposed move is combined with the output of the convolutional layers and passed through another dense layer.

11.2.2. Implementing the ϵ-greedy policy with Keras

Let’s start building a QAgent that can learn via Q-learning; this code will live in the dlgo/rl/q.py module. Listing 11.5 shows the constructor: just as in our policy learning agent, it takes a model and a board encoder. You also define two utility methods. The set_temperature method lets you change the value of ϵ, which you’ll want to vary across the training process. Just as in chapter 9, the set_collector method lets you attach an ExperienceCollector object to store the experience data for later training.

Listing 11.5. Constructor and utility methods for a Q-learning agent

class QAgent(Agent):
    def __init__(self, model, encoder):
        self.model = model
        self.encoder = encoder
        self.collector = None
        self.temperature = 0.0

    def set_temperature(self, temperature):   1
        self.temperature = temperature        1

    def set_collector(self, collector):       2
        self.collector = collector            2

1 temperature is the ϵ value that controls how randomized the policy is.
2 See chapter 9 for more information about using a collector object to record the agent’s experiences.

Next, you implement the ϵ-greedy policy. Instead of just picking the top-rated move, you sort all the moves and try them in order. As in chapter 9, this prevents the agent from self-destructing at the end of a won game.

Listing 11.6. Selecting moves for a Q-learning agent

class QAgent(Agent):
    ...
    def select_move(self, game_state):
        board_tensor = self.encoder.encode(game_state)

        moves = []                                              1
        board_tensors = []                                      1
        for move in game_state.legal_moves():                   1
            if not move.is_play:                                1
                continue                                        1
            moves.append(self.encoder.encode_point(move.point)) 1
            board_tensors.append(board_tensor)                  1
        if not moves:                                           2
            return goboard.Move.pass_turn()                     2

        num_moves = len(moves)
        board_tensors = np.array(board_tensors)
        move_vectors = np.zeros(                                3
            (num_moves, self.encoder.num_points()))             3
        for i, move in enumerate(moves):                        3
            move_vectors[i][move] = 1                           3

        values = self.model.predict(                            4
            [board_tensors, move_vectors])                      4
        values = values.reshape(len(moves))                     4 5

        ranked_moves = self.rank_moves_eps_greedy(values)       6

        for move_idx in ranked_moves:                           7
            point = self.encoder.decode_point_index(            7
                moves[move_idx])                                7
            if not is_point_an_eye(game_state.board,            7
                                   point,                       7
                                   game_state.next_player):     7
                if self.collector is not None:                  8
                    self.collector.record_decision(             8
                        state=board_tensor,                     8
                        action=moves[move_idx],                 8
                    )                                           8
                return goboard.Move.play(point)
        return goboard.Move.pass_turn()                         9

1 Generates a list of all valid moves
2 If there are no valid moves left, the agent can just pass.
3 One-hot encodes all the valid moves (see chapter 5 for more on one-hot encoding).
4 This is the two-input form of predict: you pass the two inputs as a list.
5 Values will be an N × 1 matrix, where N is the number of legal moves; the reshape call converts to a vector of size N.
6 Ranks the moves according to the ϵ-greedy policy
7 Picks the first non-self-destructive move in your list, similar to the self-play agents from chapter 9
8 Records the decision in an experience buffer; see chapter 9
9 You’ll fall through here if all the valid moves are determined to be self-destructive.

Q-learning and tree search

The structure of the select_move implementation is similar to some of the tree-search algorithms we covered in chapter 4. For example, alpha-beta search relies on a board evaluation function: a function that takes a board position and estimates which player is ahead and by how much. This is similar, but not identical, to the action-value function we’ve covered in this chapter. Suppose the agent is playing as black and evaluating some move X. It gets an estimated action-value of 0.65 for X. Now, you know exactly what the board will look like after playing move X, and you know that a win for black is a loss for white. So you can say that the next board position has a value of –0.65 for white.

Mathematically, we describe the relationship as follows:

Q(s,a) = –V(s′)

where s’ is the state that white sees after black chooses move a.

Although Q-learning in general can apply to any environment, this equivalence between the action-value of one state and value of the next state is true only in deterministic games.

Chapter 12 covers a third reinforcement-learning technique that includes learning a value function directly, instead of an action-value function. Chapters 13 and 14 show methods for integrating such a value function with a tree-search algorithm.

All that remains is the code to sort the moves in order from most valuable to least valuable. The complication is that you have two parallel arrays: values and moves. The NumPy argsort function provides a convenient way to handle this. Instead of sorting an array in place, argsort returns a list of indices. Then you can read off the elements of the parallel array according to those indices. Figure 11.9 illustrates how argsort works. Listing 11.7 shows how to rank the moves by using argsort.

Figure 11.9. An illustration of the `argsort` function from the NumPy library. `argsort` takes a vector of values that you want to sort. Instead of sorting the values directly, it returns a vector of indices that will give you the values in sorted order. So the first value of the output vector is the index of the smallest value in the input, and the last value of the output vector is the index of the largest value in the input.

Listing 11.7. Selecting moves for a Q-learning agent

class QAgent(Agent):
    ...
    def rank_moves_eps_greedy(self, values):
        if np.random.random() < self.temperature:    1
            values = np.random.random(values.shape)  1
        ranked_moves = np.argsort(values)            2
        return ranked_moves[::-1]                    3

1 In the exploration case, rank the moves by random numbers instead of the real values.
2 Gets the indices of the moves in order from least value to highest value
3 The [::-1] syntax is the most efficient way to reverse a vector in NumPy. This returns the moves in order from highest value to least value.

With this in place, you’re ready to generate self-play games with your Q-learning agent. Next, we cover how to train the action-value network.

11.2.3. Training an action-value function

After you have a batch of experience data, you’re ready to update the agent’s network. With policy gradient learning, you knew the approximate gradient that you wanted, but you had to come up with a complicated scheme to apply that gradient update inside the Keras framework. In contrast, training with Q-learning is a straightforward application of the Keras fit function. You can directly put the game results into the target vector.

Chapter 6 covered two loss functions: mean squared error and cross-entropy loss. You used cross-entropy loss when you wanted to match one out of a discrete set of items: in that case, you were trying to match one of the points on a Go board. The Q function, on the other hand, has a continuous value that can be anywhere in the range –1 to 1. For this problem, we prefer mean squared error.

The following listing shows the implementation of a train function for the QAgent class.

Listing 11.8. Training the Q-learning agent from its experience

class QAgent(Agent):
...
    def train(self, experience, lr=0.1, batch_size=128):      1
        opt = SGD(lr=lr)
        self.model.compile(loss='mse', optimizer=opt)         2

        n = experience.states.shape[0]
        num_moves = self.encoder.num_points()
        y = np.zeros((n,))
        actions = np.zeros((n, num_moves))
        for i in range(n):
            action = experience.actions[i]
            reward = experience.rewards[i]
            actions[i][action] = 1
            y[i] = reward

        self.model.fit(
            [experience.states, actions], y,                  3
            batch_size=batch_size,
            epochs=1)

1 lr and batch_size are options to fine-tune the training process. See chapter 10 for more discussion.
2 mse is mean squared error. You use mse instead of categorical_crossentropy because you’re trying to learn a continuous value.
3 Passes the two different inputs as a list

11.3. Summary

An action-value function estimates how much reward an agent can expect after taking a specific action. In the case of games, this means the expected chance of winning.
Q-learning is the technique of reinforcement learning by estimating an action-value function (traditionally notated as Q).
While training a Q-learning agent, you normally use an ϵ-greedy policy. Under this policy, the agent will select the highest-valued move a fraction of the time, and a random move the rest of the time. The parameter ϵ controls how much the agent will explore unknown moves in order to learn more about them.
The Keras functional API lets you design neural networks with multiple inputs, multiple outputs, or complex internal connections. For Q-learning, you can use the functional API to build a network with separate inputs for the game state and the proposed move.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 11. Reinforcement learning with value methods

Create new playlist

Sign In

Sign Up

Chapter 11. Reinforcement learning with value methods

11.1. Playing games with Q-learning

Figure 11.1. An action-value function takes two inputs: a state (board position) and an action (proposed move). It returns an estimate of the expected return (chance of winning the game) if the agent chooses this action. The action-value function is traditionally called Q in mathematical notation.

Figure 11.2. In a greedy action-value policy, you loop over every possible move and estimate the action value. Then you choose the action with the highest estimated value. (Many legal moves have been omitted to save space.)

Figure 11.3. Flowchart for the ϵ-greedy action-value policy. This policy tries to balance playing the best move with exploring unknown moves. The value of ϵ controls that balance.

Note

Listing 11.1. Pseudocode for an ϵ-greedy policy

11.2. Q-learning with Keras

11.2.1. Building two-input networks in Keras

Listing 11.2. Defining a model with the Keras sequential API

Listing 11.3. Defining an identical model with the Keras functional API

Figure 11.5. The Keras Concatenate layer appends two tensors into one.

Figure 11.7. The tanh (hyperbolic tangent) function, which clamps its value between –1 and 1

Listing 11.4. A two-input action-value network

11.2.2. Implementing the ϵ-greedy policy with Keras

Listing 11.5. Constructor and utility methods for a Q-learning agent

Listing 11.6. Selecting moves for a Q-learning agent

Listing 11.7. Selecting moves for a Q-learning agent

11.2.3. Training an action-value function

Listing 11.8. Training the Q-learning agent from its experience

11.3. Summary

Table of Contents for
Chapter 11. Reinforcement learning with value methods

Figure 11.5. The Keras `Concatenate` layer appends two tensors into one.