This chapter covers
Have you ever read an expert commentary on a high-level chess or Go tournament game? You’ll often see comments like, “Black is far behind at this point” or “The result up to here is slightly better for white.” What does it mean to be “ahead” or “behind” in the middle of such a strategy game? This isn’t basketball, with a running score to refer to. Instead, the commentator means that the board position is favorable to one player or the other. If you want to be precise, you could define it with a thought experiment. Find a hundred evenly matched pairs of players. Give each pair the board position from the middle of the game, and tell them to start playing from there. If the player taking black wins a small majority of the games—say, 55 out of 100—you can say the position was slightly good for black.
Of course, the commentators are doing no such thing. Instead, they’re relying on their own intuition, built up over thousands of games, to make a judgment on what might happen. In this chapter, we show how to train a computer game player to make similar judgments. And the computer will learn to do it in much the same way a human will: by playing many, many games.
This chapter introduces the Q-learning algorithm. Q-learning is a method for training a reinforcement-learning agent to anticipate how much reward it can expect in the future. (In the context of games, reward means winning games.) First, we describe how a Q-learning agent makes decisions and improves over time. After that, we show how to implement Q-learning within the Keras framework. Then you’ll be ready to train another self-improving game AI, with a different personality from the policy learning agent from chapter 10.
Suppose you have a function that tells you your chances of winning after playing a particular move. That’s called an action-value function—it tells you how valuable a particular action is. Then playing the game would be easy: you’d just pick the highest-value move on each turn. The question, then, is how you come up with an action-value function.
This section describes Q-learning, a technique for training an action-value function through reinforcement learning. Of course, you can never learn the true action-value function for moves in Go: that would require reading out the entire game tree, with all its trillions of trillions of trillions of possibilities. But you can iteratively improve an estimate of the action-value function through self-play. As the estimate gets more accurate, a bot that relies on the estimate will get stronger.
The name Q-learning comes from standard mathematical notation. Traditionally, Q(s,a) is used to represent the action-value function. This is a function of two variables: s represents the state the agent is faced with (for example, a board position); a represents an action the agent is considering (a possible move to play next). Figure 11.1 illustrates the inputs to an action-value function. This chapter focuses on deep Q-learning, in which you use a neural network to estimate the Q function. But most of the principles also apply to classical Q-learning, where you approximate the Q function with a simple table that has a row for each possible state and a column for each possible action.
In the previous section, you studied reinforcement learning by directly learning a policy—a rule for selecting moves. The structure of Q-learning will seem familiar. First you make an agent play against itself, recording all the decisions and game results; the game results tell you something about whether the decisions were good; then you update the agent’s behavior accordingly. Q-learning differs from policy learning in the way the agent makes decisions in a game, and the way it updates its behavior based on the results.
To build a game-playing agent out of a Q function, you need to turn the Q function into a policy. One option is to plug in every possible move to the Q function and pick the move with the highest expected return, as shown in figure 11.2. This policy is called a greedy policy.
If you have confidence in your estimates of the action values, a greedy policy is the best choice. But in order to improve your estimates, you need your bot to occasionally explore the unknown. This is called an ϵ-greedy policy: some fraction ϵ of the time, the policy chooses completely randomly; the rest of the time, it’s a normal greedy policy. Figure 11.3 shows this procedure as a flowchart.
ϵ is the Greek letter epsilon, often used to represent a small fraction.
def select_action(state, epsilon): possible_actions = get_possible_actions(state) if random.random() < epsilon: 1 return random.choice(possible_actions) 1 best_action = None 2 best_value = MIN_VALUE 2 for action in get_possible_actions(state): 2 action_value = self.estimate_action_value(state, action) 2 if action_value > best_value: 2 best_action = action 2 best_value = action_value 2 return best_action
The choice of ϵ represents a trade-off. When it’s close to 0, the agent chooses whatever the best moves are, according to its current estimate of the action-value; but the agent won’t have any opportunity to try new moves and thereby improve its estimates. When ϵ is higher, the agent will lose more games, but in exchange it’ll learn about many unknown moves.
You can make an analogy with the way humans learn a skill, whether it’s playing Go or playing the piano. It’s common for human learners to hit a plateau—a point where they’re comfortable with a certain range of skills but have stopped improving. To get over the hump, you need to force yourself out of your comfort zone and experiment with new things. Maybe that’s new fingerings and rhythms on a piano, or new openings and tactics in Go. Your performance may get worse while you find yourself in unfamiliar situations, but after you learn how the new techniques work, you come out stronger than before.
In Q-learning, you generally start with a fairly high value of ϵ, perhaps 0.5. As the agent improves, you gradually decrease ϵ. Note that if ϵ drops all the way to 0, your agent will stop learning: it’ll just play out the same game over and over again.
After you’ve generated a large set of games, the training process for Q-learning is similar to supervised learning. The actions that the agent took provide your training set, and you can treat the game outcomes as known good labels for the data. Of course, there’ll be some lucky wins in the training set; but over the course of thousands of games, you can rely on an equal number of losses to cancel them out.
In the same way that the move-prediction model from chapter 7 learned to predict human moves from games it had never seen before, the action-value model can learn to predict the value of moves it has never played before. You can use the game outcomes as the target for the training process, as shown in figure 11.4. To get it to generalize in this way, you need an appropriate neural network design and plenty of training data.
This section shows how to implement the Q-learning algorithm in the Keras framework. So far, you’ve used Keras to train functions that have one input and one output. Because an action-value function has two inputs, you’ll need to use new Keras features to design an appropriate network. After introducing two-input networks in Keras, we show how to evaluate moves, assemble training data, and train your agent.
In previous chapters, you used the Keras Sequential model to define your neural networks. The following listing shows an example model defined with the sequential API.
from keras.models import Sequential from keras.layers import Dense model = Sequential() model.add(Dense(32, input_shape=(19, 19))) model.add(Dense(24))
Keras provides a second API for defining a neural network: the functional API. The functional API provides a superset of the functionality of the sequential API. You can rewrite any sequential network in the functional style, and you can also create complex networks that can’t be described in the sequential style.
The main difference is in the way you specify the connections between layers. To connect layers in a sequential model, you repeatedly call add on the model object; that automatically connects the last layer’s output to the input of the new layer. To connect layers in a functional model, you pass the input layer to the next layer with syntax that looks like a function call. Because you’re explicitly creating each connection, you can describe more-complex networks. The following listing shows how to create an identical network to listing 11.2 by using the functional style.
from keras.models import Model from keras.layers import Dense, Input model_input = Input(shape=(19, 19)) hidden_layer = Dense(32)(model_input) 1 output_layer = Dense(24)(hidden_layer) 2 model = Model(inputs=[model_input], outputs=[output_layer])
These two models are identical. The sequential API is a convenient way to describe the most common neural networks, and the functional API provides the flexibility to specify multiple inputs and outputs, or complex connections.
Because your action-value network has two inputs and one output, at some point you need to merge the two input chains together. The Keras Concatenate layer lets you accomplish this. A Concatenate layer doesn’t do any computation; it just glues together two vectors or tensors into one, as shown in figure 11.5. It takes an optional axis argument that specifies which dimension to concatenate; it defaults to the last dimension, which is what you want in this case. All other dimensions must be the same size.
Now you can design a network for learning an action-value function. Recall the convolutional networks you used for move prediction in chapters 6 and 7. You can conceptually chunk the network into two steps: first the convolutional layers identify the important shapes of stones on the board; then a dense layer makes a decision based on those shapes. Figure 11.6 shows how the layers in the move-prediction network perform two separate roles.
For the action-value network, you still want to process the board into important shapes and groups of stones. Any shape that’s relevant for move prediction is likely to be relevant for estimating the action-value as well, so this part of the network can borrow the same structure. The difference comes in the decision-making step. Instead of making a decision based only on the identified groups of stones, you want to estimate a value based on the processed board and the proposed action. So you can bring in the proposed move vector after the convolutional layers. Figure 11.8 illustrates such a network.
Because you use –1 to represent a loss, and 1 to represent a win, the action-value should be a single value in the range of –1 to 1. To achieve this, you add a Dense layer of size 1 with a tanh activation. You may know tanh as the hyperbolic tangent function from trigonometry. In deep learning, you don’t care about the trigonometric properties of tanh at all. Instead, you use it because it’s a smooth function that’s bounded below by –1 and above by 1. No matter what the early layers of the network compute, the output will be in the desired range. Figure 11.7 shows a plot of the tanh function.
The full specification for your action-value network looks like the following listing.
from keras.models import Model from keras.layers import Conv2D, Dense, Flatten, Input from keras.layers import ZeroPadding2D, concatenate board_input = Input(shape=encoder.shape(), name='board_input') action_input = Input(shape=(encoder.num_points(),), name='action_input') conv1a = ZeroPadding2D((2, 2))(board_input) 1 conv1b = Conv2D(64, (5, 5), activation='relu')(conv1a) 1 1 conv2a = ZeroPadding2D((1, 1))(conv1b) 1 conv2b = Conv2D(64, (3, 3), actionvation='relu')(conv2a) 1 flat = Flatten()(conv2b) processed_board = Dense(512)(flat) board_and_action = concatenate([action_input, processed_board]) hidden_layer = Dense(256, activation='relu')(board_and_action) 2 value_output = Dense(1, activation='tanh')(hidden_layer) 3 model = Model(inputs=[board_input, action_input], outputs=value_output)
Let’s start building a QAgent that can learn via Q-learning; this code will live in the dlgo/rl/q.py module. Listing 11.5 shows the constructor: just as in our policy learning agent, it takes a model and a board encoder. You also define two utility methods. The set_temperature method lets you change the value of ϵ, which you’ll want to vary across the training process. Just as in chapter 9, the set_collector method lets you attach an ExperienceCollector object to store the experience data for later training.
class QAgent(Agent): def __init__(self, model, encoder): self.model = model self.encoder = encoder self.collector = None self.temperature = 0.0 def set_temperature(self, temperature): 1 self.temperature = temperature 1 def set_collector(self, collector): 2 self.collector = collector 2
Next, you implement the ϵ-greedy policy. Instead of just picking the top-rated move, you sort all the moves and try them in order. As in chapter 9, this prevents the agent from self-destructing at the end of a won game.
class QAgent(Agent): ... def select_move(self, game_state): board_tensor = self.encoder.encode(game_state) moves = [] 1 board_tensors = [] 1 for move in game_state.legal_moves(): 1 if not move.is_play: 1 continue 1 moves.append(self.encoder.encode_point(move.point)) 1 board_tensors.append(board_tensor) 1 if not moves: 2 return goboard.Move.pass_turn() 2 num_moves = len(moves) board_tensors = np.array(board_tensors) move_vectors = np.zeros( 3 (num_moves, self.encoder.num_points())) 3 for i, move in enumerate(moves): 3 move_vectors[i][move] = 1 3 values = self.model.predict( 4 [board_tensors, move_vectors]) 4 values = values.reshape(len(moves)) 4 5 ranked_moves = self.rank_moves_eps_greedy(values) 6 for move_idx in ranked_moves: 7 point = self.encoder.decode_point_index( 7 moves[move_idx]) 7 if not is_point_an_eye(game_state.board, 7 point, 7 game_state.next_player): 7 if self.collector is not None: 8 self.collector.record_decision( 8 state=board_tensor, 8 action=moves[move_idx], 8 ) 8 return goboard.Move.play(point) return goboard.Move.pass_turn() 9
The structure of the select_move implementation is similar to some of the tree-search algorithms we covered in chapter 4. For example, alpha-beta search relies on a board evaluation function: a function that takes a board position and estimates which player is ahead and by how much. This is similar, but not identical, to the action-value function we’ve covered in this chapter. Suppose the agent is playing as black and evaluating some move X. It gets an estimated action-value of 0.65 for X. Now, you know exactly what the board will look like after playing move X, and you know that a win for black is a loss for white. So you can say that the next board position has a value of –0.65 for white.
Mathematically, we describe the relationship as follows:
where s’ is the state that white sees after black chooses move a.
Although Q-learning in general can apply to any environment, this equivalence between the action-value of one state and value of the next state is true only in deterministic games.
Chapter 12 covers a third reinforcement-learning technique that includes learning a value function directly, instead of an action-value function. Chapters 13 and 14 show methods for integrating such a value function with a tree-search algorithm.
All that remains is the code to sort the moves in order from most valuable to least valuable. The complication is that you have two parallel arrays: values and moves. The NumPy argsort function provides a convenient way to handle this. Instead of sorting an array in place, argsort returns a list of indices. Then you can read off the elements of the parallel array according to those indices. Figure 11.9 illustrates how argsort works. Listing 11.7 shows how to rank the moves by using argsort.
class QAgent(Agent): ... def rank_moves_eps_greedy(self, values): if np.random.random() < self.temperature: 1 values = np.random.random(values.shape) 1 ranked_moves = np.argsort(values) 2 return ranked_moves[::-1] 3
With this in place, you’re ready to generate self-play games with your Q-learning agent. Next, we cover how to train the action-value network.
After you have a batch of experience data, you’re ready to update the agent’s network. With policy gradient learning, you knew the approximate gradient that you wanted, but you had to come up with a complicated scheme to apply that gradient update inside the Keras framework. In contrast, training with Q-learning is a straightforward application of the Keras fit function. You can directly put the game results into the target vector.
Chapter 6 covered two loss functions: mean squared error and cross-entropy loss. You used cross-entropy loss when you wanted to match one out of a discrete set of items: in that case, you were trying to match one of the points on a Go board. The Q function, on the other hand, has a continuous value that can be anywhere in the range –1 to 1. For this problem, we prefer mean squared error.
The following listing shows the implementation of a train function for the QAgent class.
class QAgent(Agent): ... def train(self, experience, lr=0.1, batch_size=128): 1 opt = SGD(lr=lr) self.model.compile(loss='mse', optimizer=opt) 2 n = experience.states.shape[0] num_moves = self.encoder.num_points() y = np.zeros((n,)) actions = np.zeros((n, num_moves)) for i in range(n): action = experience.actions[i] reward = experience.rewards[i] actions[i][action] = 1 y[i] = reward self.model.fit( [experience.states, actions], y, 3 batch_size=batch_size, epochs=1)