Chapter 9. Learning by practice: reinforcement learning

This chapter covers

  • Defining a task for reinforcement learning
  • Building a learning agent for games
  • Collecting self-play experiences for training

I’ve probably read a dozen books on Go, all written by strong pros from China, Korea, and Japan. And yet I’m just an intermediate amateur player. Why haven’t I reached the level of these legendary players? Have I forgotten their lessons? I don’t think that’s it; I can practically recite Toshiro Kageyama’s Lessons in the Fundamentals of Go (Ishi Press, 1978) by heart. Maybe I just need to read more books....

I don’t know the full recipe for becoming a top Go star, but I know at least one difference between me and Go professionals: practice. A Go player probably clocks in five or ten thousand games before qualifying as a professional. Practice creates knowledge, and sometimes that’s knowledge that you can’t directly communicate. You can summarize that knowledge—that’s what makes it into Go books. But the subtleties get lost in the translation. If I expect to master the lessons I’ve read, I need to put in a similar level of practice.

If practice is so valuable for humans, what about computers? Can a computer program learn by practicing? That’s the promise of reinforcement learning. In reinforcement learning (RL), you improve a program by having it repeatedly attempt a task. When it has good outcomes, you modify the program to repeat its decisions. When it has bad outcomes, you modify the program to avoid those decisions. This doesn’t mean you write new code after each trial: RL algorithms provide automated methods for making those modifications.

Reinforcement learning isn’t a free lunch. For one thing, it’s slow: your bot will need to play thousands of games in order to make a measurable improvement. In addition, the training process is fiddly and hard to debug. But if you put in the effort to make these techniques work for you, the payoff is huge. You can build software that applies sophisticated strategies to tackle a variety of tasks, even if you can’t describe those strategies yourself.

This chapter starts with a bird’s-eye view of the reinforcement-learning cycle. Next, you’ll see how to set up a Go bot to play against itself in a way that fits into the reinforcement-learning process. Chapter 10 shows how to use the self-play data to improve your bot’s performance.

9.1. The reinforcement-learning cycle

Many algorithms implement the mechanics of reinforcement learning, but they all work within a standard framework. This section describes the reinforcement-learning cycle, in which a computer program improves by repeatedly attempting a task. Figure 9.1 illustrates the cycle.

Figure 9.1. The reinforcement-learning cycle. You can implement reinforcement learning in many ways, but the overall process has a common structure. First, a computer program attempts a task repeatedly. The records of these attempts are called experience data. Next, you modify the behavior to imitate the more successful attempts; this process is training. You then periodically evaluate the performance to confirm that the program is improving. Normally, you need to repeat this process for many cycles.

In the language of reinforcement learning, your Go bot is an agent: a program that makes decisions in order to accomplish a task. Earlier in the book, you implemented several versions of an Agent class that could choose Go moves. In those cases, you provided the agent with a situation—a GameState object—and it responded with a decision—a move to play. Although you weren’t using reinforcement learning at that time, the concept of an agent is the same.

The goal of reinforcement learning is to make the agent as effective as possible. In this case, you want your agent to win at Go.

First, you have your Go bot play a batch of games against itself; during each game, it should record every turn and the final outcome. These game records are called its experience.

Next, you train your bot by updating its behavior in response to what happened in its self-play games. This process is similar to training the neural networks covered in chapters 6 and 7. The core idea is that you want the bot to repeat the decisions it made in games it won, and stop making the decisions it made in games it lost. The training algorithm comes as a package deal with the structure of your agent: you need to be able to systematically modify the behavior of your agent in order to train. There are many algorithms for doing this; we cover three in this book. In this chapter and the next, we start with the policy gradient algorithm. In chapter 11, we cover the Q-learning algorithm. Chapter 12 introduces the actor-critic algorithm.

After training, you expect your bot to be a bit stronger. But there are many ways for the training process to go wrong, so it’s a good idea to evaluate the bot’s progress to confirm its strength. To evaluate a game-playing agent, have it play more games. You can pit your agent against earlier versions of itself to measure its progress. As a sanity check, you can also periodically compare your bot to other AIs or play against it yourself.
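
For example, a minimal evaluation script is just a win-counting loop. The sketch below assumes a simulate_game helper like the one you’ll write in section 9.4.2; the filenames and the number of games are placeholders.

import h5py

from dlgo import agent
from dlgo.gotypes import Player

num_eval_games = 100                                      # placeholder
new_bot = agent.load_policy_agent(h5py.File('new_bot.h5', 'r'))
old_bot = agent.load_policy_agent(h5py.File('old_bot.h5', 'r'))

wins = 0
for _ in range(num_eval_games):
    winner = simulate_game(new_bot, old_bot)              # the new bot plays black here
    if winner == Player.black:
        wins += 1
print('New bot won %d out of %d games as black' % (wins, num_eval_games))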

Then you can repeat this entire cycle indefinitely:

  • Collect experience
  • Train
  • Evaluate

We’ll break this cycle into multiple scripts. In this chapter, you’ll implement a self_play script that will simulate the self-play games and save the experience data to disk. In the next chapter, you’ll make a train script that takes the experience data as input, updates the agent accordingly, and saves the new agent.

9.2. What goes into experience?

In chapter 3, you designed a set of data structures for representing Go games. You can imagine how you could store an entire game record by using classes such as Move, GoBoard, and GameState. But reinforcement-learning algorithms are generic: they deal with a highly abstract representation of a problem, so that the same algorithms can apply to as many problem domains as possible. This section shows how to describe game records in the language of reinforcement learning.

In the case of game playing, you can divide your experience into individual games, or episodes. An episode has a clear end, and decisions made during one episode have no bearing on what happens in the next. In other domains, you may not have any obvious way to divide the experience into episodes; for example, a robot that’s designed to operate continuously makes an endless sequence of decisions. You can still apply reinforcement learning to such problems, but the episode boundaries here make it a little simpler.

Within an episode, an agent is faced with a state of its environment. Based on the current state, the agent must select an action. After choosing an action, the agent sees a new state; the next state depends on both the chosen action and whatever else is going on in the environment. In the case of Go, your AI will see a board position (the state), and then select a legal move (an action). After that, the AI will see a new board position on its next turn (the next state).

Note that after the agent chooses an action, the next state also includes the opponent’s move. You can’t determine the next state from the current state and the action you choose: you must also wait for the opponent’s move. The opponent’s behavior is part of the environment that your agent must learn to navigate.

In order to improve, your agent needs feedback about whether it’s achieving its objective. You provide that feedback by calculating its reward, a numerical score for meeting a goal. For your Go AI, the goal is to win a game, so you’ll communicate a reward of 1 each time it wins and –1 each time it loses. Reinforcement-learning algorithms will modify the agent’s behavior so as to increase the amount of reward it accumulates. Figure 9.2 illustrates how a game of Go can be described with states, actions, and rewards.

Figure 9.2. A game of 5 × 5 Go translated into the language of reinforcement learning. The agent that you want to train is the black player. It sees a sequence of states (board positions) and chooses actions (legal moves). At the end of an episode (a complete game), it gets a reward to indicate whether it achieved its goal. In this case, black wins the game, so the agent sees a reward of +1.

Go and similar games are special cases: the reward comes all at once, at the end of the game. And there are only two possible rewards: you win or you lose, and you don’t care about what else happens in the game. In other domains, the reward may be spread out. Imagine making an AI to play Scrabble. On each turn, the AI will place a word and score points, and then its opponent will do the same. In that case, you can compute a positive reward for the AI’s points, and a negative reward for the opponent’s points. Then the AI doesn’t have to wait all the way to the end of an episode for its reward; it gets little pieces of its reward after every action it takes.

A key idea in reinforcement learning is that an action may be responsible for a reward that comes much later. Imagine you make an especially clever play on move 35 of a game, and continue on to win after 200 moves. Your good move early on deserves at least some of the credit for the win. You must somehow split up the credit for the reward over all the moves in the game. The future reward that your agent sees after an action is called the return on that action. To compute the return on an action, you add up all the rewards the agent saw after that action, all the way to the end of the episode, as shown in listing 9.1. This is a way of saying that you don’t know, in advance, which moves are responsible for winning or losing. The onus is on the learning algorithm to split up the credit or blame among individual moves.

Listing 9.1. Calculating return on an action
for exp_idx in range(exp_length):
    total_return[exp_idx] = reward[exp_idx]                   1
    for future_reward_idx in range(exp_idx + 1, exp_length):  2
        total_return[exp_idx] += reward[future_reward_idx]    2

  • 1 reward[i] is the reward the agent saw immediately after action i.
  • 2 Loops over all future rewards and adds them into the return

That assumption doesn’t make sense for every problem. Consider our Scrabble example again. The decisions you make on your first turn could plausibly affect your score on your third turn—maybe you held a high-scoring X in reserve until you could combine it with a bonus square. But it’s hard to see how decisions on your third turn could affect your twentieth. To represent this concept in your return calculation, you can compute a weighted sum of the future rewards from each action. The weights should get smaller as you go further from the action, so that far-future rewards have less influence than immediate rewards.

This technique is called discounting the reward. Listing 9.2 shows how to calculate discounted returns. In that example, each action gets full credit for the reward that comes immediately after. But the reward from the next step counts for only 75% as much; the reward two steps out counts 75% × 75% = 56% as much; and so on. The choice of 75% is just an example; the correct discount rate will depend on your particular domain, and you may need to experiment a bit to find the most effective number.

Listing 9.2. Calculating discounted returns
for exp_idx in range(exp_length):
    discounted_return[exp_idx] = reward[exp_idx]
    discount_amount = 0.75
    for future_reward_idx in range(exp_idx + 1, exp_length):
        discounted_return[exp_idx] += (
            discount_amount * reward[future_reward_idx])
        discount_amount *= 0.75                            1

  • 1 The discount_amount gets smaller and smaller as you get further from the original action.

In the case of building a Go AI, the only possible reward is a win or loss. This lets you take a shortcut in the return calculation. When your agent wins, every action in the game has a return of 1. When your agent loses, every action has a return of –1.
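
In code, the shortcut is a one-liner: every action in an episode simply receives the final reward as its return. A minimal sketch, where final_reward is +1 for a win and –1 for a loss:

total_return = [final_reward for _ in range(num_actions)]   # the same return for every action in the game

This is effectively what the ExperienceCollector you’ll build in section 9.4.1 does when an episode completes.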

9.3. Building an agent that can learn

Reinforcement learning can’t create a Go AI, or any other kind of agent, out of thin air. It can only improve a bot that already works within the parameters of the game. To get started, you need an agent that can at least complete a game. This section shows how to create a Go bot that selects moves by using a neural network. If you start with an untrained network, the bot will play as badly as your original RandomAgent from chapter 3. Later, you can improve this neural network through reinforcement learning.

A policy is a function that selects an action from a given state. In earlier chapters, you saw several implementations of the Agent class that have a select_move function. Each of those select_move functions is a policy: a game state comes in, and a move comes out. All the policies you’ve implemented so far are valid, in the sense that they produce legal moves. But they’re not equally good: the MCTSAgent from chapter 4 will defeat the RandomAgent from chapter 3 more often than not. If you want to improve one of these agents, you need to think of an improvement to the algorithm, write new code, and test it—the standard software development process.

To use reinforcement learning, you need a policy that you can update automatically, using another computer program. In chapter 6, you studied a class of functions that lets you do exactly that: convolutional neural networks. A deep neural network can compute sophisticated logic, and you can modify its behavior by using the gradient descent algorithm.

The move-prediction neural network you designed in chapters 6 and 7 outputs a vector with a value for each point on the board; the value represents the network’s confidence that point would be the next play. How can you form a policy from such an output? One way is to simply select the move with the highest value. This will produce good results if your network has already been trained to select good moves. But it’ll always select the same move for any given board position. This creates a problem for reinforcement learning. To improve through reinforcement learning, you need to select a variety of moves. Some will be better, and some will be worse; you can detect the good moves by looking at the outcomes they produce. But you need the variety in order to improve.

Instead of always selecting the highest-rated move, you want a stochastic policy. Here, stochastic means that if you input the exact same board position twice, your agent may select different moves. This involves randomness, but not in the same way as your RandomAgent from chapter 3. The RandomAgent chose moves with no regard to what was happening in the game. A stochastic policy means that your move selection will depend on the state of the board, but it won’t be 100% predictable.

9.3.1. Sampling from a probability distribution

For any board position, your neural network will give you a vector with one element for each point on the board. To create a policy from this, you can treat each element of the vector as indicating the probability that you select a particular move. This section shows how to select moves according to those probabilities.

For example, if you’re playing rock-paper-scissors, you could follow a policy of choosing rock 50% of the time, paper 30% of the time, and scissors 20% of the time. The 50%-30%-20% split is a probability distribution over the three choices. Note that probabilities sum to exactly 100%: this is because your policy must always choose exactly one item from the list. This is a necessary property of a probability distribution; a 50%-30%-10% policy would leave you with no decision 10% of the time.

The process of randomly selecting one of those items in those proportions is called sampling from that probability distribution. The following listing shows a Python function that will choose one of those options according to that policy.

Listing 9.3. An example of sampling from a probability distribution
import random

def rps():
    randval = random.random()
    if 0.0 <= randval < 0.5:
        return 'rock'
    elif 0.5 <= randval < 0.8:
        return 'paper'
    else:
        return 'scissors'

Try this snippet out a few times and see how it behaves. You’ll see rock more than paper, and paper more than scissors. But all three will appear regularly.
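
For instance, tallying a few thousand calls makes the 50%-30%-20% split visible:

from collections import Counter

counts = Counter(rps() for _ in range(10000))
print(counts)   # roughly 5,000 rock, 3,000 paper, 2,000 scissors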

This logic for sampling from a probability distribution is built into NumPy as the np.random.choice function. The following listing shows the exact same behavior implemented with NumPy.

Listing 9.4. Sampling from a probability distribution with NumPy
import numpy as np

def rps():
    return np.random.choice(
        ['rock', 'paper', 'scissors'],
        p=[0.5, 0.3, 0.2])

In addition, np.random.choice will handle repeated sampling from the same distribution. It’ll sample from your distribution once, remove that item from the list, and sample again from the remaining items. In this way, you get a semirandom ordered list. The high-probability items are likely to appear near the front of the list, but some variety remains. The following listing shows how to get repeated sampling with np.random.choice. You pass size=3 to indicate that you want three different items, and replace=False to indicate that you don’t want any results repeated.

Listing 9.5. Repeatedly sampling from a probability distribution with NumPy
import numpy as np

def repeated_rps():
    return np.random.choice(
        ['rock', 'paper', 'scissors'],
        size=3,
        replace=False,
        p=[0.5, 0.3, 0.2])

The repeated sampling will be useful in case your Go policy recommends an invalid move. In that case, you’ll want to select another one. You can call np.random.choice once and then just work your way down the list it generates.

9.3.2. Clipping a probability distribution

The reinforcement-learning process can be fairly unstable, especially early on. The agent may overreact to a few chance wins and temporarily assign a high probability to moves that really aren’t that good. (In that respect, it’s not unlike human beginners!) It’s possible for the probability for a particular move to go all the way to 1. This creates a subtle problem: because your agent will always select the same move, it has no opportunity to unlearn it.

To prevent this, you’ll clip the probability distribution to make sure no probabilities get pushed all the way to 0 or 1. You did the same with the DeepLearningAgent from chapter 8. The np.clip function from NumPy handles most of the work here.

Listing 9.6. Clipping a probability distribution
def clip_probs(original_probs):
    min_p = 1e-5
    max_p = 1 - min_p
    clipped_probs = np.clip(original_probs, min_p, max_p)
    clipped_probs = clipped_probs / np.sum(clipped_probs)      1
    return clipped_probs

  • 1 Ensure that the result is still a valid probability distribution.
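
As a quick illustration, here’s what clip_probs does to a distribution that has collapsed onto a single move:

import numpy as np

original_probs = np.array([1.0, 0.0, 0.0, 0.0])
print(clip_probs(original_probs))
# Approximately [0.99997, 1e-05, 1e-05, 1e-05]: every move keeps a tiny,
# nonzero probability, and the values still sum to 1.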

9.3.3. Initializing an agent

Let’s start building out a new type of agent, a PolicyAgent, that selects moves according to a stochastic policy and can learn from experience data. Its model can be identical to the move-prediction model from chapters 6 and 7; the only difference is in how you train it. You’ll add this class to your dlgo library in the dlgo/agent/pg.py module.

Recall from the previous chapters that your model needs a matching board-encoding scheme. The PolicyAgent class can accept the model and board encoder in the constructor. This creates a nice separation of concerns. The PolicyAgent class is responsible for selecting moves according to the model and changing its behavior in response to its experience. But it can ignore the details of the model structure and the board-encoding scheme.

Listing 9.7. The constructor for the PolicyAgent class
class PolicyAgent(Agent):
    def __init__(self, model, encoder):
        self._model = model             1
        self._encoder = encoder         2
        self._collector = None          3

  • 1 A Keras Sequential model instance
  • 2 Implements the Encoder interface
  • 3 Starts with no experience collector attached; set_collector (section 9.4.1) fills this in during self-play.

To start the reinforcement-learning process, you first construct a board encoder, then a model, and finally the agent. The following listing shows this process.

Listing 9.8. Constructing a new learning agent
encoder = encoders.simple.SimpleEncoder((board_size, board_size))
model = Sequential()                                              1
for layer in dlgo.networks.large.layers(encoder.shape()):         1
    model.add(layer)                                              1
model.add(Dense(encoder.num_points()))                            2
model.add(Activation('softmax'))                                  2
new_agent = agent.PolicyAgent(model, encoder)

  • 1 Builds a Sequential model out of the layers described in dlgo.networks.large (covered in chapter 6)
  • 2 Adds an output layer that will return a probability distribution over points on the board

When you construct an agent like this, using a newly created model, Keras initializes the model weights to small, random values. At this point, the agent’s policy will be close to uniform random: it’ll choose any valid move with roughly equal probability. Later, training the model will add structure to its decisions.

9.3.4. Loading and saving your agent from disk

The reinforcement-learning process can continue indefinitely; you may spend days or even weeks training your bot. You’ll want to periodically persist your bot to disk so you can start and stop the training process, and compare its performance at different points in the training cycle.

You can use the HDF5 file format, which we introduced in chapter 8, to store your agent. The HDF5 format is a convenient way to store numerical arrays, and it integrates nicely with NumPy and Keras.

A serialize method on your PolicyAgent class can persist its encoder and model to disk, which is enough to re-create the agent.

Listing 9.9. Serializing a PolicyAgent to disk
class PolicyAgent(Agent):
...
    def serialize(self, h5file):
        h5file.create_group('encoder')
        h5file['encoder'].attrs['name'] = self._encoder.name()               1
        h5file['encoder'].attrs['board_width'] = self._encoder.board_width   1
        h5file['encoder'].attrs['board_height'] = self._encoder.board_height 1
        h5file.create_group('model')
        kerasutil.save_model_to_hdf5_group(                                  2
            self._model, h5file['model'])                                    2

  • 1 Stores enough information to reconstruct the board encoder
  • 2 Uses built-in Keras features to persist the model and its weights

The h5file argument could be an h5py.File object, or it could be a group inside an h5py.File. This allows you to bundle other data with the agent in a single HDF5 file.

To use this serialize method, you first create a new HDF5 file, and then pass in the file handle.

Listing 9.10. An example of using the serialize function
import h5py

with h5py.File(output_file, 'w') as outf:
    agent.serialize(outf)
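
Because serialize writes into whatever group you hand it, you can also bundle the agent alongside other data in a single file. A small sketch (the group name 'agent' is just an example):

import h5py

with h5py.File(output_file, 'w') as outf:
    outf.create_group('agent')
    agent.serialize(outf['agent'])
    # Other groups and datasets can live in the same file alongside the agent.

If you save the agent inside a group this way, you’d pass that group (for example, h5file['agent']) to load_policy_agent when reading it back.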

Then a corresponding load_policy_agent function reverses the procedure.

Listing 9.11. Loading a policy agent from a file
def load_policy_agent(h5file):
    model = kerasutil.load_model_from_hdf5_group(           1
        h5file['model'])                                    1
    encoder_name = h5file['encoder'].attrs['name']          2
    board_width = h5file['encoder'].attrs['board_width']    2
    board_height = h5file['encoder'].attrs['board_height']  2
    encoder = encoders.get_encoder_by_name(                 2
        encoder_name,                                       2
        (board_width, board_height))                        2
    return PolicyAgent(model, encoder)                      3

  • 1 Uses built-in Keras functions to load the model structure and weights
  • 2 Recovers the board encoder
  • 3 Reconstructs the agent

9.3.5. Implementing move selection

The PolicyAgent needs one more function before you can begin self-play: the select_move implementation. This function will look similar to the select_move function you added to the DeepLearningAgent from chapter 8. The first step is to encode the board as a tensor (a stack of matrices; see appendix A) suitable for feeding into the model. Next, you feed the board tensor to the model and get back a probability distribution over possible moves. You then clip the distribution to make sure no probability goes all the way to 1 or 0. Figure 9.3 illustrates the flow of this process. Listing 9.12 shows how to implement these steps.

Listing 9.12. Selecting a move with a neural network
class PolicyAgent(Agent):
...
    def select_move(self, game_state):
        board_tensor = self._encoder.encode(game_state)
        X = np.array([board_tensor])                             1
        move_probs = self._model.predict(X)[0]                   1

        move_probs = clip_probs(move_probs)

        num_moves = self._encoder.board_width * self._encoder.board_height  2
        candidates = np.arange(num_moves)                        2
        ranked_moves = np.random.choice(                         3
            candidates, num_moves,                               3
            replace=False, p=move_probs)                         3

        for point_idx in ranked_moves:                           4
            point = self._encoder.decode_point_index(point_idx)  4
            move = goboard.Move.play(point)                      4
            is_valid = game_state.is_valid_move(move)            4
            is_an_eye = is_point_an_eye(                         4
                game_state.board,                                4
                point,                                           4
                game_state.next_player)                          4
            if is_valid and (not is_an_eye):                     4
                return goboard.Move.play(point)                  4
        return goboard.Move.pass_turn()                          5

  • 1 The Keras predict call makes batch predictions, so you wrap your single board in an array and pull out the first item from the resulting array.
  • 2 Creates an array containing the index of every point on the board
  • 3 Samples from the points on the board according to the policy, creates a ranked list of points to try
  • 4 Loops over each point, checks whether it’s a valid move, and picks the first valid one
  • 5 If you fall through here, there are no reasonable moves left.

Figure 9.3. The move-selection process. First you encode a game state as a numerical tensor; then you can pass that tensor to your model to get move probabilities. You sample from all points on the board according to the move probabilities to get an order in which to try the moves.

9.4. Self-play: how a computer program practices

Now that you have a learning agent capable of completing a game, you can begin collecting experience data. For a Go AI, this means playing thousands of games. This section shows how to implement this process. First, we describe some data structures to make handling experience data more convenient. Next, we show how to implement the self-play driver program.

9.4.1. Representing experience data

Experience data contains three parts: states, actions, and rewards. To help keep these organized, you can create a single data structure that holds all three of these together.

The ExperienceBuffer class is a minimal container for an experience data set. It has three attributes: states, actions, and rewards. All of these are represented as NumPy arrays; your agent will be responsible for encoding its states and actions as numerical structures. The ExperienceBuffer is nothing more than a container for passing the data set around. Nothing in this implementation is specific to policy gradient learning; you can reuse this class with other RL algorithms in later chapters. So you’ll add this class to the dlgo/rl/experience.py module.

Listing 9.13. Constructor for an experience buffer
class ExperienceBuffer:
    def __init__(self, states, actions, rewards):
        self.states = states
        self.actions = actions
        self.rewards = rewards

After you’ve collected a large experience buffer, you’ll want a way to persist it to disk. The HDF5 file format is a perfect fit once again. You can add a serialize method to the ExperienceBuffer class.

Listing 9.14. Saving an experience buffer to disk
class ExperienceBuffer:
...
    def serialize(self, h5file):
        h5file.create_group('experience')
        h5file['experience'].create_dataset(
            'states', data=self.states)
        h5file['experience'].create_dataset(
            'actions', data=self.actions)
        h5file['experience'].create_dataset(
            'rewards', data=self.rewards)

You’ll also need a corresponding function, load_experience, to read the experience buffer back out of the file. Note that you cast each data set to np.array when reading it: that’ll read the entire dataset into memory.

Listing 9.15. Restoring an ExperienceBuffer from an HDF5 file
def load_experience(h5file):
    return ExperienceBuffer(
        states=np.array(h5file['experience']['states']),
        actions=np.array(h5file['experience']['actions']),
        rewards=np.array(h5file['experience']['rewards']))
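
Saving and restoring an experience buffer follows the same pattern as the agent itself; a minimal round trip might look like this (the filename is a placeholder):

import h5py

with h5py.File('experience.h5', 'w') as outf:
    buffer.serialize(outf)

with h5py.File('experience.h5', 'r') as inf:
    buffer = load_experience(inf)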

Now you have a simple container for passing around experience data. You still need a way to fill it with your agent’s decisions. The complication is that the agent makes decisions one at a time, but it doesn’t get a reward until the game is over and you know who won. To resolve this, you need to keep track of all the decisions from the current episode until it’s complete. One option is to put this logic directly in the agent, but that would clutter up the implementation of PolicyAgent. Instead, you can separate it out into a dedicated ExperienceCollector object whose sole responsibility is episode-by-episode bookkeeping.

The ExperienceCollector implements four methods:

  • begin_episode and complete_episode, which are called by the self-play driver to indicate the start and end of a single game.
  • record_decision, which is called by the agent to indicate a single action it chose.
  • to_buffer, which packages up everything the ExperienceCollector has recorded and returns an ExperienceBuffer. The self-play driver will call this at the end of a self-play session.

The full implementation appears in the following listing.

Listing 9.16. An object to track decisions within a single episode
class ExperienceCollector:
    def __init__(self):
        self.states = []
        self.actions = []
        self.rewards = []
        self.current_episode_states = []
        self.current_episode_actions = []

    def begin_episode(self):
        self.current_episode_states = []
        self.current_episode_actions = []

    def record_decision(self, state, action):
        self.current_episode_states.append(state)             1
        self.current_episode_actions.append(action)           1

    def complete_episode(self, reward):
        num_states = len(self.current_episode_states)
        self.states += self.current_episode_states
        self.actions += self.current_episode_actions
        self.rewards += [reward for _ in range(num_states)]   2

        self.current_episode_states = []
        self.current_episode_actions = []

    def to_buffer(self):
        return ExperienceBuffer(
            states=np.array(self.states),
            actions=np.array(self.actions),               3
            rewards=np.array(self.rewards)                3
        )

  • 1 Saves a single decision in the current episode; the agent is responsible for encoding the state and action.
  • 2 Spreads the final reward across every action in the game
  • 3 The ExperienceCollector accumulates Python lists; this converts them to NumPy arrays.
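
To see how the pieces fit together, here’s the collector’s life cycle in isolation, with made-up arrays standing in for encoded boards and made-up integers standing in for move indices:

import numpy as np

collector = ExperienceCollector()
collector.begin_episode()
collector.record_decision(state=np.zeros((11, 5, 5)), action=12)   # fake encoded board, fake move index
collector.record_decision(state=np.zeros((11, 5, 5)), action=7)
collector.complete_episode(reward=1)                               # the agent won this episode

buffer = collector.to_buffer()
print(buffer.states.shape)   # (2, 11, 5, 5)
print(buffer.rewards)        # [1 1]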

To integrate the ExperienceCollector with your agent, you can add a set_collector method that tells the agent where to send its experiences. Then inside select_move, the agent will notify the collector every time it makes a decision.

Listing 9.17. Integrating an ExperienceCollector with a PolicyAgent
class PolicyAgent(Agent):
...
    def set_collector(self, collector):        1
        self._collector = collector            1
...
    def select_move(self, game_state):
...
        if self._collector is not None:        2
            self._collector.record_decision(   2
                state=board_tensor,            2
                action=point_idx)              2
        return goboard.Move.play(point)

  • 1 Allows the self-play driver program to attach a collector to the agent
  • 2 At the time it chooses a move, notifies the collector of the decision

9.4.2. Simulating games

The next step is playing the games. You’ve done this twice before in the book: in the bot_v_bot demo in chapter 3, and as part of the Monte Carlo tree-search implementation in chapter 4. You can use the same simulate_game implementation here.

Listing 9.18. Simulating a game between two agents
def simulate_game(black_player, white_player):
    game = GameState.new_game(BOARD_SIZE)
    agents = {
        Player.black: black_player,
        Player.white: white_player,
    }
    while not game.is_over():
        next_move = agents[game.next_player].select_move(game)
        game = game.apply_move(next_move)
    game_result = scoring.compute_game_result(game)
    return game_result.winner

In this function, black_player and white_player could be any instance of your Agent class. You can match up the PolicyAgent that you’re training against any opponent you like. Theoretically, the opponent could be a human player, although it’d take ages to collect enough experience data that way. Or your learner could play against a third-party Go bot, perhaps using the GTP framework from chapter 8 to handle the communications.

You can also just match up your learning agent with a copy of itself. Besides the simplicity of this solution, there are two specific advantages.

First, reinforcement learning needs plenty of both successes and failures to learn from. Imagine playing your first-ever game of chess or Go against a grandmaster. As a novice, you’d be so far behind it would be impossible to tell where you went wrong, and the experienced player could probably make a few mistakes and still win comfortably. As a result, neither player would learn much from the game. Instead, beginners usually start against other beginners and work their way up slowly. The same principle applies in reinforcement learning. When your bot plays itself, it’ll always have an equal-strength opponent.

Second, by playing your agent against itself, you get two games for the price of one. Because the same decision-making process went into both sides of the game, you can learn from both the winning side and the losing side. You’ll need huge volumes of games for reinforcement learning, so generating them twice as fast is a nice bonus.

To start the self-play process, you construct two copies of your agent and assign them each an ExperienceCollector. Each agent needs its own collector because the two agents will see different rewards at the end of a game. Listing 9.19 shows this initialization step.

Reinforcement learning beyond games

Self-play is a great technique for collecting experience data for board games. In other domains, you’ll need to separately build a simulated environment to run your agent. For example, if you want to use reinforcement learning to build a control system for a robot, you’d need a detailed simulation of the physical environment the robot will operate in.

If you want to experiment further with reinforcement learning, the OpenAI Gym (https://github.com/openai/gym) is a useful resource. It provides environments for a variety of board games, video games, and physical simulations.
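
For reference, here’s the state-action-reward loop expressed in the classic Gym API, with a purely random policy. (Newer Gym and Gymnasium releases return extra values from reset and step, so adjust for your version.)

import gym

env = gym.make('CartPole-v0')
state = env.reset()                                   # the initial state
done = False
total_reward = 0
while not done:                                       # one episode
    action = env.action_space.sample()                # a real agent would choose based on state
    state, reward, done, info = env.step(action)      # next state and reward
    total_reward += reward
print(total_reward)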

Listing 9.19. Initialization for generating a batch of experience
agent1 = agent.load_policy_agent(h5py.File(agent_filename, 'r'))
agent2 = agent.load_policy_agent(h5py.File(agent_filename, 'r'))
collector1 = rl.ExperienceCollector()
collector2 = rl.ExperienceCollector()
agent1.set_collector(collector1)
agent2.set_collector(collector2)

Now you’re ready to implement the main loop that simulates the self-play games. In this loop, agent1 will always play as black, while agent2 will always play as white. This is fine, so long as agent1 and agent2 are identical and you intend to combine their experiences for training. If your learning agent is playing against a separate reference agent, you’ll want it to alternate between black and white: in Go, black and white have slightly different personalities because black moves first, so a learning agent needs to practice from both sides. (A color-alternating variation is sketched after listing 9.20.)

Listing 9.20. Playing a batch of games
for i in range(num_games):
    collector1.begin_episode()
    collector2.begin_episode()

    winner = simulate_game(agent1, agent2)
    if winner == Player.black:
        collector1.complete_episode(reward=1)       1
        collector2.complete_episode(reward=-1)      1
    else:
        collector2.complete_episode(reward=1)       2
        collector1.complete_episode(reward=-1)      2

  • 1 agent1 won the game, so it gets a positive reward.
  • 2 agent2 won the game.
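
If your learner were instead facing a fixed reference opponent, you’d alternate colors between games, as mentioned earlier. A minimal variation of the loop might look like this (learner and reference are placeholder agent names, and only the learner collects experience):

for i in range(num_games):
    collector1.begin_episode()

    if i % 2 == 0:
        winner = simulate_game(learner, reference)    # the learner plays black on even games...
        learner_color = Player.black
    else:
        winner = simulate_game(reference, learner)    # ...and white on odd games
        learner_color = Player.white

    reward = 1 if winner == learner_color else -1
    collector1.complete_episode(reward=reward)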

When the self-play is complete, the last step is to combine all the collected experience and save it in a file. That file provides the input for the training script, which we cover in the next chapter.

Listing 9.21. Saving a batch of experience data
experience = rl.combine_experience([                           1
    collector1,
    collector2])
with h5py.File(experience_filename, 'w') as experience_outf:   2
    experience.serialize(experience_outf)

  • 1 Merges both agents’ experience into a single buffer
  • 2 Saves into an HDF5 file
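
The combine_experience helper isn’t shown in this chapter; a minimal sketch that’s consistent with how it’s called above would concatenate each collector’s lists into one ExperienceBuffer. You could place it in dlgo/rl/experience.py next to the classes you’ve already written:

import numpy as np

def combine_experience(collectors):
    # Concatenate every collector's accumulated states, actions, and
    # rewards into single NumPy arrays.
    combined_states = np.concatenate([np.array(c.states) for c in collectors])
    combined_actions = np.concatenate([np.array(c.actions) for c in collectors])
    combined_rewards = np.concatenate([np.array(c.rewards) for c in collectors])
    return ExperienceBuffer(
        states=combined_states,
        actions=combined_actions,
        rewards=combined_rewards)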

At this point, you’re ready to generate self-play games. The next chapter shows you how to start improving your bot from the self-play data.

9.5. Summary

  • An agent is a computer program that’s supposed to accomplish a certain task. For example, our Go-playing AI is an agent with the goal of winning games of Go.
  • The reinforcement-learning cycle involves collecting experience data, training the agent from the experience data, and evaluating the updated agent. At the end of a cycle, you expect a small improvement in your agent’s performance. Ideally, you can repeat this cycle many times to continually improve your agent.
  • To apply reinforcement learning to a problem, you must describe the problem in terms of states, actions, and rewards.
  • Rewards are the way you control the behavior of your reinforcement-learning agent. You can provide positive rewards for outcomes you want your agent to achieve, and negative rewards for outcomes you want your agent to avoid.
  • A policy is a rule for making decisions from a given state. In a Go AI, the algorithm that selects a move from a board position is its policy.
  • You can make a policy out of a neural network by treating the output vector as a probability distribution over possible actions, and then sampling from the probability distribution.
  • When applying reinforcement learning to games, you can collect experience data through self-play: your agent plays games against a copy of itself.