Chapter 7. Learning from data: a deep-learning bot

This chapter covers

  • Downloading and processing actual Go game records
  • Understanding the standard format for storing Go games
  • Training a deep-learning model for move prediction with such data
  • Using sophisticated Go board encoders to create strong bots
  • Running your own experiments and evaluating them

In the preceding chapter, you saw many essential ingredients for building a deep-learning application, and you built a few neural networks to test the tools you learned about. One of the key things you’re still missing is good data to learn from. A supervised deep neural network is only as good as the data you feed it—and so far, you’ve had only self-generated data at your disposal.

In this chapter, you’ll learn about the most common data format for Go data, the Smart Game Format (SGF). You can obtain historical game records in SGF from practically every popular Go server. To power a deep neural network for Go move prediction, in this chapter you’ll download many SGF files from a Go server, encode them in a smart way, and train a neural network with this data. The resulting trained network will be much stronger than any previous models in earlier chapters.

Figure 7.1 illustrates what you'll be able to build by the end of this chapter.

Figure 7.1. Building a deep-learning Go bot, using real-world Go data for training. You can find game records from public Go servers to use for training a bot. In this chapter, you’ll learn how to find those records, transform them into a training set, and train a Keras model to imitate the human players’ decisions.

By the end of this chapter, you'll be able to run your own experiments with complex neural networks and build a strong bot completely on your own. To get started, you need access to real-world Go data.

7.1. Importing Go game records

All the Go data you used up to this point has been generated by yourself. In the preceding chapter, you trained a deep neural network to predict moves for generated data. The best you could hope for was that your network could perfectly predict these moves, in which case the network would play as well as your tree-search algorithm that generated the data. In a way, the data you feed into the network provides an upper bound to a deep-learning bot trained from it. The bot can’t outperform the strength of the players generating the data.

By using records of games played by strong human opponents as input to your deep neural networks, you can considerably improve the strength of your bots. You’ll use game data from the KGS Go Server (formerly known as Kiseido Go Server), one of the most popular Go platforms in the world. Before explaining how to download and process data from KGS, we’ll first introduce you to the data format your Go data comes in.

7.1.1. The SGF file format

The Smart Game Format (SGF), initially called Smart Go Format, has been developed since the late 80s. Its current, fourth major release (denoted FF[4]) was released in the late 90s. SGF is a straightforward, text-based format that can be used to express games of Go, variations of Go games (for instance, extended game commentaries by professional players), and other board games. For the rest of this chapter, you’ll assume that the SGF files you’re dealing with consist of Go games without any variations. In this section, we teach you a few basics of this rich game format, but if you want to learn more about it, start with https://senseis.xmp.net/?SmartGameFormat at Sensei’s Library.

At its core, SGF consists of metadata about the game and the moves played. You specify a piece of metadata with a property name of one or two capital letters, followed by its value in square brackets. For instance, a Go game played on a board of size (SZ) 9 × 9 would be encoded as SZ[9] in SGF. Go moves are encoded as follows: a white move on the third row and third column of the board is W[cc] in SGF, whereas a black move on the seventh row and third column is represented as B[gc]; the letters B and W stand for stone color, and the row and column coordinates are indexed alphabetically. To represent a pass, you use the empty moves B[] and W[].
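This coordinate scheme fits in a few lines of Python. The helper below is a sketch (not part of the dlgo codebase) that decodes a two-letter SGF coordinate into zero-based (row, col) indices, reading the row letter first as in this chapter's examples:

```python
def decode_sgf_coords(coords):
    """Decode a two-letter SGF coordinate such as 'gc' into zero-based
    (row, col) indices, reading the row letter first, as in this
    chapter's examples. An empty string encodes a pass (B[] or W[])."""
    if coords == '':
        return None
    row = ord(coords[0]) - ord('a')
    col = ord(coords[1]) - ord('a')
    return row, col


print(decode_sgf_coords('gc'))  # (6, 2): seventh row, third column, zero-based
```

The gosgf module you'll meet later in this chapter handles this decoding (and much more) for you; this sketch only illustrates the alphabetical indexing.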

The following example of an SGF file is taken from the complete 9 × 9 example at the end of chapter 2. It shows a game of Go (Go has game number 1, or GM[1], in SGF) in the current SGF version (FF[4]), played on a 9 × 9 board with zero handicap (HA[0]), and 6.5 points komi for white as compensation for black getting the first move (KM[6.5]). The game is played under Japanese rule settings (RU[Japanese]) and results (RE) in a 9.5-point win by white (RE[W+9.5]):

(;FF[4] GM[1] SZ[9] HA[0] KM[6.5] RU[Japanese] RE[W+9.5]
;B[gc];W[cc];B[cg];W[gg];B[hf];W[gf];B[hg];W[hh];B[ge];W[df];B[dg]
;W[eh];B[cf];W[be];B[eg];W[fh];B[de];W[ec];B[fb];W[eb];B[ea];W[da]
;B[fa];W[cb];B[bf];W[fc];B[gb];W[fe];B[gd];W[ig];B[bd];W[he];B[ff]
;W[fg];B[ef];W[hd];B[fd];W[bi];B[bh];W[bc];B[cd];W[dc];B[ac];W[ab]
;B[ad];W[hc];B[ci];W[ed];B[ee];W[dh];B[ch];W[di];B[hb];W[ib];B[ha]
;W[ic];B[dd];W[ia];B[];
 TW[aa][ba][bb][ca][db][ei][fi][gh][gi][hf][hg][hi][id][ie][if]
  [ih][ii]
 TB[ae][af][ag][ah][ai][be][bg][bi][ce][df][fe][ga]
 W[])

An SGF file is organized as a list of nodes, which are separated by semicolons. The first node contains metadata about the game: board size, rule set, game result, and other background information. Each following node represents a move in the game. Whitespace is completely unimportant; you could collapse the whole example string into a single line, and it would still be valid SGF. At the end, you also see the points that belong to white’s territory, listed under TW, and the ones that belong to black, under TB. Note that the territory indicators are part of the same node as white’s last move (W[], indicating a pass): you can consider them as a sort of comment on that position in the game.
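Because whitespace is irrelevant and nodes are separated by semicolons, a rough first pass over an SGF string needs little more than a regular expression. The following is a deliberate simplification, not a real SGF parser: it ignores escaping rules and multivalued properties such as TW[aa][ba], but it shows the node structure clearly:

```python
import re


def parse_sgf_nodes(sgf_text):
    """Split a simplified SGF string into nodes and extract the
    (property, value) pairs of each node. Escaped brackets and
    multivalued properties such as TW[aa][ba] aren't handled here."""
    body = sgf_text.strip().lstrip('(').rstrip(')')
    nodes = []
    for chunk in body.split(';'):
        props = re.findall(r'([A-Z]+)\[([^\]]*)\]', chunk)
        if props:
            nodes.append(props)
    return nodes


nodes = parse_sgf_nodes("(;FF[4] GM[1] SZ[9];B[gc];W[cc];B[])")
print(nodes[0])  # [('FF', '4'), ('GM', '1'), ('SZ', '9')]
print(nodes[1])  # [('B', 'gc')]
```

The first node carries the metadata; every following node is a single move, with B[] marking a pass, exactly as described above.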

This example illustrates some of the core properties of SGF files, and shows everything you’ll need for replaying game records in order to generate training data. The SGF format supports many more features, but those are mainly useful for adding commentary and annotations to game records, so you won’t need them for this book.

7.1.2. Downloading and replaying Go game records from KGS

If you go to the page https://u-go.net/gamerecords/, you'll see a table of game records available for download in various formats (zip, tar.gz). This game data has been collected from the KGS Go Server since 2001 and contains only games in which at least one of the players was 7 dan or above, or both players were 6 dan. Recall from chapter 2 that dan ranks are master ranks, ranging from 1 dan to 9 dan, so these are games played by strong players. Also note that all of these games were played on a 19 × 19 board, whereas in chapter 6 you used only data generated for the much less complex situation of a 9 × 9 board.

This is an incredibly powerful data set for Go move prediction, which you’ll use in this chapter to power a strong deep-learning bot. You want to download this data in an automated fashion by fetching the HTML containing the links to individual files, unpacking the files, and then processing the SGF game records contained in them.

As a first step toward using this data as input to deep-learning models, you create a new submodule called data within your main dlgo module and, as usual, provide it with an empty __init__.py. This submodule will contain everything related to Go data processing needed for this book.

Next, to download game data, you create a class called KGSIndex in the new file index_processor.py within the data submodule. Because this step is entirely technical and contributes to neither your Go nor your machine-learning knowledge, we omit the implementation here. If you’re interested in the details, the code can be found in our GitHub repository. The KGSIndex implementation found there has precisely one method that you’ll use later: download_files. This method will mirror the page https://u-go.net/gamerecords/ locally, find all relevant download links, and then download the respective tar.gz files in a separate folder called data. Here’s how you can call it.

Listing 7.1. Creating an index of zip files containing Go data from KGS
from dlgo.data.index_processor import KGSIndex


index = KGSIndex()
index.download_files()

Running this should result in a command-line output that looks as follows:

>>> Downloading index page
KGS-2017_12-19-1488-.tar.gz 1488
KGS-2017_11-19-945-.tar.gz 945

...

>>> Downloading data/KGS-2017_12-19-1488-.tar.gz
>>> Downloading data/KGS-2017_11-19-945-.tar.gz

...
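The scraping step at the heart of download_files is conceptually simple. The sketch below is a hypothetical simplification (the actual KGSIndex implementation in the repository handles more cases, and the sample HTML here is made up): it just pulls the .tar.gz links out of the index page's HTML.

```python
import re


def find_archive_links(index_html):
    """Pull the .tar.gz download links out of the index page HTML.
    A hypothetical simplification of what KGSIndex does internally."""
    return re.findall(r'href="([^"]+\.tar\.gz)"', index_html)


sample_html = ('<a href="https://dl.u-go.net/gamerecords/'
               'KGS-2017_12-19-1488-.tar.gz">Download</a>')
print(find_archive_links(sample_html))
```

Each extracted link is then fetched and saved under the local data directory, which is exactly the output you saw above.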

Now that you have this data stored locally, let’s move on to processing it for use in a neural network.

7.2. Preparing Go data for deep learning

In chapter 6, you saw a simple encoder for Go data that was already presented in terms of the Board and GameState classes introduced in chapter 3. When working with SGF files, you first have to extract their content (the unpacking we referred to earlier) and replay a game from them, so that you can create the necessary game state information for your Go playing framework.

7.2.1. Replaying a Go game from an SGF record

Reading out an SGF file for Go game state information means understanding and implementing the format specifications. Although this isn’t particularly hard to do (in the end, it’s just imposing a fixed set of rules on a string of text), it’s also not the most exciting aspect of building a Go bot and takes a lot of effort and time to do flawlessly. For these reasons, we’ll introduce another submodule within dlgo called gosgf that’s responsible for handling all the logic of processing SGF files. We treat this submodule as a black box within this chapter and refer you to our GitHub repository for more information on how to read and interpret SGF with Python.

Note

The gosgf module is adapted from the Gomill Python library, available at https://mjw.woodcraft.me.uk/gomill/.

You'll need precisely one entity from gosgf: the Sgf_game class, which is sufficient to process everything you need. Let's see how you can use Sgf_game to load a sample SGF game, read out game information move by move, and apply the moves to a GameState object. Figure 7.2 shows the beginning of a Go game in terms of SGF commands.

Figure 7.2. Replaying a game record from an SGF file. The original SGF file encodes game moves with strings such as B[ee]. The Sgf_game class decodes those strings and returns them as Python tuples. You can then apply these moves to your GameState object to reconstruct the game, as shown in the following listing.

Listing 7.2. Replaying moves from an SGF file with your Go framework
from dlgo.gosgf import Sgf_game                            1
from dlgo.goboard_fast import GameState, Move
from dlgo.gotypes import Point
from dlgo.utils import print_board


sgf_content = ("(;GM[1]FF[4]SZ[9];B[ee];W[ef];B[ff]" +    2
               ";W[df];B[fe];W[fc];B[ec];W[gd];B[fb])")

sgf_game = Sgf_game.from_string(sgf_content)               3

game_state = GameState.new_game(9)

for item in sgf_game.main_sequence_iter():                4

    color, move_tuple = item.get_move()                   5
    if color is not None and move_tuple is not None:
        row, col = move_tuple
        point = Point(row + 1, col + 1)
        move = Move.play(point)
        game_state = game_state.apply_move(move)          6
        print_board(game_state.board)

  • 1 Import the Sgf_game class from the new gosgf module first.
  • 2 Define a sample SGF string. This content will come from downloaded data later.
  • 3 With the from_string method, you can create an Sgf_game.
  • 4 Iterate over the game’s main sequence; you ignore variations and commentaries.
  • 5 Items in this main sequence come as (color, move) pairs, where “move” is a pair of board coordinates.
  • 6 The read-out move can then be applied to your current game state.

In essence, after you have a valid SGF string, you create a game from it, whose main sequence you can iterate over and process however you want. Listing 7.2 is central to this chapter and gives you a rough outline for how you’ll proceed to process Go data for deep learning:

  1. Download and unpack the compressed Go game files.
  2. Iterate over each SGF file contained in these files, read them as Python strings, and create an Sgf_game from these strings.
  3. Read out the main sequence of the Go game for each SGF string, make sure to take care of important details such as placing handicap stones, and feed the resulting move data into a GameState object.
  4. For each move, encode the current board information as features with an Encoder and store the move itself as a label, before placing it on the board. This way, you’ll create move prediction data for deep learning on the fly.
  5. Store the resulting features and labels in a suitable format so you can pick it up later and feed it into a deep neural network.

Throughout the next few sections, you’ll tackle these five tasks in great detail. After processing data like this, you can go back to your move-prediction application and see how this data affects move-prediction accuracy.

7.2.2. Building a Go data processor

In this section, you’ll build a Go data processor that can transform raw SGF data into features and labels for a machine-learning algorithm. This is going to be a relatively long implementation, so we split it into several parts. When you’re finished, you’ll have everything ready to run a deep-learning model on real data.

To get started, create a new file called processor.py within your new data submodule. As before, it’s also completely fine to just follow the implementation here on a copy of processor.py from the GitHub repository. Let’s import a few core Python libraries that you’ll work with in processor.py. Apart from NumPy for data, you’ll need quite a few packages for processing files.

Listing 7.3. Python libraries needed for data and file processing
import os.path
import tarfile
import gzip
import glob
import shutil

import numpy as np
from keras.utils import to_categorical

As for functionality needed from dlgo itself, you need to import many of the core abstractions you’ve built so far.

Listing 7.4. Imports for data processing from the dlgo module
from dlgo.gosgf import Sgf_game
from dlgo.goboard_fast import Board, GameState, Move
from dlgo.gotypes import Player, Point
from dlgo.encoders.base import get_encoder_by_name

from dlgo.data.index_processor import KGSIndex
from dlgo.data.sampling import Sampler                 1
from dlgo.data.generator import DataGenerator          2

  • 1 Sampler will be used to sample training and test data from files.
  • 2 DataGenerator will be used later in this chapter to serve data in mini-batches.

We haven't yet discussed the last two imports in the listing (Sampler and DataGenerator) but will introduce them as we build our Go data processor. Continuing with processor.py, a GoDataProcessor is initialized with the name of an Encoder and a data_directory to store SGF data in.

Listing 7.5. Initializing a Go data processor with an encoder and a local data directory
class GoDataProcessor:
    def __init__(self, encoder='oneplane', data_directory='data'):
        self.encoder = get_encoder_by_name(encoder, 19)
        self.data_dir = data_directory

Next, you'll implement the main data-processing method, called load_go_data. In this method, you can specify the number of games you'd like to process, as well as the type of data to load, meaning either training or test data. load_go_data will download Go records from KGS, sample the specified number of games, process them by creating features and labels, and then persist the result locally as NumPy arrays.

Listing 7.6. load_go_data loads, processes, and stores data
    def load_go_data(self, data_type='train',                          1
                     num_samples=1000):                                2
        index = KGSIndex(data_directory=self.data_dir)
        index.download_files()                                         3

        sampler = Sampler(data_dir=self.data_dir)
        data = sampler.draw_data(data_type, num_samples)               4

        zip_names = set()
        indices_by_zip_name = {}
        for filename, index in data:
            zip_names.add(filename)                                    5
            if filename not in indices_by_zip_name:
                indices_by_zip_name[filename] = []
            indices_by_zip_name[filename].append(index)                6
        for zip_name in zip_names:
            base_name = zip_name.replace('.tar.gz', '')
            data_file_name = base_name + data_type
            if not os.path.isfile(self.data_dir + '/' + data_file_name):
                self.process_zip(zip_name, data_file_name,
                                 indices_by_zip_name[zip_name])        7

        features_and_labels = self.consolidate_games(data_type, data)  8
        return features_and_labels

  • 1 For data_type, you can choose either train or test.
  • 2 num_samples refers to the number of games to load data from.
  • 3 Download all games from KGS to your local data directory. If data is available, it won’t be downloaded again.
  • 4 The Sampler instance selects the specified number of games for a data type.
  • 5 Collect all zip file names contained in the data in a set.
  • 6 Group all SGF file indices by zip file name.
  • 7 The zip files are then processed individually.
  • 8 Features and labels from each zip are then aggregated and returned.

Note that after downloading the data, you split it with a Sampler instance. The sampler randomly picks the specified number of games but, more importantly, ensures that training and test data don't overlap in any way. Sampler does that by splitting training and test data at the file level: games played prior to 2014 are declared test data, and newer games become training data. That way, you make absolutely sure that no game information in the test data is also (even partly) included in the training data, which could otherwise lead to misleadingly optimistic evaluation results.

Splitting training and test data

The reason you split data into training and test data is to obtain reliable performance metrics. You train a model on training data and evaluate it on test data to see how well the model adapts to previously unseen situations, how well it extrapolates from what it learned in the training phase to the real world. Proper data collection and split is crucially important to trust the results you get from a model.

It can be tempting to just load all the data you have, shuffle it, and randomly split it into training and test data. Depending on the problem at hand, this naive approach may or may not be a good idea. If you think of Go game records, the moves within a single game depend on each other. Training a model on a set of moves that are also included in the test set can lead to the illusion of having found a strong model. But it may turn out that your bot won’t be as strong in practice. Make sure to spend time analyzing your data and find a split that makes sense.
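The year-based rule described above fits in a few lines. The following is a sketch, not the actual Sampler implementation, and the archive counts in the sample filenames are made up; only the naming pattern matches the KGS index:

```python
def split_by_year(zip_names, cutoff_year=2014):
    """Send whole archives to the test set (games played before the
    cutoff year) or the training set (newer games). A sketch of the
    rule Sampler applies, not the real implementation."""
    train, test = [], []
    for name in zip_names:
        year = int(name.split('-')[1].split('_')[0])  # 'KGS-2013_06-...' -> 2013
        (test if year < cutoff_year else train).append(name)
    return train, test


train, test = split_by_year(
    ['KGS-2017_12-19-1488-.tar.gz', 'KGS-2013_06-19-1505-.tar.gz'])
print(train, test)
```

Because whole archives (and therefore whole games) land on one side of the split, no move from a test game can leak into training.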

After downloading and sampling data, load_go_data relies essentially on helpers to process data: process_zip to read out individual zip files, and consolidate_games to group the results from each zip into one set of features and labels. Let’s have a look at process_zip next, which carries out the following steps for you:

  1. Unzip the current file by using unzip_data.
  2. Initialize an Encoder instance to encode SGF records.
  3. Initialize feature and label NumPy arrays of the right shape.
  4. Iterate through the game list and process games one by one.
  5. For each game, first apply all handicap stones.
  6. Then read out each move as found in the SGF record.
  7. For each next move, encode the move as label.
  8. Encode the current board state as feature.
  9. Apply the next move to the board and proceed.
  10. Store small chunks of features and labels in the local filesystem.

Here’s how you implement the first nine of these steps in process_zip. Note that the technical utility method unzip_data has been omitted for brevity, but can be found in our GitHub repository. In figure 7.3, you see how processing zipped SGF files into an encoded game state works.

Figure 7.3. The process_zip function. You iterate over a zip file that contains many SGF files. Each SGF file contains a sequence of game moves; you use those to reconstruct GameState objects. Then you use an Encoder object to convert each game state to a NumPy array.

Next, you can define process_zip.

Listing 7.7. Processing Go records stored in zip files into encoded features and labels
    def process_zip(self, zip_file_name, data_file_name, game_list):
        tar_file = self.unzip_data(zip_file_name)
        zip_file = tarfile.open(self.data_dir + '/' + tar_file)
        name_list = zip_file.getnames()
        total_examples = self.num_total_examples(zip_file, game_list,
                                                 name_list)                1

        shape = self.encoder.shape()                                       2
        feature_shape = np.insert(shape, 0, np.asarray([total_examples]))
        features = np.zeros(feature_shape)
        labels = np.zeros((total_examples,))

        counter = 0
        for index in game_list:
            name = name_list[index + 1]
            if not name.endswith('.sgf'):
                raise ValueError(name + ' is not a valid sgf')
            sgf_content = zip_file.extractfile(name).read()
            sgf = Sgf_game.from_string(sgf_content)                        3

            game_state, first_move_done = self.get_handicap(sgf)           4

            for item in sgf.main_sequence_iter():                          5
                color, move_tuple = item.get_move()
                point = None
                if color is not None:
                    if move_tuple is not None:                             6
                        row, col = move_tuple
                        point = Point(row + 1, col + 1)
                        move = Move.play(point)
                    else:
                        move = Move.pass_turn()                            7
                    if first_move_done and point is not None:
                        features[counter] = self.encoder.encode(game_state)8
                        labels[counter] = self.encoder.encode_point(point) 9
                        counter += 1
                    game_state = game_state.apply_move(move)               10
                    first_move_done = True

  • 1 Determines the total number of moves in all games in this zip file
  • 2 Infers the shape of features and labels from the encoder you use
  • 3 Reads the SGF content as string, after extracting the zip file
  • 4 Infers the initial game state by applying all handicap stones
  • 5 Iterates over all moves in the SGF file
  • 6 Reads the coordinates of the stone to be played...
  • 7 ...or passes, if there is none
  • 8 Encodes the current game state as features...
  • 9 ...and the next move as label for the features
  • 10 Afterward, the move is applied to the board, and you proceed with the next one.

Note how closely the for loop resembles the process you sketched in listing 7.2, so this code should feel familiar to you. process_zip uses two helper methods that you’ll implement next. The first one is num_total_examples, which precomputes the number of moves available per zip file so that you can efficiently determine the size of feature and label arrays.

Listing 7.8. Calculating the total number of moves available in the current zip file
    def num_total_examples(self, zip_file, game_list, name_list):
        total_examples = 0
        for index in game_list:
            name = name_list[index + 1]
            if name.endswith('.sgf'):
                sgf_content = zip_file.extractfile(name).read()
                sgf = Sgf_game.from_string(sgf_content)
                game_state, first_move_done = self.get_handicap(sgf)

                num_moves = 0
                for item in sgf.main_sequence_iter():
                    color, move = item.get_move()
                    if color is not None:
                        if first_move_done:
                            num_moves += 1
                        first_move_done = True
                total_examples = total_examples + num_moves
            else:
                raise ValueError(name + ' is not a valid sgf')
        return total_examples

You use the second helper method to figure out the number of handicap stones the current game has and apply these moves to an empty board.

Listing 7.9. Retrieving handicap stones and applying them to an empty Go board
    @staticmethod
    def get_handicap(sgf):
        go_board = Board(19, 19)
        first_move_done = False
        move = None
        game_state = GameState.new_game(19)
        if sgf.get_handicap() is not None and sgf.get_handicap() != 0:
            for setup in sgf.get_root().get_setup_stones():
                for move in setup:
                    row, col = move
                    go_board.place_stone(Player.black,
                                         Point(row + 1, col + 1))
            first_move_done = True
            game_state = GameState(go_board, Player.white, None, move)
        return game_state, first_move_done

To finish the implementation of process_zip, you store chunks of features and labels in separate files.

Listing 7.10. Persisting features and labels locally in small chunks
        feature_file_base = self.data_dir + '/' + data_file_name + '_features_%d'
        label_file_base = self.data_dir + '/' + data_file_name + '_labels_%d'

        chunk = 0  # Due to files with large content, split up after chunksize
        chunksize = 1024
        while features.shape[0] >= chunksize:                              1
            feature_file = feature_file_base % chunk
            label_file = label_file_base % chunk
            chunk += 1
            current_features, features = features[:chunksize], \
                                         features[chunksize:]
            current_labels, labels = labels[:chunksize], labels[chunksize:] 2
            np.save(feature_file, current_features)
            np.save(label_file, current_labels)                            3

  • 1 You process features and labels in chunks of size 1024.
  • 2 The current chunk is cut off from features and labels...
  • 3 ... and then stored in a separate file.

The reason you store small chunks is that NumPy arrays can become large quickly, and storing data in smaller files enables more flexibility later. For instance, you could either consolidate data for all chunks or load chunks into memory as needed. You’ll work with both approaches. Although the latter—dynamically loading batches of data as you go—is a little more intricate, consolidating data is straightforward. As a side note, in our implementation you potentially lose the last fraction of a chunk in the while loop, but this is insubstantial because you have more than enough data at your disposal.
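To make the chunking concrete, here's the same while-loop pattern run on a toy array: 2,500 examples yield two chunks of 1,024 each, and the remaining 452 examples are dropped.

```python
import numpy as np

features = np.zeros((2500, 1, 19, 19))  # stand-in for encoded features
labels = np.zeros((2500,))
chunksize = 1024

chunks = 0
while features.shape[0] >= chunksize:
    current_features, features = features[:chunksize], features[chunksize:]
    current_labels, labels = labels[:chunksize], labels[chunksize:]
    chunks += 1  # here the real code would np.save the current chunk

print(chunks, features.shape[0])  # 2 chunks of 1,024; 452 examples left over
```

With hundreds of thousands of games available, losing under one chunk's worth of moves per zip file has no measurable effect on training.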

Continuing with processor.py and our definition of GoDataProcessor, you simply concatenate all arrays into one.

Listing 7.11. Consolidating individual NumPy arrays of features and labels into one set
    def consolidate_games(self, data_type, samples):
        files_needed = set(file_name for file_name, index in samples)
        file_names = []
        for zip_file_name in files_needed:
            file_name = zip_file_name.replace('.tar.gz', '') + data_type
            file_names.append(file_name)

        feature_list = []
        label_list = []
        for file_name in file_names:
            file_prefix = file_name.replace('.tar.gz', '')
            base = self.data_dir + '/' + file_prefix + '_features_*.npy'
            for feature_file in glob.glob(base):
                label_file = feature_file.replace('features', 'labels')
                x = np.load(feature_file)
                y = np.load(label_file)
                x = x.astype('float32')
                y = to_categorical(y.astype(int), 19 * 19)
                feature_list.append(x)
                label_list.append(y)
        features = np.concatenate(feature_list, axis=0)
        labels = np.concatenate(label_list, axis=0)
        np.save('{}/features_{}.npy'.format(self.data_dir, data_type),
                features)
        np.save('{}/labels_{}.npy'.format(self.data_dir, data_type), labels)

        return features, labels

You can test this implementation by loading features and labels for 100 games as follows.

Listing 7.12. Loading training data from 100 game records
from dlgo.data.processor import GoDataProcessor

processor = GoDataProcessor()
features, labels = processor.load_go_data('train', 100)

These features and labels have been encoded with your oneplane encoder from chapter 6, meaning they have exactly the same structure. In particular, you can go ahead and train any of the networks you created in chapter 6 with the data you just created. Don’t expect too much in terms of evaluation performance if you do so. Although this real-world game data is much better than the games generated in chapter 6, you’re now working with 19 × 19 Go data, which is much more complex than games played on 9 × 9 boards.

The procedure of loading a lot of smaller files into memory for consolidation can potentially lead to out-of-memory exceptions when loading large amounts of data. You’ll address this issue in the next section by using a data generator to provide just the next mini-batch of data needed for model training.

7.2.3. Building a Go data generator to load data efficiently

The KGS index you downloaded from https://u-go.net/gamerecords/ contains well over 170,000 games, translating into many millions of Go moves to predict. Loading all of these data points into a single pair of NumPy arrays will become increasingly difficult as you load more and more game records. Your approach to consolidate games is doomed to break down at some point.
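A rough estimate shows the problem. Assuming about 170,000 games with an average of roughly 200 moves each (both ballpark figures, not exact counts), even the simple one-plane encoder from chapter 6 produces far more feature data than fits in typical RAM:

```python
games = 170_000         # rough size of the KGS index; not an exact count
moves_per_game = 200    # ballpark average; the true figure varies

examples = games * moves_per_game
bytes_per_example = 1 * 19 * 19 * 4   # one 19x19 plane of float32 values
total_gb = examples * bytes_per_example / 10 ** 9

print(examples, round(total_gb, 1))   # 34 million examples, roughly 49 GB
```

Richer encoders with many feature planes multiply this figure further, so holding everything in one array is clearly off the table.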

Instead, we suggest a smart replacement for consolidate_games in your GoDataProcessor. Note that in the end, all a neural network needs for training is that you feed it mini-batches of features and labels one by one. There’s no need to keep data in memory at all times. So, what you’re going to build next is a generator for Go data. If you know the concept of generators from Python, you’ll immediately recognize the pattern of what you’re building. If not, think of a generator as a function that efficiently provides you with just the next batch of data you need, when you need it.
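If you haven't worked with Python generators before, here's the pattern in miniature: a function containing yield doesn't run when called, but returns a generator object that produces one value at a time, keeping only the current batch in memory.

```python
def minibatches(data, batch_size):
    """Yield successive slices of data; only one batch exists at a time."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


gen = minibatches(list(range(10)), batch_size=4)
print(next(gen))  # [0, 1, 2, 3]
print(next(gen))  # [4, 5, 6, 7]
```

Your DataGenerator will do exactly this, except that each batch is a pair of NumPy arrays loaded from the chunk files on disk.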

To start, let’s initialize a DataGenerator. Put this code into generator.py inside the data module. You initialize such a generator by providing a local data_directory and samples as provided by your Sampler in GoDataProcessor.

Listing 7.13. The signature of a Go data generator
import glob
import numpy as np
from keras.utils import to_categorical


class DataGenerator:
    def __init__(self, data_directory, samples):
        self.data_directory = data_directory
        self.samples = samples
        self.files = set(file_name for file_name, index in samples)    1
        self.num_samples = None

    def get_num_samples(self, batch_size=128, num_classes=19 * 19):    2
        if self.num_samples is not None:
            return self.num_samples
        else:
            self.num_samples = 0
            for X, y in self._generate(batch_size=batch_size,
                                       num_classes=num_classes):
                self.num_samples += X.shape[0]
            return self.num_samples

  • 1 Your generator has access to a set of files that you sampled earlier.
  • 2 Depending on the application, you may need to know how many examples you have.

Next, you'll implement a private _generate method that creates and returns batches of data. This method follows roughly the same logic as consolidate_games, with one important difference: whereas previously you created one big NumPy array for features and another for labels, you now yield only the next batch of data.

Listing 7.14. Private method to generate and yield the next batch of Go data
    def _generate(self, batch_size, num_classes):
        for zip_file_name in self.files:
            file_name = zip_file_name.replace('.tar.gz', '') + 'train'
            base = self.data_directory + '/' + file_name + '_features_*.npy'
            for feature_file in glob.glob(base):
                label_file = feature_file.replace('features', 'labels')
                x = np.load(feature_file)
                y = np.load(label_file)
                x = x.astype('float32')
                y = to_categorical(y.astype(int), num_classes)
                while x.shape[0] >= batch_size:
                    x_batch, x = x[:batch_size], x[batch_size:]
                    y_batch, y = y[:batch_size], y[batch_size:]
                    yield x_batch, y_batch                        1

  • 1 You return, or yield, batches of data as you go.
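Incidentally, to_categorical, used in the listing, just one-hot encodes the integer move labels. A plain NumPy equivalent of what it does, for illustration only:

```python
import numpy as np

def one_hot(labels, num_classes):
    """One-hot encode integer labels, like keras.utils.to_categorical."""
    result = np.zeros((len(labels), num_classes))
    result[np.arange(len(labels)), labels] = 1
    return result

moves = np.array([0, 3, 1])           # three move indices on a tiny 2 x 2 board
print(one_hot(moves, num_classes=4))  # shape (3, 4): one 1 per row
```

Each row has exactly one 1, at the index of the move that was played.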

All that’s missing from your generator is a public method that returns a generator object. With a generator in hand, you can repeatedly obtain batches of data by calling Python’s built-in next() on it. This is done as follows.

Listing 7.15. Calling the generate method to obtain a generator for model training
    def generate(self, batch_size=128, num_classes=19 * 19):
        while True:
            for item in self._generate(batch_size, num_classes):
                yield item

Before we can show you how to use such a generator to train a neural network, we have to explain how to incorporate this concept into your GoDataProcessor.

7.2.4. Parallel Go data processing and generators

You may have noticed that loading just 100 game records in listing 7.3 feels a little slower than you may have expected. Although naturally you need to download the data first, it’s the processing itself that’s relatively slow. Recall from your implementation that you process zip files sequentially. After you finish a file, you proceed to the next. But if you look closely, the processing of Go data as we presented it is what you’d call embarrassingly parallel. It takes just a little effort to process the zip files in parallel by distributing workload across all CPUs in your computer; for instance, using Python’s multiprocessing library.
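As a rough sketch of the idea (the file names and the per-file function here are illustrative placeholders, not the actual dlgo implementation), Python’s multiprocessing.Pool can hand one zip file to each worker process:

```python
import multiprocessing

def process_zip(zip_file_name):
    # Placeholder for the real per-file work: unpack the archive, replay
    # the games, and write feature/label arrays to disk.
    return zip_file_name.replace('.tar.gz', '') + ' processed'

if __name__ == '__main__':
    zip_files = ['KGS-2015.tar.gz', 'KGS-2016.tar.gz', 'KGS-2017.tar.gz']
    # One worker per CPU; each call to process_zip runs in its own process.
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        results = pool.map(process_zip, zip_files)
    print(results)
```

Because each zip file is processed independently, no coordination between workers is needed, which is what makes the problem embarrassingly parallel.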

In our GitHub repository, you’ll find a parallel implementation of GoDataProcessor in the data module in parallel_processor.py. If you’re interested in how this works in detail, we encourage you to go through the implementation provided there. The reason we omit the details here is that although the speedup of parallelization is of immediate benefit to you, the implementation details make the code quite a bit harder to read.

Another benefit that you get from using the parallel version of GoDataProcessor is that you can optionally use your DataGenerator with it, to return a generator instead of data.

Listing 7.16. The parallel version of load_go_data can optionally return a generator
    def load_go_data(self, data_type='train', num_samples=1000,
                     use_generator=False):
        index = KGSIndex(data_directory=self.data_dir)
        index.download_files()

        sampler = Sampler(data_dir=self.data_dir)
        data = sampler.draw_data(data_type, num_samples)

        self.map_to_workers(data_type, data)                              1
        if use_generator:
            generator = DataGenerator(self.data_dir, data)
            return generator                                              2
        else:
            features_and_labels = self.consolidate_games(data_type, data)
            return features_and_labels                                    3

  • 1 Map workload to CPUs.
  • 2 Either returns a Go data generator...
  • 3 ...or returns consolidated data as before

With the exception of the use_generator flag in the parallel extension, both GoDataProcessor versions share the same interface. Through GoDataProcessor from dlgo.data.parallel_processor, you can now use a generator to provide Go data as follows.

Listing 7.17. Loading training data from 100 game records
from dlgo.data.parallel_processor import GoDataProcessor

processor = GoDataProcessor()
generator = processor.load_go_data('train', 100, use_generator=True)

print(generator.get_num_samples())
generator = generator.generate(batch_size=10)
X, y = next(generator)

Initially loading the data still takes time, although it should speed up proportionally to the number of processors in your machine. After the generator has been created, calling next() returns batches instantly. This way, you also avoid running out of memory.

7.3. Training a deep-learning model on human game-play data

Now that you have access to high-dan Go data and processed it to fit a move-prediction model, let’s connect the dots and build a deep neural network for this data. In our GitHub repository, you’ll find a module called networks within our dlgo package that you’ll use to provide example architectures of neural networks that you can use as baselines to build strong move-prediction models. For instance, you’ll find three convolutional neural networks of varying complexity in the networks module, called small.py, medium.py, and large.py. Each of these files contains a layers function that returns a list of layers that you can add to a sequential Keras model.

You’ll build a convolutional neural network consisting of four convolutional layers, followed by a final dense layer, all with ReLU activations. On top of that, you’ll use a new utility layer right before each convolutional layer—a ZeroPadding2D layer. Zero padding is an operation in which the input features are padded with 0s. Let’s say you use your one-plane encoder from chapter 6 to encode the board as a 19 × 19 matrix. If you specify a padding of 2, that means you add two columns of 0s to the left and right, as well as two rows of 0s to the top and bottom of that matrix, resulting in an enlarged 23 × 23 matrix. You use zero padding in this situation to artificially increase the input of a convolutional layer, so that the convolution operation doesn’t shrink the image by too much.
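You can verify the shape arithmetic directly with NumPy; np.pad does to a single plane what Keras’s ZeroPadding2D does to every input plane:

```python
import numpy as np

board = np.zeros((19, 19))   # one encoded 19 x 19 plane
padded = np.pad(board, pad_width=2, mode='constant', constant_values=0)
print(padded.shape)          # (23, 23): two rows/columns of 0s on every side
# A 5 x 5 convolution then maps the 23 x 23 input back to a 19 x 19 output.
```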

Before we show you the code, we have to discuss a small technicality. Recall that both input and output of convolutional layers are four-dimensional: you provide a mini-batch of a certain number of filters, each of which is two-dimensional (it has width and height). The order in which these four dimensions (mini-batch size, number of filters, width, and height) are represented is a matter of convention, and you mainly find two such orderings in practice. Note that filters are also often referred to as channels (C), and the mini-batch size is also called the number (N) of examples. Moreover, you can use shorthand for width (W) and height (H). With this notation, the two predominant orderings are NWHC and NCWH. In Keras, this ordering is called data_format: NWHC is called channels_last, and NCWH channels_first, for somewhat obvious reasons. Now, your first Go board encoder, the one-plane encoder, follows the channels_first convention (an encoded board has shape (1, 19, 19), meaning the single encoded plane comes first). That means you have to provide data_format=channels_first as an argument to all convolutional layers. Let’s have a look at what the model from small.py looks like.
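The two orderings are just axis permutations of each other. For instance, converting a batch from channels_first to channels_last with NumPy:

```python
import numpy as np

# NCWH (channels_first): a batch of 128 one-plane encoded boards.
batch = np.zeros((128, 1, 19, 19))
# Move the channel axis to the back to get channels_last.
channels_last = np.transpose(batch, (0, 2, 3, 1))
print(channels_last.shape)  # (128, 19, 19, 1)
```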

Listing 7.18. Specifying layers for a small convolutional network for Go move prediction
from keras.layers.core import Dense, Activation, Flatten
from keras.layers.convolutional import Conv2D, ZeroPadding2D


def layers(input_shape):
    return [
        ZeroPadding2D(padding=3, input_shape=input_shape,
                      data_format='channels_first'),                1
        Conv2D(48, (7, 7), data_format='channels_first'),
        Activation('relu'),

        ZeroPadding2D(padding=2, data_format='channels_first'),     2
        Conv2D(32, (5, 5), data_format='channels_first'),
        Activation('relu'),

        ZeroPadding2D(padding=2, data_format='channels_first'),
        Conv2D(32, (5, 5), data_format='channels_first'),
        Activation('relu'),

        ZeroPadding2D(padding=2, data_format='channels_first'),
        Conv2D(32, (5, 5), data_format='channels_first'),
        Activation('relu'),

        Flatten(),
        Dense(512),
        Activation('relu'),
    ]

  • 1 Use zero padding layers to enlarge input images.
  • 2 By using channels_first, you specify that the input plane dimension for your features comes first.

The layers function returns a list of Keras layers that you can add one by one to a Sequential model. Using these layers, you can now build an application that carries out the first five steps from the overview in figure 7.1—an application that downloads, extracts, and encodes Go data and uses it to train a neural network. For the training part, you’ll use the data generator you built. But first, let’s import some of the essential components of your growing Go machine-learning library. You need a Go data processor, an encoder, and a neural network architecture to build this application.

Listing 7.19. Core imports for building a neural network for Go data
from dlgo.data.parallel_processor import GoDataProcessor
from dlgo.encoders.oneplane import OnePlaneEncoder

from dlgo.networks import small
from keras.models import Sequential
from keras.layers.core import Dense
from keras.callbacks import ModelCheckpoint         1

  • 1 With model checkpoints, you can store progress for time-consuming experiments.

The last of these imports provides a handy Keras tool called ModelCheckpoint. Because you have access to a large amount of data for training, completing a full run of training a model for some epochs can take a few hours or even days. If such an experiment fails for some reason, you better have a backup in place. And that’s precisely what model checkpoints do for you: they persist a snapshot of your model after each epoch of training. Even if something fails, you can resume training from the last checkpoint.

Next, let’s define training and test data. To do so, you first initialize a OnePlaneEncoder that you use to create a GoDataProcessor. With this processor, you can instantiate a training and a testing data generator that you’ll use with a Keras model.

Listing 7.20. Creating training and test generators
go_board_rows, go_board_cols = 19, 19
num_classes = go_board_rows * go_board_cols
num_games = 100

encoder = OnePlaneEncoder((go_board_rows, go_board_cols))                  1

processor = GoDataProcessor(encoder=encoder.name())                        2

generator = processor.load_go_data('train', num_games, use_generator=True) 3
test_generator = processor.load_go_data('test', num_games, use_generator=True)

  • 1 First you create an encoder of board size.
  • 2 Then you initialize a Go data processor with it.
  • 3 From the processor, you create two data generators, for training and testing.

As a next step, you define a neural network with Keras by using the layers function from dlgo.networks.small. You add the layers of this small network one by one to a new sequential network, and then finish off by adding a final Dense layer with softmax activation. You then compile this model with categorical cross-entropy loss and train it with SGD.

Listing 7.21. Defining a Keras model from your small layer architecture
input_shape = (encoder.num_planes, go_board_rows, go_board_cols)
network_layers = small.layers(input_shape)
model = Sequential()
for layer in network_layers:
    model.add(layer)
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd',
     metrics=['accuracy'])

Training a Keras model with generators works a little differently from training with data sets held in memory. Instead of calling fit on your model, you now call fit_generator, and you replace evaluate with evaluate_generator. Moreover, the signatures of these methods differ slightly from what you’ve seen before. Using fit_generator works by specifying a generator, the number of epochs, and the number of training steps per epoch, which you provide with steps_per_epoch. These three arguments provide the bare minimum to train a model. You also want to validate your training process on test data. For this, you provide validation_data with your test data generator and specify the number of validation steps per epoch as validation_steps. Lastly, you add a callback to your model. Callbacks allow you to track and return additional information during the training process. You use callbacks here to hook in the ModelCheckpoint utility to store the Keras model after each epoch. As an example, you train a model for five epochs with a batch size of 128.

Listing 7.22. Fitting and evaluating Keras models with generators
epochs = 5
batch_size = 128
model.fit_generator(
  generator=generator.generate(batch_size, num_classes),            1
  epochs=epochs,
  steps_per_epoch=generator.get_num_samples() // batch_size,        2
  validation_data=test_generator.generate(
    batch_size, num_classes),                                       3
  validation_steps=test_generator.get_num_samples() // batch_size,  4
  callbacks=[
    ModelCheckpoint('../checkpoints/small_model_epoch_{epoch}.h5')
  ])                                                                5
model.evaluate_generator(
  generator=test_generator.generate(batch_size, num_classes),
  steps=test_generator.get_num_samples() // batch_size)             6

  • 1 You specify a training data generator for your batch size...
  • 2 ...and the number of training steps per epoch you execute.
  • 3 An additional generator is used for validation...
  • 4 ...which also needs a number of steps.
  • 5 After each epoch, you persist a checkpoint of the model.
  • 6 For evaluation, you also specify a generator and the number of steps.

Note that if you run this code yourself, you should be aware of the time it may take to complete this experiment. If you run this on a CPU, training a single epoch might take a few hours. As it happens, the math used in machine learning has a lot in common with the math used in computer graphics, so in some cases you can move your neural network computation onto your GPU. Doing so usually speeds up training of convolutional neural networks by one or two orders of magnitude. TensorFlow has extensive support for moving computation onto certain GPUs, if your machine has suitable drivers available.

Note

If you want to use a GPU for machine learning, an NVIDIA chip with a Windows or Linux OS is the best-supported combination. Other combinations are possible, but you may spend a lot of time fiddling with drivers.

In case you don’t want to try this yourself, or just don’t want to do this right now, we’ve precomputed this model for you. Have a look at our GitHub repository to see the five checkpoint models stored in checkpoints, one for each completed epoch. Here’s the output of that training run (computed on an old CPU on a laptop, to encourage you to get a fast GPU right away):

Epoch 1/5
12288/12288 [==============================] - 14053s 1s/step - loss: 3.5514
 - acc: 0.2834 - val_loss: 2.5023 - val_acc: 0.6669
Epoch 2/5
12288/12288 [==============================] - 15808s 1s/step - loss: 0.3028
 - acc: 0.9174 - val_loss: 2.2127 - val_acc: 0.8294
Epoch 3/5
12288/12288 [==============================] - 14410s 1s/step - loss: 0.0840
 - acc: 0.9791 - val_loss: 2.2512 - val_acc: 0.8413
Epoch 4/5
12288/12288 [==============================] - 14620s 1s/step - loss: 0.1113
 - acc: 0.9832 - val_loss: 2.2832 - val_acc: 0.8415
Epoch 5/5
12288/12288 [==============================] - 18688s 2s/step - loss: 0.1647
 - acc: 0.9816 - val_loss: 2.2928 - val_acc: 0.8461

As you can see, after three epochs, you’ve reached 98% accuracy on training and 84% on test data. This is a massive improvement over the models you computed in chapter 6! It seems that training a larger network on real data paid off: your network learned to predict moves from 100 games almost perfectly, but generalizes reasonably well. You can be more than happy with the 84% validation accuracy. On the other hand, 100 games’ worth of moves is still a tiny data set, and you don’t know yet how well you’d do on a much larger corpus of games. After all, your goal is to build a strong Go bot that can compete with strong opponents, not to crush a toy data set.

To build a really strong opponent, you need to work with better Go data encoders next. Your one-plane encoder from chapter 6 is a good first guess, but it doesn’t capture the complexity that you’re dealing with. In section 7.4, you’ll learn about two more-sophisticated encoders that will boost your training performance.

7.4. Building more-realistic Go data encoders

Chapters 2 and 3 covered the ko rule in Go quite a bit. Recall that this rule exists to prevent infinite loops in games: you can’t play a stone that re-creates a board position that previously occurred. If we gave you a random Go board position and asked you to decide whether there’s a ko going on, you’d have to guess: there’s no way of knowing without having seen the sequence leading up to that position. In particular, your one-plane encoder, which encoded black stones as –1, white ones as 1, and empty positions as 0, can’t possibly learn anything about ko. This is just one example, but it goes to show that the OnePlaneEncoder you built in chapter 6 is a little too simplistic to capture everything you need to build a strong Go bot.

In this section, we’ll provide you with two more elaborate encoders that have led to relatively strong move-prediction performance in the literature. The first one, which we call SevenPlaneEncoder, consists of the following seven feature planes. Each plane is a 19 × 19 matrix and describes a different set of features:

  • The first plane has a 1 for every white stone that has precisely one liberty, and 0s otherwise.
  • The second and third feature planes have a 1 for white stones with two or at least three liberties, respectively.
  • The fourth to sixth planes do the same for black stones; they encode black stones with one, two, or at least three liberties.
  • The last feature plane marks points that can’t be played because of ko with a 1.

Apart from explicitly encoding the concepts of ko, with this set of features you also model liberties and distinguish between black and white stones. Stones with just one liberty have extra tactical significance, because they’re at risk of getting captured on the next turn. (Go players say that a stone with just one liberty is in atari.) Because the model can “see” this property directly, it’s easier for it to pick up on how that affects game play. By creating planes for concepts such as ko and the number of liberties, you give a hint to the model that these concepts are important, without having to explain how or why they’re important.

Let’s see how you can implement this by extending the base Encoder from the encoders module. Save the following code in sevenplane.py.

Listing 7.23. Initializing a simple seven-plane encoder
import numpy as np

from dlgo.encoders.base import Encoder
from dlgo.goboard import Move, Point


class SevenPlaneEncoder(Encoder):
    def __init__(self, board_size):
        self.board_width, self.board_height = board_size
        self.num_planes = 7

    def name(self):
        return 'sevenplane'

The interesting part is the encoding of the board position, which is done as follows.

Listing 7.24. Encoding game state with a SevenPlaneEncoder
    def encode(self, game_state):
        board_tensor = np.zeros(self.shape())
        base_plane = {game_state.next_player: 0,
                      game_state.next_player.other: 3}
        for row in range(self.board_height):
            for col in range(self.board_width):
                p = Point(row=row + 1, col=col + 1)
                go_string = game_state.board.get_go_string(p)
                if go_string is None:
                    if game_state.does_move_violate_ko(game_state.next_player,
                                                       Move.play(p)):
                        board_tensor[6][row][col] = 1          1
                else:
                    liberty_plane = min(3, go_string.num_liberties) - 1
                    liberty_plane += base_plane[go_string.color]
                    board_tensor[liberty_plane][row][col] = 1          2
        return board_tensor

  • 1 Encoding moves prohibited by the ko rule
  • 2 Encoding black and white stones with 1, 2, or more liberties

To finish this definition, you also need to implement a few convenience methods to satisfy the Encoder interface.

Listing 7.25. Implementing all other Encoder methods for your seven-plane encoder
    def encode_point(self, point):
        return self.board_width * (point.row - 1) + (point.col - 1)

    def decode_point_index(self, index):
        row = index // self.board_width
        col = index % self.board_width
        return Point(row=row + 1, col=col + 1)

    def num_points(self):
        return self.board_width * self.board_height

    def shape(self):
        return self.num_planes, self.board_height, self.board_width


def create(board_size):
    return SevenPlaneEncoder(board_size)

Another encoder we’ll discuss here, with code available on GitHub, uses 11 feature planes and is similar to SevenPlaneEncoder. In this encoder, called SimpleEncoder, which you can find under simple.py in the encoders module on GitHub, you use the following feature planes:

  • The first four feature planes describe black stones with one, two, three, or four liberties.
  • The second four planes describe white stones with one, two, three, or four liberties.
  • The ninth plane is set to 1 if it’s black’s turn, and the tenth if it’s white’s.
  • The last feature plane is again reserved for indicating ko.

This encoder with 11 planes is close to the last one, but is more explicit about whose turn it is and more specific about the number of liberties a stone has. Both are great encoders that will lead to notable improvements in model performance.

Throughout chapters 5 and 6, you learned about many techniques that improve your deep-learning models, but one ingredient remained the same for all experiments: you used stochastic gradient descent as the optimizer. Although SGD provides a great baseline, in the next section we’ll teach you about Adagrad and Adadelta, two optimizers that your training process will greatly benefit from.

7.5. Training efficiently with adaptive gradients

To further improve performance of your Go move-prediction models, we’ll introduce one last set of tools in this chapter—optimizers other than stochastic gradient descent. Recall from chapter 5 that SGD has a fairly simplistic update rule. If for a parameter W you receive a gradient ∂W from backpropagation and you have a learning rate of α specified, then updating this parameter with SGD simply means computing W ← W – α∂W.

In many cases, this update rule can lead to good results, but a few drawbacks exist as well. To address them, you can use many excellent extensions to plain SGD.

7.5.1. Decay and momentum in SGD

For instance, a widely used idea is to let the learning rate decay over time; with every update step you take, the learning rate becomes smaller. This technique usually works well, because in the beginning your network hasn’t learned anything yet, and big update steps might make sense to get closer to a minimum of the loss function. But after the training process has reached a certain level, you should make your updates smaller and make only appropriate refinements to the learning process that don’t spoil progress. Usually, you specify learning rate decay by a decay rate, a percentage by which you’ll decrease the next step.
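For instance, the classic Keras SGD optimizer applies the decay rate per update step, roughly as sketched below (an approximation for illustration; check your Keras version’s documentation for the exact schedule):

```python
def decayed_learning_rate(initial_lr, decay, step):
    # The effective learning rate shrinks as update steps accumulate,
    # mirroring the schedule used by classic (pre-2.0-style) Keras SGD.
    return initial_lr / (1.0 + decay * step)

for step in (0, 100, 1000):
    print(step, decayed_learning_rate(0.1, decay=0.01, step=step))
```

With a 1% decay rate, the learning rate is halved after 100 steps and roughly a tenth of its original value after 1,000 steps.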

Another popular technique is that of momentum, in which a fraction of the last update step is added to the current one. For instance, if W is a parameter vector that you want to update, ∂W is the current gradient computed for W, and U is the last update you applied, then the next update step will be as follows:

W ← W – α(γU + (1 – γ)∂W)

This fraction γ you keep from the last update is called the momentum term. If both gradient terms point in roughly the same direction, your next update step gets reinforced (receives momentum). If the gradients point in opposite directions, they cancel each other out and the gradient gets dampened. The technique is called momentum because of its similarity to the physical concept of the same name. Think about your loss function as a surface and the parameters lying on that surface as a ball. Then a parameter update describes movement of the ball. Because you’re doing gradient descent, you can even think of this as a ball rolling down the surface, receiving movements one by one. If the last few (gradient) steps all point in the same general direction, the ball picks up speed and reaches its destination, the minimum of the surface, more quickly. The momentum technique exploits this analogy.
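In code, momentum amounts to keeping one extra buffer per parameter. A minimal sketch of the update rule above, applied to a toy one-parameter loss:

```python
def momentum_step(w, grad, velocity, lr=0.1, gamma=0.9):
    """One momentum update: blend the previous update direction with the new gradient."""
    update = gamma * velocity + (1 - gamma) * grad
    return w - lr * update, update

w, velocity = 1.0, 0.0
for _ in range(3):
    grad = 2 * w  # gradient of the toy loss w**2
    w, velocity = momentum_step(w, grad, velocity)
print(w)  # moves steadily toward the minimum at 0
```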

If you want to use decay, momentum, or both in SGD with Keras, it’s as simple as providing the respective rates to an SGD instance. Let’s say you want SGD with a learning rate of 0.1, a 1% decay rate, and 90% momentum; you’d do the following.

Listing 7.26. Initializing SGD in Keras with momentum and learning rate decay
from keras.optimizers import SGD
sgd = SGD(lr=0.1, momentum=0.9, decay=0.01)

7.5.2. Optimizing neural networks with Adagrad

Both learning rate decay and momentum do a good job at refining plain SGD, but a few weaknesses still remain. For instance, if you think about the Go board, professionals will almost exclusively play their first few moves on the third to fifth lines of the board, but practically never on the first or second. In the endgame, the situation is somewhat reversed, in that many of the last moves happen at the border of the board. In all deep-learning models you worked with so far, the last layer was a dense layer of board size (here 19 × 19). Each neuron of this layer corresponds to a position on the board. If you use SGD, with or without momentum or decay, the same learning rate is used for each of these neurons. This can be dangerous. Maybe you did a poor job at shuffling the training data, and the learning rate has decayed so much that endgame moves on the first and second line don’t get any significant updates anymore—meaning, no learning. In general, you want to make sure that infrequently observed patterns still get large-enough updates, while frequent patterns receive smaller and smaller updates.

To address the problem caused by setting global learning rates, you can use techniques using adaptive gradient methods. We’ll show you two of these methods: Adagrad and Adadelta.

In Adagrad, there’s no global learning rate. You adapt the learning rate per parameter. Adagrad works pretty well when you have a lot of data and patterns in the data can be found only rarely. Both of these criteria apply to our situation: you have a lot of data, and professional Go game play is so complex that certain move combinations occur infrequently in your data set, although they’re considered standard play by professionals.

Let’s say you have a weight vector W of length l (it’s easier to think of vectors here, but this technique applies more generally to tensors as well) with individual entries Wi. For a given gradient ∂W for these parameters, in plain SGD with a learning rate of α, the update rule for each Wi is as follows:

Wi ← Wi – α∂Wi

In Adagrad, you replace α with a term that adapts dynamically for each index i by looking at how much you’ve updated Wi in the past. In fact, in Adagrad the individual learning rate will be inversely proportional to the previous updates. To be more precise, in Adagrad, you update parameters as follows:

Wi ← Wi – (α / √(Gi,i + ε))∂Wi

In this equation, ε is a small positive value to ensure you’re not dividing by 0, and Gi,i is the sum of squared gradients Wi received until this point. We write this as Gi,i because you can view this term as part of a square matrix G of length l in which all diagonal entries Gj,j have the form we just described and all off-diagonal terms are 0. A matrix of this form is called a diagonal matrix. You update G after each parameter update, by adding the latest gradient contributions to the diagonal elements. That’s all there is to defining Adagrad, but if you want to write this update rule in a concise form independent of the index i, this is the way to do it:

W ← W – (α / √(G + ε)) · ∂W

Note that because G is a matrix, you need to add ε to each diagonal entry Gi,i and divide α by each such entry. Moreover, by G · ∂W you mean matrix multiplication of G with ∂W. To use Adagrad with Keras, compiling a model with this optimizer works as follows.

Listing 7.27. Using the Adagrad optimizer for Keras models
from keras.optimizers import Adagrad
adagrad = Adagrad()

A key benefit of Adagrad over other SGD techniques is that you don’t have to manually set the learning rate—one less thing to worry about. It’s hard enough already to find a good network architecture and tune all the other parameters of the model. You could alter the initial learning rate in Keras by using Adagrad(lr=0.02), but doing so isn’t recommended.
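To build intuition for these per-parameter learning rates, here’s the Adagrad rule spelled out in NumPy (a sketch for illustration, not Keras’s internal implementation):

```python
import numpy as np

def adagrad_step(w, grad, squared_grad_sum, lr=0.1, eps=1e-8):
    """One Adagrad update: each parameter gets its own effective learning rate."""
    squared_grad_sum = squared_grad_sum + grad ** 2
    w = w - lr * grad / np.sqrt(squared_grad_sum + eps)
    return w, squared_grad_sum

w = np.array([1.0, 1.0])
g_sum = np.zeros(2)
for _ in range(2):
    # The first parameter always sees gradients 100x larger than the second...
    w, g_sum = adagrad_step(w, np.array([10.0, 0.1]), g_sum)
print(w)  # ...yet both parameters end up taking steps of about the same size
```

Because each gradient is divided by the root of its own accumulated history, parameters with rare, small gradients still receive meaningful updates.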

7.5.3. Refining adaptive gradients with Adadelta

Adadelta is an optimizer that’s similar to Adagrad and extends it. Instead of accumulating all past squared gradients in G, Adadelta uses the same idea you saw in the momentum technique: it keeps only a fraction of the accumulated history and adds the current squared gradient to it:

G ← γG + (1 – γ)(∂W)²

Although this idea is roughly what happens in Adadelta, the details that make this optimizer work and that lead to its precise update rule are a little too intricate to present here. We recommend that you look into the original paper for more details (https://arxiv.org/abs/1212.5701).
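One building block is easy to show, though: the decaying accumulator itself (this sketch covers only the running average above, not Adadelta’s full update rule):

```python
def update_running_average(avg, grad, gamma=0.9):
    # Exponentially decaying average of squared gradients,
    # as used by Adadelta (and RMSprop).
    return gamma * avg + (1 - gamma) * grad ** 2

avg = 0.0
for g in [1.0, 1.0, 1.0]:
    avg = update_running_average(avg, g)
print(avg)  # roughly 0.271 after three steps; approaches 1.0 in the limit
```

Unlike Adagrad’s ever-growing sum, this average forgets old gradients, so the effective learning rate doesn’t shrink toward zero over a long training run.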

In Keras, you use the Adadelta optimizer as follows.

Listing 7.28. Using the Adadelta optimizer for Keras models
from keras.optimizers import Adadelta
adadelta = Adadelta()

Both Adagrad and Adadelta are hugely beneficial to training deep neural networks on Go data, as compared to stochastic gradient descent. In later chapters, you’ll often use one or the other as an optimizer in more-advanced models.

7.6. Running your own experiments and evaluating performance

Throughout chapters 5, 6, and this one, we’ve shown you many deep-learning techniques. We gave you some hints and sample architectures that made sense as a baseline, but now it’s time to train your own models. In machine-learning experiments, it’s crucial to try various combinations of hyperparameters, such as the number of layers, which layers to choose, how many epochs to train for, and so on. In particular, with deep neural networks, the number of choices you face can be overwhelming. It’s not always clear how tweaking a specific knob will impact model performance. Deep-learning researchers can rely on a large corpus of experimental results and further theoretical arguments from decades of research to back their intuition. We can’t provide you with that deep a level of knowledge here, but we can help get you started building intuition of your own.

A crucial factor in achieving strong results in experimental setups such as ours—namely, training a neural network to predict Go moves as well as possible—is a fast experimentation cycle. The time it takes you to build a model architecture, start model training, observe and evaluate performance metrics, and then go back to adjust your model and start the process anew has to be short. When you look at data science challenges such as those hosted on kaggle.com, it’s often the teams who tried the most that win. Luckily for you, Keras was built with fast experimentation in mind. It’s also one of the prime reasons we chose it as the deep-learning framework for this book. We hope you agree that you can build neural networks with Keras quickly and that changing your experimental setup comes naturally.

7.6.1. A guideline to testing architectures and hyperparameters

Let’s have a look at a few practical considerations when building a move-prediction network:

  • Convolutional neural networks are a good candidate for Go move-prediction networks. Make sure to convince yourself that working with only dense layers will result in inferior prediction quality. Building a network that consists of several convolutional layers and one or two dense layers at the end is usually a must. In later chapters, you’ll see more-complex architectures, but for now, work with convolutional networks.
  • In your convolutional layers, vary the kernel sizes to see how this change influences model performance. As a rule of thumb, kernel sizes between 2 and 7 are suitable, and you shouldn’t go much larger than that.
  • If you use pooling layers, make sure to experiment with both max and average pooling, but more important, don’t choose a pooling size that’s too large. A practical upper bound in your situation is 3. You may also want to try building networks without pooling layers at all, which can be computationally more expensive but often works quite well.
  • Use dropout layers for regularization. In chapter 6, you saw how dropout can be used to prevent your model from overfitting. Your networks will generally benefit from adding in dropout layers, as long as you don’t use too many of them and don’t set the dropout rate too high.
  • Use softmax activation in your last layer to produce a probability distribution over moves, and pair it with categorical cross-entropy loss, which suits this prediction problem very well.
  • Experiment with different activation functions. We’ve introduced you to ReLU, which should act as your default choice for now, and sigmoid activations. You can use plenty of other activation functions in Keras, such as elu, selu, PReLU, and LeakyReLU. We can’t discuss these ReLU variants here, but their usage is well described at https://keras.io/activations/.
  • Varying mini-batch size has an impact on model performance. In prediction problems such as MNIST from chapter 5, it’s usually recommended to choose mini-batches on the same order of magnitude as the number of classes; for MNIST, you often see mini-batch sizes ranging from 10 to 50. If the data is perfectly randomized, each gradient then receives information from each class, which generally makes SGD perform better. In our use case, some Go moves are played much more often than others. For instance, the four corners of the board are rarely played, especially compared with the star points. We call this a class imbalance in our data. In this case, you can’t expect a mini-batch to cover all the classes, and should work with mini-batch sizes ranging from 16 to 256 (which is what you find in the literature).
  • The choice of optimizer also has a considerable impact on how well your network learns. SGD with or without learning-rate decay, as well as Adagrad and Adadelta, already gives you options to experiment with. Under https://keras.io/optimizers/ you’ll find other optimizers that your model training process might benefit from.
  • The number of epochs used to train a model has to be chosen appropriately. If you use model checkpointing and track various performance metrics per epoch, you can effectively measure when training stops improving. In the next and final section of this chapter, we briefly discuss how to evaluate performance metrics. As a general rule of thumb, given enough compute power, set the number of epochs too high rather than too low. If model training stops improving or even gets worse through overfitting, you can still take an earlier checkpoint model for your bot.
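
A reasonable starting point that follows these guidelines might look like the following sketch. The layer sizes, kernel sizes, and dropout rate here are illustrative choices, not a recommendation, and we assume a simple one-plane board encoder; treat everything as a knob to experiment with:

```python
from keras.models import Sequential
from keras.layers import Conv2D, Dense, Dropout, Flatten

# A 19x19 board encoded with a single feature plane (illustrative).
board_rows, board_cols, num_planes = 19, 19, 1

model = Sequential()
# Several convolutional layers with small kernels, plus dropout
# for regularization, as suggested in the guidelines above.
model.add(Conv2D(48, (3, 3), padding='same', activation='relu',
                 input_shape=(board_rows, board_cols, num_planes)))
model.add(Dropout(rate=0.5))
model.add(Conv2D(48, (3, 3), padding='same', activation='relu'))
model.add(Dropout(rate=0.5))
model.add(Flatten())
model.add(Dense(512, activation='relu'))
# One output per board point; softmax turns scores into a
# probability distribution over moves.
model.add(Dense(board_rows * board_cols, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adadelta',
              metrics=['accuracy'])
```

From here, you can vary kernel sizes, layer counts, dropout rates, and the optimizer one at a time and compare validation metrics across runs.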
Weight initializers

Another crucial aspect for tuning deep neural networks is how to initialize the weights before training starts. Because optimizing a network means finding a set of weights corresponding to a minimum on the loss surface, the weights you start with are important. In your network implementation from chapter 5, you randomly assigned initial weights, which is generally a bad idea.

Weight initializations are an interesting topic of research and almost deserve a chapter of their own. Keras has many weight initialization schemes, and each layer with weights can be initialized accordingly. The reason we don’t cover them in the main text is that the initializers Keras chooses by default are usually so good that it’s not worth bothering to change them. Usually, it’s other aspects of your network definition that require attention. But it’s good to know that there are differences, and advanced users might want to experiment with Keras initializers, found at https://keras.io/initializers/, as well.
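
For the curious, overriding an initializer in Keras is a one-argument change on any weighted layer. In this sketch, the layer size is arbitrary, and we pass `'glorot_uniform'` explicitly even though it's already the Dense-layer default, just to show where the knob lives:

```python
import numpy as np
from keras.layers import Dense

# 'glorot_uniform' is the default initializer for Dense layers, so this
# is equivalent to leaving the argument out; swap in another scheme from
# https://keras.io/initializers/ to experiment.
layer = Dense(512, activation='relu',
              kernel_initializer='glorot_uniform')
# Calling the layer on an input builds (and initializes) its weights.
output = layer(np.zeros((1, 361), dtype='float32'))
```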

7.6.2. Evaluating performance metrics for training and test data

In section 7.3, we showed you results of a training run performed on a small data set. The network we used was a relatively small convolutional network, and we trained this network for five epochs. In this experiment, we tracked loss and accuracy on training data and used test data for validation. At the end, we computed accuracy on test data. That’s the general workflow you should follow, but how do you judge when to stop training or detect when something is off? Here are a few guidelines:

  • Your training accuracy and loss should generally improve for each epoch. In later epochs, these metrics will taper off and sometimes fluctuate a little. If you don’t see any improvement for a few epochs, you might want to stop.
  • At the same time, keep an eye on your validation loss and accuracy. In early epochs, validation loss drops consistently, but in later epochs it often plateaus and then starts to increase again. That’s a sure sign that the network is starting to overfit the training data.
  • If you use model checkpointing, pick the model from the epoch with the lowest validation loss whose training accuracy is still high.
  • If both training and validation loss are high, try a deeper network architecture or different hyperparameters.
  • If your training error is low but your validation error is high, your model is overfitting. This scenario usually doesn’t occur when you have a truly large training data set. With more than 170,000 Go games and many millions of moves to learn from, you should be fine.
  • Choose a training data size that makes sense for your hardware requirements. If training an epoch takes more than a few hours, it’s just not that much fun. Instead, try to find a well-performing model among many tries on a medium-sized data set and then train this model once again on the largest data set possible.
  • If you don’t have a good GPU, you might want to opt for training your model in the cloud. In appendix D, we’ll show you how to train a model on a GPU using Amazon Web Services (AWS).
  • When comparing runs, don’t stop a run that looks worse than a previous run too early. Some learning processes are slower than others—and might eventually catch up or even outperform other models.
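
The checkpointing and stop-when-plateauing advice above maps directly onto Keras callbacks. A minimal sketch, assuming hypothetical training variables `X_train`, `y_train`, `X_test`, and `y_test` and an already compiled `model`:

```python
from keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    # Write a checkpoint after every epoch; the epoch number in the
    # filename lets you go back to whichever epoch had the best
    # validation metrics.
    ModelCheckpoint('model_epoch_{epoch}.h5'),
    # Stop training once validation loss hasn't improved for
    # three consecutive epochs.
    EarlyStopping(monitor='val_loss', patience=3),
]
# model.fit(X_train, y_train, validation_data=(X_test, y_test),
#           epochs=100, callbacks=callbacks)
```

With this setup, you can safely set the epoch count high: training halts on its own when validation loss plateaus, and the saved checkpoints let you recover the best intermediate model.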

You might ask yourself how strong a bot you can potentially build with the methods presented in this chapter. A theoretical upper bound is this: the network can never get better at playing Go than the data you feed it. In particular, using just supervised deep-learning techniques, as you did in the last three chapters, won’t surpass human game play. In practice, with enough compute power and time, it’s definitely possible to reach results up to about 2 dan level.

To reach superhuman game play, you need to work with reinforcement-learning techniques, introduced in chapters 9 to 12. Afterward, you can combine tree search from chapter 4, reinforcement learning, and supervised deep learning to build even stronger bots in chapters 13 and 14.

But before you go deeper into the methodology of building stronger bots, in the next chapter we’ll show you how to deploy a bot and let it interact with its environment by playing against either human opponents or other bots.

7.7. Summary

  • The ubiquitous Smart Game Format (SGF) for Go and other game records is useful for building training data for neural networks.
  • Go data can be processed in parallel for speed and efficiently represented as generators.
  • With strong amateur-to-professional game records, you can build deep-learning models that predict Go moves quite well.
  • If you know certain properties of your training data that are important, you can explicitly encode them in feature planes. Then the model can quickly learn connections between the feature planes and the results you’re trying to predict. For a Go bot, you can add feature planes that represent concepts such as the number of liberties (adjacent empty points) a string of stones has.
  • You can train more efficiently by using adaptive gradient techniques such as Adagrad or Adadelta. These algorithms adjust the learning rate on the fly as training progresses.
  • End-to-end model training can be achieved in a relatively small script that you can use as a template for your own experiments.