Chapter 6. Designing a neural network for Go data

This chapter covers

  • Building a deep-learning application that can predict the next Go move from data
  • Introducing the Keras deep-learning framework
  • Understanding convolutional neural networks
  • Building neural networks to analyze spatial Go data

In the preceding chapter, you saw the fundamental principles of neural networks in action and implemented feed-forward networks from scratch. In this chapter, you’ll turn your attention back to the game of Go and tackle the problem of how to use deep-learning techniques to predict the next move for any given board situation of a Go game. In particular, you’ll generate Go game data with tree-search techniques developed in chapter 4 that you can then use to train a neural network. Figure 6.1 gives an overview of the application you’re going to build in this chapter.

Figure 6.1. How to predict the next move in a game of Go by using deep learning

As figure 6.1 illustrates, before you can apply your working knowledge of neural networks from the preceding chapter to Go, you have to address a few critical steps:

  1. In chapter 3, you focused on teaching a machine the rules of Go by implementing game play on a Go board. Chapter 4 used these structures for tree search. But in chapter 5, you saw that neural networks need numerical input; for the feed-forward architecture you implemented, vectors are required.
  2. To transform a Go board position into an input vector to be fed into a neural network, you have to create an encoder to do the job. In figure 6.1, we sketched a simple encoder that you’ll implement in section 6.1; the board is encoded as a matrix of board size, white stones are represented as –1, black stones as 1, and empty points as 0. This matrix can be flattened to a vector, just as you did with MNIST data in the preceding chapter. Although this representation is a little too simple to provide excellent results for move prediction, it’s a first step in the right direction. In chapter 7, you’ll see more-sophisticated and useful ways to encode the board.
  3. To train a neural network to predict moves, you first have to get your hands on data to feed into it. In section 6.2, you’ll pick up the techniques from chapter 4 to generate game records. You’ll encode each board position as just discussed, which will serve as your features, and store the next move for each position as labels.
  4. Although it’s useful to have implemented a neural network as you did in chapter 5, it’s now equally important to gain more speed and reliability by introducing a more mature deep-learning library. To this end, section 6.3 introduces Keras, a popular deep-learning library written in Python. You’ll use Keras to model a network for move prediction.
  5. At this point, you might be wondering why you completely discard the spatial structure of the Go board by flattening the encoded board to a vector. In section 6.4, you’ll learn about a new layer type called a convolutional layer that’s much better suited for your use case. You’ll use these layers to build a new architecture called a convolutional neural network.
  6. Toward the end of the chapter, you’ll get to know more key concepts of modern deep learning that will further increase move-prediction accuracy, such as efficiently predicting probabilities with softmax in section 6.5 or building deeper neural networks in section 6.6 with an interesting activation function called a rectified linear unit (ReLU).

6.1. Encoding a Go game position for neural networks

In chapter 3, you built a library of Python classes that represented all the entities in a game of Go: Player, Board, GameState, and so on. Now you want to apply machine learning to problems in Go. But mathematical models like neural networks can’t operate on high-level objects like our GameState class; they can deal with only mathematical objects, such as vectors and matrices. In this section, you’ll create an Encoder class that translates your native game objects to a mathematical form. Throughout the rest of the chapter, you can feed that mathematical representation to your machine-learning tools.

The first step toward building a deep-learning model for Go move prediction is to load data that can be fed into a neural network. You do this by defining a simple encoder for the Go board, introduced in figure 6.1. An encoder transforms the Go board you implemented in chapter 3 into a numerical form that a network can consume. The neural networks you’ve learned about to this point, multilayer perceptrons, take vectors as inputs, but in section 6.4 you’ll see another network architecture that operates on higher-dimensional data. Figure 6.2 gives you an idea how such an encoder could be defined.

Figure 6.2. An illustration of the Encoder class. It takes your GameState class and translates it into a mathematical form—a NumPy array.

At its core, an encoder has to know how to encode a full Go game state. In particular, it should define how to encode a single point on the board. Sometimes the inverse is also interesting: if you’ve predicted the next move with a network, that move will be encoded, and you need to translate it back to an actual move on the board. This operation, called decoding, is integral to applying predicted moves.

With this in mind, you can now define your Encoder class, an interface for the encoders that you’ll create in this and the next chapter. You’ll define a new module in dlgo called encoders, which you’ll initialize with an empty __init__.py, and put the file base.py in it. Then you’ll put the following definition in that file.

Listing 6.1. Abstract Encoder class to encode Go game state
class Encoder:
    def name(self):                      1
        raise NotImplementedError()

    def encode(self, game_state):        2
        raise NotImplementedError()

    def encode_point(self, point):       3
        raise NotImplementedError()

    def decode_point_index(self, index): 4
        raise NotImplementedError()

    def num_points(self):                5
        raise NotImplementedError()

    def shape(self):                     6
        raise NotImplementedError()

  • 1 Lets you support logging or saving the name of the encoder your model is using
  • 2 Turns a Go board into numeric data
  • 3 Turns a Go board point into an integer index
  • 4 Turns an integer index back into a Go board point
  • 5 Number of points on the board—board width times board height
  • 6 Shape of the encoded board structure

The definition of encoders is straightforward, but we want to add one more convenience feature to base.py: a function to create an encoder by its name, a string, instead of creating an object explicitly. You do this with the get_encoder_by_name function, which you append to base.py after the Encoder class.

Listing 6.2. Referencing Go board encoders by name
import importlib


def get_encoder_by_name(name, board_size):                      1
    if isinstance(board_size, int):
        board_size = (board_size, board_size)                   2
    module = importlib.import_module('dlgo.encoders.' + name)
    constructor = getattr(module, 'create')                     3
    return constructor(board_size)

  • 1 You can create encoder instances by referencing their name.
  • 2 If board_size is one integer, you create a square board from it.
  • 3 Each encoder implementation will have to provide a “create” function that provides an instance.
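
To see the name-based lookup in isolation, here is a minimal sketch that runs without dlgo. It registers a throwaway in-memory module and resolves its create function by name, following the same pattern as listing 6.2. The names demo_encoders, toy, and get_factory_by_name are hypothetical and exist only for this illustration.

```python
import importlib
import sys
import types


def get_factory_by_name(package, name, board_size):
    """Same lookup pattern as get_encoder_by_name, with the package as a parameter."""
    if isinstance(board_size, int):
        board_size = (board_size, board_size)
    module = importlib.import_module(package + '.' + name)
    constructor = getattr(module, 'create')
    return constructor(board_size)


def _create_toy(board_size):
    # A real encoder module would return an Encoder instance here.
    return {'name': 'toy', 'board_size': board_size}


# Build a throwaway in-memory package so the sketch runs without dlgo.
# 'demo_encoders' and 'toy' are hypothetical names used only here.
package = types.ModuleType('demo_encoders')
package.__path__ = []  # marks the module as a package
toy = types.ModuleType('demo_encoders.toy')
toy.create = _create_toy
sys.modules['demo_encoders'] = package
sys.modules['demo_encoders.toy'] = toy

encoder = get_factory_by_name('demo_encoders', 'toy', 9)
print(encoder)
```

The point of the pattern is that a script can choose an encoder from a plain string, such as a command-line argument, without importing every encoder class up front.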

Now that you know what an encoder is and how to build one, let’s implement the idea from figure 6.2 as your first encoder: one color is represented as 1, the other as –1, and empty points as 0. To make accurate predictions, the model also needs to know whose turn it is. So instead of using 1 for black and –1 for white, you’ll use 1 for whoever has the next turn, and –1 for the opponent. You’ll call this OnePlaneEncoder, because you encode the Go board into a single matrix or plane of the same size as the board. In chapter 7, you’ll see encoders with more feature planes; for instance, you’ll implement an encoder that has one plane each for black and white stones, and one plane to capture ko. Right now, you’ll stick with our simple one-plane encoding idea that you implement in oneplane.py in the encoders module. The following listing shows the first part.

Listing 6.3. Encoding game state with a simple one-plane Go board encoder
import numpy as np

from dlgo.encoders.base import Encoder
from dlgo.goboard import Point


class OnePlaneEncoder(Encoder):
    def __init__(self, board_size):
        self.board_width, self.board_height = board_size
        self.num_planes = 1

    def name(self):                             1
        return 'oneplane'

    def encode(self, game_state):               2
        board_matrix = np.zeros(self.shape())
        next_player = game_state.next_player
        for r in range(self.board_height):
            for c in range(self.board_width):
                p = Point(row=r + 1, col=c + 1)
                go_string = game_state.board.get_go_string(p)
                if go_string is None:
                    continue
                if go_string.color == next_player:
                    board_matrix[0, r, c] = 1
                else:
                    board_matrix[0, r, c] = -1
        return board_matrix

  • 1 You can reference this encoder by the name oneplane.
  • 2 To encode, you fill a matrix with 1 if the point contains one of the current player’s stones, –1 if the point contains the opponent’s stones, and 0 if the point is empty.
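
The perspective flip is easiest to see on a tiny board. The following standalone sketch doesn't use the dlgo classes: the stones dictionary and the encode_one_plane helper are hypothetical stand-ins for GameState and OnePlaneEncoder.encode, assumed here only to show how the same position encodes differently depending on who moves next.

```python
import numpy as np


def encode_one_plane(stones, next_player, board_size=(3, 3)):
    """Encode stones as a (1, height, width) array from next_player's view.

    `stones` maps 1-based (row, col) tuples to 'black' or 'white'; it is a
    stand-in for the board lookups done through GameState in listing 6.3.
    """
    height, width = board_size
    plane = np.zeros((1, height, width))
    for (row, col), color in stones.items():
        plane[0, row - 1, col - 1] = 1 if color == next_player else -1
    return plane


stones = {(1, 1): 'black', (2, 2): 'white'}
black_view = encode_one_plane(stones, next_player='black')
white_view = encode_one_plane(stones, next_player='white')

print(black_view[0])
print(white_view[0])
```

The two encodings are exact negatives of each other, which is what lets a single plane carry both the stone positions and the side to move.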

In the second part of the definition, you’ll take care of encoding and decoding single points of the board. Encoding maps a point on the board to an integer index into a vector of length board width times board height; decoding recovers the point coordinates from such an index.

Listing 6.4. Encoding and decoding points with your one-plane Go board encoder
    def encode_point(self, point):                                   1
        return self.board_width * (point.row - 1) + (point.col - 1)

    def decode_point_index(self, index):                             2
        row = index // self.board_width
        col = index % self.board_width
        return Point(row=row + 1, col=col + 1)

    def num_points(self):
        return self.board_width * self.board_height

    def shape(self):
        return self.num_planes, self.board_height, self.board_width

  • 1 Turns a board point into an integer index
  • 2 Turns an integer index into a board point
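
A quick way to convince yourself that the index arithmetic in listing 6.4 is consistent is a round-trip check. The sketch below reimplements the two methods as plain functions on a deliberately non-square board; the namedtuple Point is a stand-in for dlgo.goboard.Point, and the board dimensions are arbitrary values chosen for illustration.

```python
from collections import namedtuple

# Stand-in for dlgo.goboard.Point; rows and columns are 1-based.
Point = namedtuple('Point', 'row col')

BOARD_WIDTH, BOARD_HEIGHT = 5, 3  # deliberately non-square

# Listing 6.2 looks up a module-level `create` in each encoder module, so
# oneplane.py presumably also needs something like:
#
#     def create(board_size):
#         return OnePlaneEncoder(board_size)


def encode_point(point):
    return BOARD_WIDTH * (point.row - 1) + (point.col - 1)


def decode_point_index(index):
    return Point(row=index // BOARD_WIDTH + 1, col=index % BOARD_WIDTH + 1)


all_points = [Point(r, c)
              for r in range(1, BOARD_HEIGHT + 1)
              for c in range(1, BOARD_WIDTH + 1)]
indices = [encode_point(p) for p in all_points]
round_trip = [decode_point_index(i) for i in indices]

print(indices == list(range(BOARD_WIDTH * BOARD_HEIGHT)))
print(round_trip == all_points)
```

Both checks print True: every point gets a distinct index in row-major order, and decoding returns the original point.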

This concludes our section on Go board encoders. You can now move on to create data that you can encode and feed into a neural network.

6.2. Generating tree-search games as network training data

Before you can apply machine learning to Go games, you need a set of training data. Fortunately, strong players are playing on public Go servers all the time. Chapter 7 covers how to find and process those game records to create training data. For now, you can generate your own game records. This section shows how to use the tree-search bots you created in chapter 4 to generate game records. In the rest of the chapter, you can use those bot game records as training data to experiment with deep learning.

Does it seem silly to use machine learning to imitate a classical algorithm? Not if the traditional algorithm is slow! Here you hope to use machine learning to get a fast approximation to a slow tree search. This concept is a key part of AlphaGo Zero, the strongest version of AlphaGo. Chapter 14 covers how AlphaGo Zero works.

Go ahead and create a file called generate_mcts_games.py outside the dlgo module. As the filename suggests, you’ll write code that generates games with MCTS. Each move in each of these games will then be encoded with your OnePlaneEncoder from section 6.1 and stored in NumPy arrays for future use. To begin with, put the following import statements at the top of generate_mcts_games.py.

Listing 6.5. Imports for generating encoded Monte Carlo tree-search game data
import argparse
import numpy as np

from dlgo.encoders.base import get_encoder_by_name
from dlgo import goboard_fast as goboard
from dlgo import mcts
from dlgo.utils import print_board, print_move

From these imports, you can already see which tools you’ll use for the job: the mcts module, your goboard implementation from chapter 3, and the encoders module you just defined. Let’s move on to creating the function that’ll generate the game data for you. In generate_game, you let an instance of an MCTSAgent from chapter 4 play games against itself (recall from chapter 4 that the temperature of an MCTS agent regulates the volatility of your tree search). For each move, you encode the board state before the move has been played, encode the move as a one-hot vector, and then apply the move to the board.

Listing 6.6. Generating MCTS games for this chapter
def generate_game(board_size, rounds, max_moves, temperature):
    boards, moves = [], []                                      1

    encoder = get_encoder_by_name('oneplane', board_size)       2

    game = goboard.GameState.new_game(board_size)               3

    bot = mcts.MCTSAgent(rounds, temperature)                   4

    num_moves = 0
    while not game.is_over():
        print_board(game.board)
        move = bot.select_move(game)                            5
        if move.is_play:
            boards.append(encoder.encode(game))                 6

            move_one_hot = np.zeros(encoder.num_points())
            move_one_hot[encoder.encode_point(move.point)] = 1
            moves.append(move_one_hot)                          7

        print_move(game.next_player, move)
        game = game.apply_move(move)                            8
        num_moves += 1
        if num_moves > max_moves:                               9
            break

    return np.array(boards), np.array(moves)

  • 1 In boards you store encoded board state; moves is for encoded moves.
  • 2 Initialize a OnePlaneEncoder by name with given board size.
  • 3 A new game of size board_size is instantiated.
  • 4 A Monte Carlo tree-search agent with specified number of rounds and temperature will serve as your bot.
  • 5 The next move is selected by the bot.
  • 6 The encoded board situation is appended to boards.
  • 7 The one-hot-encoded next move is appended to moves.
  • 8 Afterward, the bot move is applied to the board.
  • 9 You continue with the next move, unless the maximum number of moves has been reached.
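
The label built for each move in listing 6.6 is a one-hot vector over all board points. As a small standalone sketch, assuming a 9 × 9 board and an arbitrary example move, this is what the label looks like and how the move can be recovered from it again:

```python
import numpy as np

board_width, board_height = 9, 9
num_points = board_width * board_height

# Suppose the bot plays at row 3, column 7 (1-based, as in dlgo's Point).
row, col = 3, 7
index = board_width * (row - 1) + (col - 1)

move_one_hot = np.zeros(num_points)
move_one_hot[index] = 1

# Recover the move from the label, the way you'd decode a prediction.
recovered_index = int(np.argmax(move_one_hot))
recovered = (recovered_index // board_width + 1,
             recovered_index % board_width + 1)
print(index, recovered)
```

The same argmax-then-decode step is how you would later turn a network's output vector back into a point on the board.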

Now that you have the means to create and encode game data with Monte Carlo tree search, you can define a main function to run a few games and persist them afterward, which you can also put into generate_mcts_games.py.

Listing 6.7. Main application for generating MCTS games for this chapter
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--board-size', '-b', type=int, default=9)
    parser.add_argument('--rounds', '-r', type=int, default=1000)
    parser.add_argument('--temperature', '-t', type=float, default=0.8)
    parser.add_argument('--max-moves', '-m', type=int, default=60,
                        help='Max moves per game.')
    parser.add_argument('--num-games', '-n', type=int, default=10)
    parser.add_argument('--board-out')
    parser.add_argument('--move-out')

    args = parser.parse_args()              1
    xs = []
    ys = []

    for i in range(args.num_games):
        print('Generating game %d/%d...' % (i + 1, args.num_games))
        x, y = generate_game(args.board_size, args.rounds, args.max_moves,
                             args.temperature)                        2
        xs.append(x)
        ys.append(y)

    x = np.concatenate(xs)                  3
    y = np.concatenate(ys)

    np.save(args.board_out, x)              4
    np.save(args.move_out, y)


if __name__ == '__main__':
    main()

  • 1 This application allows customization via command-line arguments.
  • 2 For the specified number of games, you generate game data.
  • 3 After all games have been generated, you concatenate features and labels, respectively.
  • 4 You store feature and label data to separate files, as specified by the command-line options.

Using this tool, you can now generate game data easily. Let’s say you want to create data for twenty 9 × 9 Go games and store features in features.npy, and labels in labels.npy. The following command will do it:

python generate_mcts_games.py -n 20 --board-out features.npy --move-out labels.npy

Note that generating games like this can be fairly slow, so generating a lot of games will take a while. You could always decrease the number of rounds for MCTS, but this also decreases the bot’s level of play. Therefore, we generated game data for you already that you can find in the GitHub repo under generated_games. You can find the output in features-40k.npy and labels-40k.npy; it contains about 40,000 moves over several hundred games. We generated these with 5,000 MCTS rounds per move. At that setting, the MCTS engine mostly plays sensible moves, so we can reasonably hope that a neural network can learn to imitate it.
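
Whichever files you end up using, it's worth checking that features and labels line up before training. The following sketch uses synthetic zero-filled arrays in place of real game data and temporary files in place of features.npy and labels.npy; it only illustrates how per-game arrays are concatenated, saved, and reloaded, and which shapes to expect for a 9 × 9 board.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-ins for two generated games with 3 and 2 recorded moves.
xs = [np.zeros((3, 1, 9, 9)), np.zeros((2, 1, 9, 9))]
ys = [np.zeros((3, 81)), np.zeros((2, 81))]

x = np.concatenate(xs)   # joins along the first (sample) axis
y = np.concatenate(ys)

with tempfile.TemporaryDirectory() as tmp_dir:
    board_path = os.path.join(tmp_dir, 'features.npy')
    move_path = os.path.join(tmp_dir, 'labels.npy')
    np.save(board_path, x)
    np.save(move_path, y)
    features = np.load(board_path)
    labels = np.load(move_path)

print(features.shape, labels.shape)
```

The first dimension counts recorded moves and must match between the two files; the remaining dimensions come from the encoder's shape and its number of points.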

At this point, you’ve done all the preprocessing you need in order to apply a neural network to your generated data. You could do this with your network implementation from chapter 5 in a straightforward manner—and it’s a good exercise to do so—but going forward, you need a more powerful tool to satisfy your needs to work with increasingly complex deep neural networks. To this end, we introduce Keras next.

6.3. Using the Keras deep-learning library

Computing gradients and the backward pass of a neural network is becoming more and more of a lost art form because of the emergence of many powerful deep-learning libraries that hide lower-level details. It’s good to have implemented neural networks from scratch in the previous chapter, but now it’s time to move on to more mature and feature-rich software.

The Keras deep-learning library is a particularly elegant and popular deep-learning tool written in Python. The open source project was created in 2015 and quickly accumulated a strong user base. The code is hosted at https://github.com/keras-team/keras and has excellent documentation that can be found at https://keras.io.

6.3.1. Understanding Keras design principles

One of the strong suits of Keras is that it’s an intuitive and easy-to-pick-up API that allows for quick prototyping and a fast experimentation cycle. This makes Keras a popular pick in many data science challenges, such as on https://kaggle.com. Keras is built from modular building blocks and was originally inspired by other deep-learning tools such as Torch. Another big plus for Keras is its extensibility. Adding new custom layers or augmenting existing functionality is relatively straightforward.

Another aspect that makes Keras easy to get started with is that it comes with batteries included. For instance, many popular data sets, like MNIST, can be loaded directly with Keras, and you can find a lot of good examples in the GitHub repository. On top of that, there’s a whole community-built ecosystem of Keras extensions and independent projects at https://github.com/fchollet/keras-resources.

A distinctive feature of Keras is the concept of backends: it runs with powerful engines that can be swapped on demand. One way to think of Keras is as a deep-learning frontend, a library that provides a convenient set of high-level abstractions and functionality to run your models, but is backed by a choice of backend that does the heavy lifting in the background. As of the writing of this book, three official backends are available for Keras: TensorFlow, Theano, and the Microsoft Cognitive Toolkit. In this book, you’ll work with Google’s TensorFlow library exclusively, which is also the default backend used by Keras. But if you prefer another backend, you shouldn’t need much effort to switch; Keras handles most of the differences for you.

In this section, you’ll first install Keras. Then you’ll learn about its API by running the handwritten digit classification example from chapter 5 with it, and then move on to the task of Go move prediction.

6.3.2. Installing the Keras deep-learning library

To get started with Keras, you need to install a backend first. You can start with TensorFlow, which is most easily installed through pip by running the following:

pip install tensorflow

If your machine has an NVIDIA GPU and current CUDA drivers installed, you can try installing the GPU-accelerated version of TensorFlow instead:

pip install tensorflow-gpu

If tensorflow-gpu is compatible with your hardware and drivers, that’ll give you a huge speed improvement.

A few optional dependencies that are helpful for model serialization and visualization can be installed for Keras, but you’ll skip them for now and directly proceed to installing the library itself:

pip install Keras

6.3.3. Running a familiar first example with Keras

In this section, you’ll see that defining and running Keras models follows a four-step workflow:

  1. Data preprocessing: Load and prepare a data set to be fed into a neural network.
  2. Model definition: Instantiate a model and add layers to it as needed.
  3. Model compilation: Compile your previously defined model with an optimizer, a loss function, and an optional list of evaluation metrics.
  4. Model training and evaluation: Fit your deep-learning model to data and evaluate it.

To get started with Keras, we walk you through an example use case that you encountered in the preceding chapter: predicting handwritten digits with the MNIST data set. As you’ll see, our simple model from chapter 5 is remarkably close to the Keras syntax already, so using Keras should come even easier.

With Keras, you can define two types of models: sequential and more general nonsequential models. You’ll use only sequential models here. Both model types can be found in keras.models. To define a sequential model, you have to add layers to it, just as you did in chapter 5 in your own implementation. Keras layers are available through the keras.layers module. Loading MNIST with Keras is simple; the data set can be found in the keras.datasets module. Let’s import everything you need to tackle this application first.

Listing 6.8. Importing models, layers, and data sets from Keras
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense

Next, you load and preprocess MNIST data, which is achieved in just a few lines. After loading, you flatten the 60,000 training samples and 10,000 test samples, convert them to the float type, and then normalize input data by dividing by 255. This is done because the pixel values of the data set vary from 0 to 255, and you normalize these values to a range of [0, 1], as this will lead to better training of your network. Also, the labels have to be one-hot encoded, just as you did in chapter 5. The following listing shows how to do what we just described with Keras.

Listing 6.9. Loading and preprocessing MNIST data with Keras
(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255

y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)
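
If you want to see what these preprocessing steps do without downloading MNIST, the following NumPy-only sketch applies them to a handful of fake images. The random pixel data and the digit labels are made up for illustration, and the np.eye indexing trick is a stand-in that is assumed to mirror the one-hot rows to_categorical produces for integer labels.

```python
import numpy as np

# Four fake 28 x 28 "images" with pixel values 0..255, plus digit labels.
images = np.random.randint(0, 256, size=(4, 28, 28)).astype('uint8')
digits = np.array([3, 0, 9, 3])

flat = images.reshape(4, 784).astype('float32') / 255   # values now in [0, 1]
one_hot = np.eye(10, dtype='float32')[digits]           # one row per label

print(flat.shape, flat.min() >= 0.0, flat.max() <= 1.0)
print(one_hot[0])
```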

With data ready to go, you can now proceed to define a neural network to run. In Keras, you initialize a Sequential model and then add layers one by one. In the first layer, you have to specify the shape of the input data through the input_shape argument. In our case, input data is a vector of length 784, so you have to provide input_shape=(784,) as shape information. Dense layers in Keras can be created with an activation keyword to provide the layer with an activation function. You’ll choose sigmoid, because it’s the only activation function you know so far. Keras has many more activation functions, some of which we’ll discuss in more detail.

Listing 6.10. Building a simple sequential model with Keras
model = Sequential()
model.add(Dense(392, activation='sigmoid', input_shape=(784,)))
model.add(Dense(196, activation='sigmoid'))
model.add(Dense(10, activation='sigmoid'))
model.summary()

The next step in creating a Keras model is to compile the model with a loss function and an optimizer. You can do this by specifying strings, and you’ll choose sgd (stochastic gradient descent) as the optimizer and mean_squared_error as the loss function. Again, Keras has many more losses and optimizers, but to get started, you’ll use the ones you already encountered in chapter 5. Another argument that you can feed into the compilation step of Keras models is a list of evaluation metrics. For your first application, you’ll use accuracy as the only metric. The accuracy metric indicates how often the model’s highest-scoring prediction matches the true label.

Listing 6.11. Compiling a Keras deep-learning model
model.compile(loss='mean_squared_error',
              optimizer='sgd',
              metrics=['accuracy'])

The final step for this application is to carry out the training step of the network and then evaluate it on test data. This is done by calling fit on your model, passing not only the training data but also the mini-batch size to work with and the number of epochs to run.

Listing 6.12. Training and evaluating a Keras model
model.fit(x_train, y_train,
          batch_size=128,
          epochs=20)
score = model.evaluate(x_test, y_test)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

To recap, building and running a Keras model proceeds in four steps: data preprocessing, model definition, model compilation, and model training plus evaluation. One of the core strengths of Keras is that this four-step cycle can be done quickly, which leads to a fast experimentation cycle. This is of great importance, because often your initial model definition can be improved a lot by tweaking parameters.

6.3.4. Go move prediction with feed-forward neural networks in Keras

Now that you know what the Keras API for sequential neural networks looks like, let’s turn back to our Go move-prediction use case. Figure 6.3 illustrates this step of the process. You’ll first load the generated Go data from section 6.2, as shown in listing 6.13. Note that, as with MNIST before, you need to flatten Go board data into vectors.

Figure 6.3. A neural network can predict game moves. Having already encoded the game state as a matrix, you can feed that matrix to the move-prediction model. The model outputs a vector representing the probability of each possible move.

Listing 6.13. Loading and preprocessing previously stored Go game data
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

np.random.seed(123)                                     1
X = np.load('../generated_games/features-40k.npy')      2
Y = np.load('../generated_games/labels-40k.npy')
samples = X.shape[0]
board_size = 9 * 9

X = X.reshape(samples, board_size)                      3
Y = Y.reshape(samples, board_size)

train_samples = int(0.9 * samples)                      4
X_train, X_test = X[:train_samples], X[train_samples:]
Y_train, Y_test = Y[:train_samples], Y[train_samples:]

  • 1 By setting a random seed, you make sure this script is exactly reproducible.
  • 2 Load the sample data into NumPy arrays.
  • 3 Transform the input into vectors of size 81, instead of 9 × 9 matrices.
  • 4 Hold back 10% of the data for a test set; train on the other 90%.
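
The reshape and split steps are easy to check on synthetic data. This sketch uses zero-filled arrays with an assumed sample count of 100 in place of the real files, so only the shapes are meaningful:

```python
import numpy as np

samples = 100
X = np.zeros((samples, 1, 9, 9))       # encoder output: one 9 x 9 plane each
Y = np.zeros((samples, 81))            # one-hot moves

X = X.reshape(samples, 9 * 9)          # flatten each board to a vector

train_samples = int(0.9 * samples)
X_train, X_test = X[:train_samples], X[train_samples:]
Y_train, Y_test = Y[:train_samples], Y[train_samples:]

print(X_train.shape, X_test.shape)
```

Note that slicing like this takes the test set from the end of the file without shuffling, so the held-back positions come from the last games that were generated.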

Next, let’s define and run a model to predict Go moves for the features X and labels Y you just defined. For a 9 × 9 board, there are 81 possible moves, so you need to predict 81 classes with your network. As a baseline, pretend you just closed your eyes and pointed at a spot on the board at random. There’s a 1 in 81 chance you’d find the next play by pure luck, or 1.2%. So you’d like to see your model significantly exceed 1.2% accuracy.
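
The baseline figure can be checked with a few lines of Python. The sketch below computes the 1-in-81 chance directly and then simulates blind guessing; it ignores passing and illegal points, which is a simplifying assumption, and the seed and trial count are arbitrary.

```python
import random

num_points = 9 * 9
baseline = 1 / num_points
print('Random-guess accuracy: {:.2%}'.format(baseline))

# Simulate blind guessing to confirm the arithmetic.
random.seed(0)
trials = 100000
hits = sum(random.randrange(num_points) == random.randrange(num_points)
           for _ in range(trials))
print('Simulated accuracy: {:.2%}'.format(hits / trials))
```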

You define a simple Keras MLP with three Dense layers, each with sigmoid activation functions, that you compile with mean squared error loss and a stochastic gradient descent optimizer. You then let this network train for 15 epochs and evaluate it on test data.

Listing 6.14. Running a Keras multilayer perceptron on generated Go data
model = Sequential()
model.add(Dense(1000, activation='sigmoid', input_shape=(board_size,)))
model.add(Dense(500, activation='sigmoid'))
model.add(Dense(board_size, activation='sigmoid'))
model.summary()

model.compile(loss='mean_squared_error',
              optimizer='sgd',
              metrics=['accuracy'])

model.fit(X_train, Y_train,
          batch_size=64,
          epochs=15,
          verbose=1,
          validation_data=(X_test, Y_test))

score = model.evaluate(X_test, Y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Running this code, you should see the model summary and evaluation metrics printed to the console:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 1000)              82000
_________________________________________________________________
dense_2 (Dense)              (None, 500)               500500
_________________________________________________________________
dense_3 (Dense)              (None, 81)                40581
=================================================================
Total params: 623,081
Trainable params: 623,081
Non-trainable params: 0
_________________________________________________________________

...

Test loss: 0.0129547887068
Test accuracy: 0.0236486486486

Note the line Trainable params: 623,081 in the output; this means the training process is updating the value of over 600,000 individual weights. This is a rough indicator of the computational intensity of the model. It also gives you a rough sense of the capacity of your model: its ability to learn complex relationships. As you compare different network architectures, the total number of parameters provides a way to approximately compare the total size of the models.
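
You can reproduce the parameter counts in the summary by hand. A dense layer has one weight for every pair of input and output units, plus one bias per output unit; the helper dense_params below is a hypothetical name for that formula, applied to the layer sizes from listing 6.14.

```python
def dense_params(inputs, units):
    # One weight per input-unit pair, plus one bias per unit.
    return inputs * units + units


layer_sizes = [81, 1000, 500, 81]
per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
print(per_layer, sum(per_layer))
```

This prints [82000, 500500, 40581] and 623081, matching the Param # column and the total in the model summary. Most of the parameters sit in the middle layer, which connects the two widest layers.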

As you can see, the prediction accuracy of your experiment is at only around 2.4%, which isn’t satisfying at first sight. But recall that your baseline of randomly guessing moves is about 1.2%. This tells you that although the performance isn’t great, the model is learning and can predict moves better than random.

You can get some insight into the model by feeding it sample board positions. Figure 6.4 shows a board that we contrived to make the right play obvious. Whoever plays next can capture two opponent stones by playing at either A or B. This position doesn’t appear in our training set.

Figure 6.4. An example game position for testing our model. In this position, black can capture two stones by playing at A, or white can capture two stones by playing at B. Whoever plays first in that area has a huge advantage in the game.

Now you can feed that board position into the trained model and print out its predictions.

Listing 6.15. Evaluating the model on a known board position
test_board = np.array([[
    0, 0,  0,  0,  0, 0, 0, 0, 0,
    0, 0,  0,  0,  0, 0, 0, 0, 0,
    0, 0,  0,  0,  0, 0, 0, 0, 0,
    0, 1, -1,  1, -1, 0, 0, 0, 0,
    0, 1, -1,  1, -1, 0, 0, 0, 0,
    0, 0,  1, -1,  0, 0, 0, 0, 0,
    0, 0,  0,  0,  0, 0, 0, 0, 0,
    0, 0,  0,  0,  0, 0, 0, 0, 0,
    0, 0,  0,  0,  0, 0, 0, 0, 0,
]])
move_probs = model.predict(test_board)[0]
i = 0
for row in range(9):
    row_formatted = []
    for col in range(9):
        row_formatted.append('{:.3f}'.format(move_probs[i]))
        i += 1
    print(' '.join(row_formatted))

The output looks something like this:

0.037 0.037 0.038 0.037 0.040 0.038 0.039 0.038 0.036
0.036 0.040 0.040 0.043 0.043 0.041 0.042 0.039 0.037
0.039 0.042 0.034 0.046 0.042 0.044 0.039 0.041 0.038
0.039 0.041 0.044 0.046 0.046 0.044 0.042 0.041 0.038
0.042 0.044 0.047 0.041 0.045 0.042 0.045 0.042 0.040
0.038 0.042 0.045 0.045 0.045 0.042 0.045 0.041 0.039
0.036 0.040 0.037 0.045 0.042 0.045 0.037 0.040 0.037
0.039 0.040 0.041 0.041 0.043 0.043 0.041 0.038 0.037
0.036 0.037 0.038 0.037 0.040 0.039 0.037 0.039 0.037

This matrix maps to the original 9 × 9 board: each number represents the model’s confidence that it should play on that point next. This result isn’t too impressive; it hasn’t even learned not to play on a spot where there’s already a stone. But notice that the scores for the edge of the board are consistently lower than scores closer to the center. The conventional wisdom in Go is that you should avoid playing on the very edge of the board, except at the end of the game and other special situations. So the model has learned a legitimate concept about the game: not by understanding strategy or efficiency, but just by copying what our MCTS bot does. This model isn’t likely to predict many great moves, but it has learned to avoid a whole class of poor moves.
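One of the shortcomings just mentioned, assigning probability to occupied points, can be worked around outside the model. The following is a minimal sketch, not part of the chapter’s code: the helper name `mask_occupied` is hypothetical, and a made-up uniform `move_probs` vector stands in for real model output. It only filters occupied points; other illegal moves (such as ko violations) would still need the rules engine from chapter 3.

```python
import numpy as np


def mask_occupied(move_probs, board):
    """Zero out probabilities on occupied points and renormalize the rest."""
    masked = np.where(board == 0, move_probs, 0.0)
    return masked / masked.sum()


# Flattened 9 x 9 board from listing 6.15: 1 = black, -1 = white, 0 = empty.
board = np.zeros(81, dtype=int)
board[[28, 30, 37, 39, 47]] = 1
board[[29, 31, 38, 40, 48]] = -1

# Stand-in for model.predict(...)[0]; real values would come from the trained model.
move_probs = np.full(81, 1.0 / 81)

legal_probs = mask_occupied(move_probs, board)
best = int(np.argmax(legal_probs))
row, col = divmod(best, 9)   # convert the flat index back to board coordinates
print(row, col)
```

The masked vector still sums to 1, so it can be read as a probability distribution over the empty points only.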

This is real progress, but you can do better. The rest of this chapter addresses shortcomings of your first experiment and improves Go move-prediction accuracy along the way. You’ll take care of the following points:

  • The data you’re using for this prediction task has been generated by using tree search, which has a strong element of randomness. Sometimes MCTS engines generate strange moves, especially when they’re either far ahead or far behind in the game. In chapter 7, you’ll create a deep-learning model from human game play. Of course, humans are also unpredictable, but they’re less likely to play nonsense moves.
  • The neural network architecture you used can be vastly improved. Multilayer perceptrons aren’t well suited to capture Go board data. You have to flatten the two-dimensional board data to a flat vector, thereby losing all spatial information about the board. In section 6.4, you’ll learn about a new type of network that’s much better at capturing the Go board structure.
  • Throughout all networks so far, you used only the sigmoid activation function. In sections 6.5 and 6.6, you’ll learn about two new activation functions that often lead to better results.
  • Up to this point, you’ve used MSE only as a loss function, which is intuitive, but not well suited for your use case. In section 6.5, you’ll use a loss function that’s tailored to classification tasks like ours.

Having addressed most of these points, at the end of this chapter you’ll be able to build a neural network that can predict moves better than your first shot. You’ll learn key techniques to build a significantly stronger bot in chapter 7.

Keep in mind that, ultimately, you’re not interested in predicting moves as accurately as possible, but in creating a bot that can play as well as possible. Even if your deep neural networks may never become extraordinarily good at predicting the next move from historical data, the power of deep learning is that they’ll still implicitly pick up the structure of the game and play reasonable or even very good moves.

6.4. Analyzing space with convolutional networks

In Go, you often see particular local patterns of stones over and over again. Human players learn to recognize dozens of these shapes, and often give them evocative names (like tiger’s mouth, bamboo joint, or my personal favorite, the rabbitty six). To make decisions like a human, our Go AI will also have to recognize many local spatial arrangements. A particular type of neural network called a convolutional network is specially designed for detecting spatial relationships like this. Convolutional neural networks, or CNNs, have many applications beyond games: you’ll find them applied to images, audio, and even text. This section shows how to build CNNs and apply them to Go game data. First, we introduce the concept of convolution. Next, we show how to build CNNs in Keras. Finally, we show useful ways to process the output of a convolutional layer.

6.4.1. What convolutions do intuitively

Convolutional layers and the networks we build from them get their name from a traditional operation from computer vision: convolutions. Convolutions are a straightforward way to transform an image or apply a filter, if you will. For two matrices of the same size, a simple convolution is computed by doing the following:

  1. Multiplying these two matrices element by element
  2. Summing up all the values of the resulting matrix

The output of such a simple convolution is a scalar value. Figure 6.5 shows an example of such an operation, convolving two 3 × 3 matrices to compute a scalar.

Figure 6.5. In a simple convolution, you multiply two matrices of the same size element by element and then sum up all the values.
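The two steps of a simple convolution translate almost directly into NumPy. This is a small sketch; the two matrices are made-up values chosen for easy mental arithmetic, not the ones shown in figure 6.5.

```python
import numpy as np


def simple_convolution(a, b):
    """Multiply two equally sized matrices element by element, then sum."""
    assert a.shape == b.shape
    return np.sum(a * b)


patch = np.array([[1, 0, 1],
                  [0, 1, 0],
                  [1, 0, 1]])
kernel = np.array([[0, 1, 0],
                   [1, 1, 1],
                   [0, 1, 0]])
print(simple_convolution(patch, kernel))  # 1: only the center entries are both nonzero
```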

These simple convolutions alone don’t help you right away, but they can be used to compute more-complex convolutions that prove useful for your use case. Instead of starting with two matrices of the same size, let’s fix the size of the second matrix and increase the size of the first one arbitrarily. In this scenario, you call the first matrix the input image and the second one the convolutional kernel, or simply kernel (sometimes you also see filter used). Because the kernel is smaller than the input image, you can compute simple convolutions on many patches of the input image. In figure 6.6, you see such a convolution operation of a 10 × 10 input image with a 3 × 3 kernel in action.

Figure 6.6. By passing a convolutional kernel over patches of an input image, you can compute a convolution of the image with the kernel. The kernel chosen in this example is a vertical edge detector.

The example in figure 6.6 might give you a first hint at why convolutions are interesting for us. The input image is a 10 × 10 matrix consisting of a center 4 × 8 block of 1s surrounded by 0s. The kernel is chosen so that the first column of the matrix (–1, –2, –1) is the negative of the third column (1, 2, 1), and the middle column is all 0s. Therefore, the following points are true:

  • Whenever you apply this kernel to a 3 × 3 patch of the input image in which all pixel values are the same, the output of the convolution will be 0.
  • When you apply this convolutional kernel to an image patch in which the left column has higher values than the right, the convolution will be negative.
  • When you apply this convolutional kernel to an image patch in which the right column has higher values than the left, the convolution will be positive.

The convolutional kernel is chosen to detect vertical edges in the input image. Edges on the left of an object will have positive values; edges on the right will have negative ones. This is exactly what you can see in the result of the convolution in figure 6.6.
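The behavior described above can be reproduced with a few lines of NumPy. This is a sketch under stated assumptions: the function name `convolve2d` is ours, the exact placement of the block of 1s is assumed (figure 6.6 may position it differently), and the loop-based implementation favors clarity over speed.

```python
import numpy as np


def convolve2d(image, kernel):
    """Slide the kernel over every patch of the image (no padding, stride 1)."""
    k_rows, k_cols = kernel.shape
    out_rows = image.shape[0] - k_rows + 1
    out_cols = image.shape[1] - k_cols + 1
    output = np.zeros((out_rows, out_cols))
    for i in range(out_rows):
        for j in range(out_cols):
            patch = image[i:i + k_rows, j:j + k_cols]
            output[i, j] = np.sum(patch * kernel)   # one simple convolution
    return output


# 10 x 10 image with a block of 1s in the middle (placement assumed).
image = np.zeros((10, 10))
image[1:9, 3:7] = 1

# Sobel kernel for vertical edges: left column negative, right column positive.
sobel = np.array([[-1, 0, 1],
                  [-2, 0, 2],
                  [-1, 0, 1]])

result = convolve2d(image, sobel)
print(result.shape)   # (8, 8)
print(result[3])      # one output row that crosses the block
```

In the printed row, the entries next to the left edge of the block are 4, those next to the right edge are –4, and everything else is 0, which matches the three bullet points.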

The kernel in figure 6.6 is a classical kernel used in many applications and is called a Sobel kernel. If you rotate this kernel by 90 degrees, you end up with a horizontal edge detector. In the same way, you can define convolutional kernels that blur or sharpen an image, detect corners, and many other things. Many of these kernels can be found in standard image-processing libraries.

What’s interesting is to see that convolutions can be used to extract valuable information from image data, which is exactly what you intend to do for your use case of predicting the next move from Go data. Although in the preceding example we chose a particular convolutional kernel, the way convolutions are used in neural networks is that these kernels are learned from data by backpropagation.

So far, we’ve discussed how to apply one convolutional kernel to one input image only. In general, it’s useful to apply many kernels to many images to produce many output images. How can you do this? Let’s say you have four input images and define four kernels. Then you can sum up the convolutions for each input and arrive at one output image. In what follows, you’ll call the output images of such convolutions feature maps. Now, if you want to have five resulting feature maps instead of one, you define five kernels per input image instead of one. Mapping n input images to m feature maps, by using n × m convolutional kernels, is called a convolutional layer. Figure 6.7 illustrates this situation.

Figure 6.7. In a convolutional layer, a number of input images is operated on by convolutional kernels to produce a specified number of feature maps.

Seen this way, a convolutional layer is a way to transform a number of input images into output images, thereby extracting relevant spatial information from the input. In particular, as you might have anticipated, convolutional layers can be chained, thereby forming a neural network of convolutional layers. Usually, a network that consists of only convolutional and dense layers is referred to as a convolutional neural network, or simply a convolutional network.

Tensors in deep learning

We stated that the output of a convolutional layer is a bunch of images. Although it can certainly be helpful to think of it that way, there’s also a bit more to it. Although vectors (1D) consist of individual entries, they’re not just a bunch of numbers. In the same way, matrices (2D) consist of column vectors, but have an inherent two-dimensional structure that’s used in matrix multiplications and other operations (such as convolutions). The output of a convolutional layer has a three-dimensional structure. The filters in a convolutional layer have one more dimension still and possess a 4D structure (a 2D filter for each combination of input and output image). And it doesn’t stop there: it’s common for advanced deep-learning techniques to deal with even higher-dimensional data structures.

In linear algebra, the higher-dimensional equivalent of vectors and matrices is tensors. Appendix A goes into a little more detail, but we can’t go into the definition of tensors here. For the rest of this book, you don’t need any formal definition of tensors. But beyond being a concept worth having heard of, tensors give us convenient terminology that we use in later chapters. For instance, the collection of images coming out of a convolutional layer can be referred to as a 3-tensor. The 4D filters in a convolutional layer form a 4-tensor. So you could say that a convolution is an operation in which a 4-tensor (the convolutional filters) operates on a 3-tensor (the input images) to transform it into another 3-tensor.
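To make the shapes concrete, here is a rough NumPy sketch of a convolutional layer mapping n input images to m feature maps. The function name `conv_layer`, the random test data, and the axis order (height, width, images) are assumptions made for illustration; biases and activations are left out, and real frameworks use far more efficient implementations.

```python
import numpy as np


def conv_layer(inputs, kernels):
    """Apply a 4-tensor of kernels to a 3-tensor of input images.

    inputs:  shape (height, width, n), that is, n input images
    kernels: shape (k, k, n, m), one k x k kernel per input/output pair
    returns: shape (height - k + 1, width - k + 1, m), that is, m feature maps
    """
    height, width, n = inputs.shape
    k, _, n_kernels, m = kernels.shape
    assert n == n_kernels
    out = np.zeros((height - k + 1, width - k + 1, m))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = inputs[i:i + k, j:j + k, :]   # shape (k, k, n)
            for f in range(m):
                # Sum the simple convolutions over all n input images.
                out[i, j, f] = np.sum(patch * kernels[:, :, :, f])
    return out


rng = np.random.default_rng(0)
inputs = rng.normal(size=(9, 9, 4))       # four 9 x 9 input images
kernels = rng.normal(size=(3, 3, 4, 5))   # 4 x 5 = 20 kernels of size 3 x 3
feature_maps = conv_layer(inputs, kernels)
print(feature_maps.shape)                 # (7, 7, 5)
```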

More generally, you can say that a sequential neural network is a mechanism that transforms tensors of varying dimension step-by-step. This idea of input data “flowing” through a network by using tensors is what led to the name TensorFlow, Google’s popular machine-learning library that you’ll use to run your Keras models.

Note that in all of this discussion, we’ve talked only about how to feed data through a convolutional layer, but not how backpropagation would work. We leave this part out on purpose, because it would mathematically go beyond the scope of this book, but more importantly, Keras takes care of the backward pass for us.

Generally, a convolutional layer has a lot fewer parameters than a comparable dense layer. If you were to define a convolutional layer with kernel size (3, 3) on a 28 × 28 input image, leading to an output of size 26 × 26, the convolutional layer would have 3 × 3 = 9 parameters. In a convolutional layer, you’ll usually have a bias term as well that’s added to the output of each convolution, resulting in a total of 10 parameters. If you compare this to a dense layer connecting an input vector of length 28 × 28 to an output vector of length 26 × 26, such a dense layer would have 28 × 28 × 26 × 26 = 529,984 parameters, excluding biases. At the same time, convolution operations are computationally more costly than regular matrix multiplications used in dense layers.
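The arithmetic above can be checked with two small helper functions. This is an illustrative sketch; the names `conv2d_params` and `dense_params` are ours and not part of Keras.

```python
def conv2d_params(kernel_rows, kernel_cols, input_images, feature_maps):
    """Kernel weights plus one bias per feature map."""
    weights = kernel_rows * kernel_cols * input_images * feature_maps
    return weights + feature_maps


def dense_params(inputs, outputs, use_bias=True):
    """One weight per input/output pair, plus an optional bias per output."""
    return inputs * outputs + (outputs if use_bias else 0)


print(conv2d_params(3, 3, 1, 1))                        # 10
print(dense_params(28 * 28, 26 * 26, use_bias=False))   # 529984
print(conv2d_params(3, 3, 1, 48))                       # 480
```

The last line applies the same formula to a layer with one input plane and 48 filters of size 3 × 3, which is the configuration of the first convolutional layer used later in this chapter.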

6.4.2. Building convolutional neural networks with Keras

To build and run convolutional neural networks with Keras, you need to work with a new layer type called Conv2D that carries out convolutions on two-dimensional data, such as Go board data. You’ll also get to know another layer called Flatten that flattens the output of a convolutional layer into vectors, which can then be fed into a dense layer.

To start, the preprocessing step for your input data now looks a little different than before. Instead of flattening the Go board, you keep its two-dimensional structure intact.

Listing 6.16. Loading and preprocessing Go data for convolutional neural networks
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Conv2D, Flatten            # 1

np.random.seed(123)
X = np.load('../generated_games/features-40k.npy')
Y = np.load('../generated_games/labels-40k.npy')

samples = X.shape[0]
size = 9
input_shape = (size, size, 1)                       # 2

X = X.reshape(samples, size, size, 1)               # 3

train_samples = int(0.9 * samples)
X_train, X_test = X[:train_samples], X[train_samples:]
Y_train, Y_test = Y[:train_samples], Y[train_samples:]

  • 1 Import two new layers: a 2D convolutional layer and one that flattens its input to vectors.
  • 2 The input data shape is three-dimensional; you use one plane of a 9 × 9 board representation.
  • 3 Then reshape your input data accordingly.

Now you can use the Keras Conv2D object to build the network. You use two convolutional layers, and then flatten the output of the second and follow up with two dense layers to arrive at an output of size 9 × 9, as before.

Listing 6.17. Building a simple convolutional neural network for Go data with Keras
model = Sequential()
model.add(Conv2D(filters=48,                         # 1
                 kernel_size=(3, 3),                 # 2
                 activation='sigmoid',
                 padding='same',                     # 3
                 input_shape=input_shape))

model.add(Conv2D(48, (3, 3),                         # 4
                 padding='same',
                 activation='sigmoid'))

model.add(Flatten())                                 # 5

model.add(Dense(512, activation='sigmoid'))
model.add(Dense(size * size, activation='sigmoid'))  # 6
model.summary()

  • 1 The first layer in your network is a Conv2D layer with 48 output filters.
  • 2 For this layer, you choose a 3 × 3 convolutional kernel.
  • 3 Normally, the output of a convolution is smaller than the input. By adding padding='same', you ask Keras to pad your matrix with 0s around the edges, so the output has the same dimension as the input.
  • 4 The second layer is another convolution. You leave out the filters and kernel_size keyword names for brevity and pass the values positionally.
  • 5 You then flatten the 3D output of the previous convolutional layer...
  • 6 ...and follow up with two more dense layers, as you did in the MLP example.

The compiling, running, and evaluating of this model can stay exactly the same as in the MLP example. The only things you changed are the input data shape and the specification of the model itself.

If you run the preceding model, you’ll see that the test accuracy has barely budged: it should land somewhere around 2.3% again. That’s completely fine—you have a few more tricks to unlock the full power of your convolutional model. For the rest of this chapter, you’ll introduce more-advanced deep-learning techniques to improve your move-prediction accuracy.

6.4.3. Reducing space with pooling layers

One common technique that you’ll find in most deep-learning applications featuring convolutional layers is that of pooling. You use pooling to downsize images, to reduce the number of neurons a previous layer has.

The concept of pooling is easily explained: you down-sample images by grouping or pooling patches of the image into a single value. The example in figure 6.8 demonstrates how to cut an image by a factor of 4 by keeping only the maximum value in each disjoint 2 × 2 patch of the image.

Figure 6.8. Reducing an 8 × 8 image to an image of size (4, 4) by applying a 2 × 2 max pooling kernel

This technique is called max pooling, and the size of the disjoint patches used for pooling is referred to as pool size. You can define other types of pooling as well; for instance, computing the average of the values in a patch. This version is called average pooling.
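Both pooling variants fit in a few lines of NumPy. The sketch below is for illustration only: the function name `pool2d` is ours, the input values are made up, and the image dimensions are assumed to be divisible by the pool size.

```python
import numpy as np


def pool2d(image, pool_size=2, mode='max'):
    """Reduce each disjoint pool_size x pool_size patch to a single value."""
    rows, cols = image.shape
    assert rows % pool_size == 0 and cols % pool_size == 0
    # Split both axes into (number of patches, position inside the patch).
    patches = image.reshape(rows // pool_size, pool_size,
                            cols // pool_size, pool_size)
    if mode == 'max':
        return patches.max(axis=(1, 3))
    return patches.mean(axis=(1, 3))


image = np.arange(64).reshape(8, 8)   # entries 0..63, row by row
pooled = pool2d(image)
print(pooled.shape)                   # (4, 4)
print(pooled[0])                      # [ 9 11 13 15]
print(pool2d(image, mode='average')[0, 0])   # 4.5
```

The top-left patch contains 0, 1, 8, and 9, so max pooling keeps 9 and average pooling yields 4.5.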

You can define a pooling layer in Keras, usually preceding or following a convolutional layer, as follows.

Listing 6.18. Adding a max pooling layer of pool size (2, 2) to a Keras model
model.add(MaxPooling2D(pool_size=(2, 2)))

You can also experiment with replacing MaxPooling2D with AveragePooling2D in listing 6.18. In cases such as image recognition, pooling is in practice often indispensable to reduce the output size of convolutional layers. Although the operation loses a little information by down-sampling images, it usually retains enough to make accurate predictions while drastically reducing the amount of computation needed.

Before you see pooling layers in action, let’s discuss a few other tools that will make your Go move predictions much more accurate.

6.5. Predicting Go move probabilities

Since we first introduced neural networks in chapter 5, you’ve used only a single activation function: the logistic sigmoid function. Also, you’ve been using mean squared error as a loss function throughout. Both choices are good first guesses and certainly have their place in your deep-learning toolbox, but aren’t particularly well suited for our use case.

In the end, when predicting Go moves, what you’re really after is this question: for each possible move on the board, how likely is it that this move is the next move? At each point in time, many good moves are usually available on the board. You set up your deep-learning experiments to find the next move from the data you feed into the algorithm, but ultimately the promise of representation learning, and deep learning in particular, is that you can learn enough about the structure of the game to predict the likelihood of a move. You want to predict a probability distribution of all possible moves. This can’t be guaranteed with sigmoid activation functions. Instead, you introduce the softmax activation function, which is used to predict probabilities in the last layer.

6.5.1. Using the softmax activation function in the last layer

The softmax activation function is a straightforward generalization of the logistic sigmoid σ. To compute the softmax function for a vector $x = (x_1, \ldots, x_l)$, you first apply the exponential function to each component; you compute $e^{x_i}$. Then you normalize each of these values by the sum of all values:

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{l} e^{x_j}}$$

By definition, the components of the softmax function are non-negative and add up to 1, meaning the softmax spits out probabilities. Let’s compute an example to see how it works.

Listing 6.19. Defining the softmax activation function in Python
import numpy as np


def softmax(x):
    e_x = np.exp(x)
    e_x_sum = np.sum(e_x)
    return e_x / e_x_sum

x = np.array([100, 100])
print(softmax(x))

After defining softmax in Python, you compute it on a vector of length 2; namely, x = (100, 100). If you compute the sigmoid of x, the outcome will be close to (1, 1). But computing the softmax for this example yields (0.5, 0.5). This is what you should’ve expected: because the values of the softmax function sum up to 1, and both entries are the same, softmax assigns both components equal probability.
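One practical caveat: the exponential grows so quickly that, with 64-bit floats, `np.exp` overflows for inputs around 1,000, and the version in listing 6.19 would then return `nan`. A common remedy, sketched here under the name `stable_softmax` (our choice), is to subtract the largest entry first. This leaves the result unchanged, because the common factor cancels in numerator and denominator.

```python
import numpy as np


def stable_softmax(x):
    """Softmax with the maximum subtracted first to avoid overflow in exp."""
    shifted = x - np.max(x)   # the largest entry becomes 0
    e_x = np.exp(shifted)
    return e_x / np.sum(e_x)


print(stable_softmax(np.array([100, 100])))     # [0.5 0.5]
print(stable_softmax(np.array([1000, 1000])))   # [0.5 0.5]
```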

Most often, you see the softmax activation function applied as the activation function of the last layer in a neural network, so that you get a guarantee on predicting output probabilities.

Listing 6.20. Using softmax as the activation function of the last layer in a Keras model
model.add(Dense(9*9, activation='softmax'))

6.5.2. Cross-entropy loss for classification problems

In the preceding chapter, you started out with mean squared error as your loss function and we remarked that it’s not the best choice for your use case. To follow up on this, let’s have a closer look at what might go wrong and propose a viable alternative.

Recall that you formulated your move-prediction use case as a classification problem, in which you have 9 × 9 possible classes, only one of which is correct. The correct class is labeled as 1, and all others are labeled as 0. Your predictions for each class will always be a value between 0 and 1. This is a strong assumption on the way your prediction data looks, and the loss function you’re using should reflect that. If you look at what MSE does, taking the square of the difference between prediction and label, it makes no use of the fact that you’re constrained to a range of 0 to 1. In fact, MSE works best for regression problems, in which the output is a continuous range. Think of predicting the height of a person. In such scenarios, MSE will penalize large differences. In your scenario, the absolute largest difference between prediction and actual outcome is 1.

Another problem with MSE is that it penalizes all 81 prediction values the same way. In the end, you’re concerned only with predicting the one true class, labeled 1. Let’s say you have a model that predicts the correct move with a value of 0.6 and all others 0, except for one, which the model assigns to 0.4. In this situation, the squared error (summed over all 81 values; dividing by 81 gives the mean) is (1 – 0.6)² + (0 – 0.4)² = 2 × 0.4², or about 0.32. Your prediction is correct, but you assign the same loss value to both nonzero predictions: about 0.16. Is it really worth putting the same emphasis on the smaller value? If you compare this to the situation in which the correct move gets 0.6 again, but two other moves receive a prediction of 0.2, then the squared error is 0.4² + 2 × 0.2², or roughly 0.24, a significantly lower value than in the preceding scenario. But what if the value 0.4 really is more accurate, in that it’s just a strong move that may also be a candidate for the next move? Should you really penalize this with your loss function?
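The two scenarios can be recomputed in plain Python. This sketch lists only the three entries that matter; the remaining 78 entries are 0 in both label and prediction and contribute nothing. Like the numbers in the text, it sums the squared differences without dividing by 81. The function name `squared_error` is ours.

```python
def squared_error(label, prediction):
    """Sum of squared differences between label and prediction."""
    return sum((t - p) ** 2 for t, p in zip(label, prediction))


label = [1.0, 0.0, 0.0]        # the first entry is the move actually played
one_rival = [0.6, 0.4, 0.0]    # one other move gets 0.4
two_rivals = [0.6, 0.2, 0.2]   # two other moves get 0.2 each

print(round(squared_error(label, one_rival), 2))    # 0.32
print(round(squared_error(label, two_rivals), 2))   # 0.24
```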

To take care of these issues, we introduce the categorical cross-entropy loss function, or cross-entropy loss for short. For labels ŷ and predictions y of a model, this loss function is defined as follows:

$$\mathrm{loss}(\hat{y}, y) = -\sum_{i} \hat{y}_i \log(y_i)$$

Note that although this might look like a sum consisting of many terms, involving a lot of computation, for our use case this formula boils down to just a single term: the one for which $\hat{y}_i$ is 1. The cross-entropy error is simply $-\log(y_i)$ for the index i for which $\hat{y}_i = 1$. Simple enough, but what do you gain from this?

  • Because cross-entropy loss penalizes only the term for which the label is 1, the distribution of all other values isn’t directly affected by it. In particular, in the scenario in which you predict the correct next move with a probability of 0.6, there’s no difference between attributing one other move a likelihood of 0.4 or two with 0.2. The cross-entropy loss is –log(0.6) = 0.51 in both cases.
  • Cross-entropy loss is tailored to a range of [0,1]. If your model predicts a probability of 0 for the move that actually happened, that’s as wrong as it can get. You know that log(1) = 0 and that –log(x) for x between 0 and 1 approaches infinity as x approaches 0, meaning that –log(x) becomes arbitrarily large (and doesn’t just grow quadratically, as MSE).
  • On top of that, MSE falls off more quickly as x approaches 1, meaning that you get a much smaller loss for less-confident predictions. Figure 6.9 gives a visual comparison of MSE and cross-entropy loss.

Figure 6.9. Plot of MSE vs. cross-entropy loss for the class labeled as 1. Cross-entropy loss attributes a higher loss for each value in the range [0,1].
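The points in the list above can be checked numerically. This is a minimal sketch using the natural logarithm; the function name `cross_entropy` is ours, and only the three relevant entries of the 81-element vectors are listed.

```python
import math


def cross_entropy(label, prediction):
    """Categorical cross-entropy; terms with a label of 0 contribute nothing."""
    return -sum(t * math.log(p) for t, p in zip(label, prediction) if t > 0)


label = [1.0, 0.0, 0.0]        # the first entry is the move actually played
one_rival = [0.6, 0.4, 0.0]
two_rivals = [0.6, 0.2, 0.2]
confident_miss = [0.01, 0.99, 0.0]   # almost all probability on a wrong move

print(round(cross_entropy(label, one_rival), 2))       # 0.51
print(round(cross_entropy(label, two_rivals), 2))      # 0.51
print(round(cross_entropy(label, confident_miss), 2))  # 4.61
```

The first two results are identical because only the probability of the correct move enters the loss. The third shows how steeply the loss rises as that probability approaches 0.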

Another crucial point that distinguishes cross-entropy loss from MSE is its behavior during learning with stochastic gradient descent (SGD). As a matter of fact, the gradient updates for MSE get smaller and smaller as you approach higher prediction values (y getting closer to 1); learning typically slows down. Compared to this, cross-entropy loss doesn’t show this slowdown in SGD, and the parameter updates are proportional to the difference between prediction and true value. We can’t go into the details here, but this represents a tremendous benefit for our move-prediction use case.
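For the common pairing of a softmax output layer with cross-entropy loss, the gradient of the loss with respect to the inputs of the softmax is known to equal the prediction minus the one-hot label. The following sketch checks this numerically with finite differences; the input values are arbitrary, and it illustrates the claim rather than deriving it.

```python
import numpy as np


def softmax(z):
    e_z = np.exp(z - np.max(z))
    return e_z / np.sum(e_z)


def cross_entropy(z, label):
    """Cross-entropy loss of softmax(z) against a one-hot label."""
    return -np.sum(label * np.log(softmax(z)))


z = np.array([0.5, -1.0, 2.0])       # arbitrary inputs to the softmax layer
label = np.array([0.0, 0.0, 1.0])    # one-hot label

# Central finite differences approximate the gradient numerically.
eps = 1e-6
numeric_grad = np.zeros_like(z)
for i in range(len(z)):
    step = np.zeros_like(z)
    step[i] = eps
    numeric_grad[i] = (cross_entropy(z + step, label)
                       - cross_entropy(z - step, label)) / (2 * eps)

analytic_grad = softmax(z) - label   # prediction minus true value
print(numeric_grad)
print(analytic_grad)
```

The two printed vectors agree to several decimal places, so the size of the update depends directly on how far the prediction is from the label.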

Compiling a Keras model with categorical cross-entropy loss, instead of MSE, is again straightforward.

Listing 6.21. Compiling a Keras model with categorical cross-entropy
model.compile(loss='categorical_crossentropy'...)

With cross-entropy loss and softmax activations in your tool belt, you’re now much better equipped to deal with categorical labels and predicting probabilities with a neural network. To finish off this chapter, let’s add two techniques that will allow you to build deeper networks—networks with more layers.

6.6. Building deeper networks with dropout and rectified linear units

So far, you haven’t built a neural network with more than two to four layers. It might be tempting to just add more of the same in the hope that results will improve. It’d be great if it were that simple, but in practice you have a few aspects to consider. Although continually building deeper and deeper neural networks increases the number of parameters a model has and thereby its capacity to adapt to data you feed into it, you may also run into trouble doing so. Among the prime reasons that this might fail is overfitting: your model gets better and better at predicting training data, but performs suboptimally on test data. Taken to an extreme, there’s no use for a model that can near perfectly predict, or even memorize, what it has seen before, but doesn’t know what to do when confronted with data that’s a little different. You need to be able to generalize. This is particularly true for predicting the next move in a game as complex as Go. No matter how much time you spend on collecting training data, situations will always arise in game play that your model hasn’t encountered before. In any case, it’s important to find a strong next move.

6.6.1. Dropping neurons for regularization

Preventing overfitting is a common challenge in machine learning in general. You can find a lot of literature about regularization techniques that are designed to address the issue of overfitting. For deep neural networks, you can apply a surprisingly simple, yet effective, technique called dropout. With dropout applied to a layer in a network, for each training step you pick a number of neurons at random and set them to 0; you drop these neurons entirely from the training procedure. At each training step, you randomly select new neurons to drop. This is usually done by specifying a dropout rate, the percentage of neurons to drop for the layer at hand. Figure 6.10 shows an example dropout layer in which probabilistically half of the neurons are dropped for each mini-batch (forward and backward pass).

Figure 6.10. A dropout layer with a rate of 50% will randomly drop half of the neurons from the computation for each mini-batch of data fed into the network.

The rationale behind this process is that by dropping neurons randomly, you prevent individual layers, and thereby the whole network, from specializing too much to the given data. Layers have to be flexible enough not to rely too much on individual neurons. By doing so, you can keep your neural network from overfitting.
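To see what a dropout layer does to its input during training, consider this NumPy sketch. It follows the "inverted dropout" convention, in which the surviving activations are scaled up by 1 / (1 – rate) so their expected sum stays the same; Keras’s Dropout layer is documented to scale its inputs this way. The function name `dropout` and the seed are our own choices.

```python
import numpy as np


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; scale the rest by 1 / (1 - rate)."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)


rng = np.random.default_rng(42)   # seed chosen arbitrarily for reproducibility
activations = np.ones(10)
print(dropout(activations, rate=0.5, rng=rng))   # each entry is either 0.0 or 2.0
print(dropout(activations, rate=0.5, rng=rng))   # a new random mask is drawn per call
```

At prediction time, dropout is switched off and all neurons take part in the computation.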

In Keras, you can define a Dropout layer with a dropout rate as follows.

Listing 6.22. Importing and adding a Dropout layer to a Keras model
from keras.layers import Dropout

...
model.add(Dropout(rate=0.25))

You can add dropout layers like this in a sequential network before or after every other layer available. Especially in deeper architectures, adding dropout layers is often indispensable.

6.6.2. The rectified linear unit activation function

As a last building block for this chapter, you’ll get to know the rectified linear unit (ReLU) activation function, which turns out to often yield better results for deep networks than sigmoid and other activation functions. Figure 6.11 shows what ReLU looks like.

Figure 6.11. The ReLU activation function sets negative inputs to 0 and leaves positive inputs as is.

ReLU ignores negative inputs by setting them to 0 and returns positive inputs unchanged. The stronger the positive signal, the stronger the activation with ReLUs. Given this interpretation, rectified linear unit activation functions are pretty close to a simple model of neurons in the brain, in which weaker signals are ignored, but stronger ones lead to the neuron firing. Beyond this basic analogy, we’re not going to argue for or against any theoretical benefits of ReLUs, but just note that using them often leads to satisfactory results. To use ReLUs in Keras, you replace sigmoid with relu in the activation argument of your layers.

Listing 6.23. Adding a rectified linear activation to a Dense layer
from keras.layers import Dense

...
model.add(Dense(512, activation='relu'))  # Dense requires a layer size; 512 matches the hidden layer used elsewhere in this chapter
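What the relu activation computes can be sketched in NumPy and compared with the sigmoid used so far. The sample inputs are arbitrary.

```python
import numpy as np


def relu(x):
    """Set negative inputs to 0 and leave positive inputs unchanged."""
    return np.maximum(0, x)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))      # negatives become 0; 0.5 and 2.0 pass through unchanged
print(sigmoid(x))   # every value is squashed into the range (0, 1)
```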

6.7. Putting it all together for a stronger Go move-prediction network

The preceding sections covered a lot of ground and introduced not only convolutional networks with max pooling layers, but also cross-entropy loss, the softmax activation for last layers, dropout for regularization, and ReLU activations to improve performance of your networks. To conclude this chapter, let’s put every new ingredient you learned about together into a neural network for your Go move-prediction use case and see how well you do now.

To start, let’s recall how to load Go data, encoded with your simple one-plane encoder, and reshape it for a convolutional network.

Listing 6.24. Loading and preprocessing Go data for convolutional neural networks
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D

np.random.seed(123)
X = np.load('../generated_games/features-40k.npy')
Y = np.load('../generated_games/labels-40k.npy')

samples = X.shape[0]
size = 9
input_shape = (size, size, 1)

X = X.reshape(samples, size, size, 1)

train_samples = int(0.9 * samples)
X_train, X_test = X[:train_samples], X[train_samples:]
Y_train, Y_test = Y[:train_samples], Y[train_samples:]

Next, let’s enhance your previous convolutional network from listing 6.17 as follows:

  • Keep the basic architecture intact, starting with two convolutional layers, then a max pooling layer, and two dense layers to finish off.
  • Add three dropout layers for regularization: one after each convolutional layer and one after the first dense layer. Use a dropout rate of 50%.
  • Change the output layer to a softmax activation, and the inner layers to ReLU activations.
  • Change the loss function to cross-entropy loss instead of mean squared error.

Let’s see what this model looks like in Keras.

Listing 6.25. Building a convolutional network for Go data with dropout and ReLUs
model = Sequential()
model.add(Conv2D(48, kernel_size=(3, 3),
                 activation='relu',
                 padding='same',
                 input_shape=input_shape))
model.add(Dropout(rate=0.5))
model.add(Conv2D(48, (3, 3),
                 padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(rate=0.5))
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dropout(rate=0.5))
model.add(Dense(size * size, activation='softmax'))
model.summary()

model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])

Finally, to evaluate this model, you can run the following code.

Listing 6.26. Evaluating your enhanced convolutional network
model.fit(X_train, Y_train,
          batch_size=64,
          epochs=100,
          verbose=1,
          validation_data=(X_test, Y_test))
score = model.evaluate(X_test, Y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Note that this example increases the number of epochs to 100, whereas you used 15 before. The output looks something like this:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_1 (Conv2D)            (None, 9, 9, 48)          480
_________________________________________________________________
dropout_1 (Dropout)          (None, 9, 9, 48)          0
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 9, 9, 48)          20784
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 4, 4, 48)          0
_________________________________________________________________
dropout_2 (Dropout)          (None, 4, 4, 48)          0
_________________________________________________________________
flatten_1 (Flatten)          (None, 768)               0
_________________________________________________________________
dense_1 (Dense)              (None, 512)               393728
_________________________________________________________________
dropout_3 (Dropout)          (None, 512)               0
_________________________________________________________________
dense_2 (Dense)              (None, 81)                41553
=================================================================
Total params: 456,545
Trainable params: 456,545
Non-trainable params: 0
_________________________________________________________________
...
Test loss: 3.81980572336
Test accuracy: 0.0834942084942

With this model, your test accuracy goes up to over 8%, which is a solid improvement over your baseline model. Also, note the Trainable params: 456,545 in the output. Recall that your baseline model had over 600,000 trainable parameters. While increasing the accuracy by a factor of three, you also cut the number of weights. This means the credit for the improvement must go to the structure of your new model, not just its size.
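
If you want to check where the 456,545 parameters come from, the following sketch recomputes the Param # column of the summary by hand. The helper functions conv2d_params and dense_params are hypothetical names introduced for this example; they aren't part of Keras.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    # One weight per kernel cell, input channel, and filter,
    # plus one bias per filter.
    return kernel_h * kernel_w * in_channels * filters + filters


def dense_params(inputs, units):
    # One weight per input-unit pair, plus one bias per unit.
    return inputs * units + units


size = 9       # 9x9 board, one input plane
filters = 48

conv1 = conv2d_params(3, 3, 1, filters)
conv2 = conv2d_params(3, 3, filters, filters)

# 2x2 max pooling shrinks 9x9 to 4x4; the leftover row and column are dropped.
pooled = size // 2
flat = pooled * pooled * filters

dense1 = dense_params(flat, 512)
dense2 = dense_params(512, size * size)

total = conv1 + conv2 + dense1 + dense2
print(conv1, conv2, flat, dense1, dense2, total)
# prints: 480 20784 768 393728 41553 456545
```

Dropout, pooling, and flatten layers have no weights, which is why they show 0 in the summary. Most of the parameters sit in the first dense layer.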

On the negative side, the training took a lot longer, in large part because you increased the number of epochs. This model is learning more-complicated concepts, and it needs more training passes. If you have the patience to set epochs even higher, you can pick up a few more percentage points of accuracy with this model. Chapter 7 introduces advanced optimizers that can speed up this process.

Next, let’s feed the example board to the model and see what moves it recommends:

0.000 0.001 0.001 0.002 0.001 0.001 0.000 0.000 0.000
0.001 0.006 0.011 0.023 0.017 0.010 0.005 0.002 0.000
0.001 0.011 0.001 0.052 0.037 0.026 0.001 0.003 0.001
0.002 0.020 0.035 0.045 0.043 0.030 0.014 0.006 0.001
0.003 0.020 0.030 0.031 0.039 0.039 0.018 0.007 0.001
0.001 0.021 0.033 0.048 0.050 0.032 0.017 0.006 0.001
0.001 0.010 0.001 0.039 0.035 0.022 0.001 0.004 0.001
0.000 0.006 0.008 0.017 0.017 0.010 0.007 0.002 0.000
0.000 0.000 0.001 0.001 0.002 0.001 0.001 0.000 0.000

The highest-rated move on the board has a score of 0.052 and maps to point A in figure 6.4, where black captures the two white stones. Your model may not be a master tactician yet, but it has definitely learned something about capturing stones! Of course, the results are far from perfect: the model still gives high scores to many points that already have stones on them.
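
As a small sketch of how you might read such an output programmatically, the following code finds the highest-scoring point in the matrix printed above. It also shows one possible way to ignore occupied points by zeroing their scores before taking the maximum. The function best_point and the board array are assumptions made for this example, and the single stone placed on the board is hypothetical, not the position from figure 6.4.

```python
import numpy as np

# Move probabilities copied from the output above.
move_probs = np.array([
    [0.000, 0.001, 0.001, 0.002, 0.001, 0.001, 0.000, 0.000, 0.000],
    [0.001, 0.006, 0.011, 0.023, 0.017, 0.010, 0.005, 0.002, 0.000],
    [0.001, 0.011, 0.001, 0.052, 0.037, 0.026, 0.001, 0.003, 0.001],
    [0.002, 0.020, 0.035, 0.045, 0.043, 0.030, 0.014, 0.006, 0.001],
    [0.003, 0.020, 0.030, 0.031, 0.039, 0.039, 0.018, 0.007, 0.001],
    [0.001, 0.021, 0.033, 0.048, 0.050, 0.032, 0.017, 0.006, 0.001],
    [0.001, 0.010, 0.001, 0.039, 0.035, 0.022, 0.001, 0.004, 0.001],
    [0.000, 0.006, 0.008, 0.017, 0.017, 0.010, 0.007, 0.002, 0.000],
    [0.000, 0.000, 0.001, 0.001, 0.002, 0.001, 0.001, 0.000, 0.000],
])


def best_point(probs, board=None):
    """Return the 0-based (row, col) of the highest-scoring point.

    If board is given, points where board != 0 (occupied) are ignored.
    """
    scores = probs.copy()
    if board is not None:
        scores[board != 0] = 0.0
    row, col = np.unravel_index(np.argmax(scores), scores.shape)
    return int(row), int(col)


print(best_point(move_probs))         # prints: (2, 3)

# Hypothetical board: 0 = empty, nonzero = stone.
board = np.zeros((9, 9), dtype=int)
board[2, 3] = 1                       # pretend the top-rated point is taken
print(best_point(move_probs, board))  # prints: (5, 4)
```

Filtering out occupied points only removes the most obvious illegal moves. A real bot would check full move legality with the game rules from chapter 3.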

At this point, we encourage you to experiment with the model and see what happens. Here are a few ideas to get you started:

  • Which is most effective on this problem: max pooling, average pooling, or no pooling? (Remember that removing the pooling layer increases the number of trainable parameters in the model, so if you see any extra accuracy, keep in mind that you’re paying for it with extra computation.)
  • Is it more effective to add a third convolutional layer, or to increase the number of filters on the two layers that are already there?
  • How small can you make the second-to-last Dense layer and still get a good result?
  • Can you improve the result by changing the dropout rate?
  • How accurate can you make the model without using convolutional layers? How do the size and training time of that model compare to your best results with a CNN?
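
Before you try the pooling experiment, it may help to see what the two pooling variants compute on a single 9 × 9 feature map. The function pool2d below is a hypothetical NumPy helper written for this illustration. It mimics non-overlapping 2 × 2 pooling that drops the leftover row and column, and it is not the Keras implementation.

```python
import numpy as np


def pool2d(feature_map, pool=2, mode='max'):
    """Non-overlapping pooling over a 2D array (illustrative only)."""
    h, w = feature_map.shape
    out_h, out_w = h // pool, w // pool
    # Drop trailing rows/columns that don't fill a complete window.
    trimmed = feature_map[:out_h * pool, :out_w * pool]
    # Split into (out_h, pool, out_w, pool) so each window has its own axes.
    windows = trimmed.reshape(out_h, pool, out_w, pool)
    if mode == 'max':
        return windows.max(axis=(1, 3))
    if mode == 'average':
        return windows.mean(axis=(1, 3))
    raise ValueError('mode must be "max" or "average"')


# Hypothetical feature map with values 0..80, just to make results easy to read.
feature_map = np.arange(81, dtype=float).reshape(9, 9)

max_pooled = pool2d(feature_map, mode='max')
avg_pooled = pool2d(feature_map, mode='average')

print(max_pooled.shape, avg_pooled.shape)
print(max_pooled[0, 0], avg_pooled[0, 0])
```

Both variants shrink each 9 × 9 map to 4 × 4. With 48 filters, that's the difference between 768 and 3,888 values going into the first dense layer, which is where the extra parameters come from when you remove pooling.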

In the next chapter, you’ll apply all the techniques you learned here to build a deep-learning Go bot that’s trained on actual game data, not just simulated games. You’ll also see new ways to encode the inputs, which will improve model performance. With these techniques combined, you can build a bot that makes reasonable moves and can at least beat beginner Go players.

6.8. Summary

  • With encoders, you can transform Go board states into inputs for neural networks, which is an important first step toward applying deep learning to Go.
  • Generating Go data with tree search gives you a first Go data set to apply neural networks to.
  • Keras is a powerful deep-learning library with which you can create many relevant deep-learning architectures.
  • Using convolutional neural networks, you can leverage the spatial structure of input data to extract relevant features.
  • With pooling layers, you can shrink the output of convolutional layers, which reduces computational complexity.
  • Using softmax activations in the last layer of your network, you can predict output probabilities.
  • Working with categorical cross-entropy as a loss function is a more natural choice for Go move-prediction networks than mean squared error. Mean squared error is more useful when you’re trying to predict numbers in a continuous range.
  • With dropout layers, you have a simple tool to avoid overfitting for deep network architectures.
  • Using rectified linear units instead of sigmoid activation functions can bring a significant performance boost.