List of Figures

Chapter 1. Toward deep learning: a machine-learning introduction

Figure 1.1. The standard programming paradigm that most software developers are familiar with. The developer identifies the algorithm and implements the code; the users supply the data.

Figure 1.2. The machine-learning paradigm: during development, you generate an algorithm from a data set, and then incorporate that into your final application.

Figure 1.3. A simple example data set. Each point on the graph represents a soccer player’s height and weight. Your goal is to fit a model to these points.

Figure 1.4. First you note that your data set roughly follows a linear trend, then you find the formula for a specific line that fits the data.

Figure 1.5. Machine-learning algorithms operate on mathematical structures, such as vectors and matrices. Your photo tags are stored in a standard computer data structure: a list of strings. This is one possible scheme for encoding that list as a mathematical vector.
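
To make the encoding idea concrete, here is a minimal Python sketch that turns a list of tag strings into a binary indicator vector over a small, hypothetical tag vocabulary. The vocabulary and the indicator scheme are illustrative assumptions; figure 1.5 may use a different encoding.

    import numpy as np

    # Hypothetical, fixed tag vocabulary; each tag gets one slot in the vector.
    VOCAB = ["beach", "dog", "family", "sunset", "city"]

    def encode_tags(tags):
        """Encode a list of tag strings as a 0/1 vector over VOCAB."""
        vector = np.zeros(len(VOCAB))
        for tag in tags:
            if tag in VOCAB:
                vector[VOCAB.index(tag)] = 1.0
        return vector

    print(encode_tags(["dog", "beach"]))   # [1. 1. 0. 0. 0.]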

Figure 1.6. A machine-learning pipeline for supervised learning

Figure 1.7. An unsupervised machine-learning pipeline for finding clusters or chunks of chess pieces

Figure 1.8. In reinforcement learning, agents learn to interact with their environment by trial and error. You repeatedly have your agent attempt its task to get a supervised signal to learn from. With every cycle, you can make an incremental improvement.

Figure 1.9. Deep learning and representation learning

Chapter 2. Go as a machine-learning problem

Figure 2.1. A standard 19 × 19 Go board. The intersections marked with dots are the star points, which serve only as reference points for the players. Stones go on the intersections.

Figure 2.2. The three black stones are connected. They have four liberties on the points marked with squares. White can capture the black stones by placing white stones on all the liberties.

Figure 2.3. The white stones on the left can never be captured: black can play at neither A nor B. A black stone on either point would have no liberties, so the play would be illegal. On the other hand, black can play at C to capture five white stones.

Figure 2.4. The final position of a 9 × 9 game. Dead stones are marked with an ×. Black territory is marked with a triangle. White territory is marked with a square.

Figure 2.5. An illustration of the ko rule. First, black captures one white stone. White would like to capture back by playing at A, but that would revert the board to the previous position. The ko rule forbids such a play, in order to prevent an endless cycle of capture and recapture. White must play somewhere else on the board first. After that, the overall board position is new, so white can come back to capture at A later—assuming black doesn’t protect it first.

Chapter 3. Implementing your first Go bot

Figure 3.1. In this Go game, black has three strings of stones, and white has two. The large white string has six liberties, and the single white stone has only three.

Figure 3.2. Black can capture a white stone, thereby regaining a liberty for each of the Go strings adjacent to the captured stone.

Figure 3.3. In this Go board state, the three black stones have one liberty left; namely, the marked point. You enforce the self-capture rule, so black isn’t allowed to play there. White, on the other hand, can capture the three black stones by playing on the marked point.

Figure 3.4. In this situation, the marked point is a capture for black, not suicide, because black will capture two white stones and thereby regain two liberties.

Figure 3.5. White wants to capture the two black stones in this situation, but white’s stone has only one liberty left.

Figure 3.6. Continuing, black tries to rescue its two stones by capturing the isolated white stone.

Figure 3.7. In this situation, white can just snap back (retake the three black stones) without violating the ko rule.

Figure 3.8. White has two eyes in the corner, at A and B, and shouldn’t place a stone at either point. Otherwise, black can capture the whole group. Your naive bot won’t be allowed to fill its own eye.

Figure 3.9. Using Zobrist hashes to encode moves and efficiently store game state
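
As background for figure 3.9, here is a minimal sketch of the Zobrist-hashing idea, assuming a 19 × 19 board: each (point, color) pair is assigned a fixed random 64-bit code, and the hash of a position is the XOR of the codes of all stones on the board. Because XOR is its own inverse, placing or removing a stone is a single XOR. The names below are illustrative, not the book's dlgo implementation.

    import random

    BOARD_SIZE = 19
    random.seed(42)  # fixed seed so the code table is reproducible

    # One random 64-bit code per (point, color) combination.
    HASH_CODES = {
        (row, col, color): random.getrandbits(64)
        for row in range(BOARD_SIZE)
        for col in range(BOARD_SIZE)
        for color in ("black", "white")
    }

    EMPTY_BOARD = 0  # hash of the empty board

    def apply_stone(current_hash, row, col, color):
        """XOR a stone's code in (or out -- XOR is its own inverse)."""
        return current_hash ^ HASH_CODES[(row, col, color)]

    h = apply_stone(EMPTY_BOARD, 3, 3, "black")
    h = apply_stone(h, 15, 15, "white")
    # Removing the black stone restores the hash of the board with only the white stone.
    h = apply_stone(h, 3, 3, "black")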

Chapter 4. Playing games with tree search

Figure 4.1. What move should × make next? This is an easy one: playing in the lower-right corner wins the game.

Figure 4.2. An illustration of an algorithm to find the winning move. You start with the position at the top. You loop over every possible move and calculate the game state that would result if you played that move. Then you check whether that hypothetical game state is a winning position for ×.

Figure 4.3. What move should O make next? If O plays in the lower left, you must assume that × will follow up in the lower right to win the game. O must find the only move that prevents this.

Figure 4.4. What move should × make? If × plays in the center, there will be two ways to complete three in a row: (1) top middle and (2) lower right. O can block only one of these options, so × is guaranteed a win.

Figure 4.5. A tic-tac-toe game tree. In the top position, it’s ×’s turn. If × plays in the top center, O can guarantee a win. If × plays in the left center, × will win. If × plays in the right center, O can force a draw. Therefore, × will choose to play in the left center.

Figure 4.6. The width of a tic-tac-toe game tree: the maximum width is 9, because you have 9 possible options on the first move. But the number of legal moves decreases on each turn, so the average width is 4 or 5 moves.

Figure 4.7. The depth of a tic-tac-toe game tree: the maximum depth is 9 moves; after that, the board is full.

Figure 4.8. Pruning can quickly shrink a game tree. This tree has width 4 and height 3, for a total of 64 leaves to examine. Suppose you find a way to eliminate 1 of the 4 possible options on each turn. Then you end up with just 27 leaves to visit.

Figure 4.9. A partial Go game tree. Here you search the tree to a depth of 2 moves ahead. At that point, you look at the number of captures to evaluate the board. If black chooses the rightmost branch, white can capture a stone, yielding an evaluation of –1 for black. If black chooses the center branch, the black stone is safe (for now). That branch is evaluated at a score of 0. Therefore, black chooses the center branch.

Figure 4.10. The black player is considering playing at the point marked with a square. If black plays there, white can respond at A to capture four stones. That result is so bad for black that you can immediately reject playing on the square; there’s no need for black to consider other white responses, such as B.

Figure 4.11. Step 1 in alpha-beta pruning: you fully evaluate the first possible black move; this move is evaluated at a score of +2 for black. So far, the algorithm is exactly the same as the depth-pruned search covered in the previous section.

Figure 4.12. Step 2 in alpha-beta pruning: you now evaluate the second possible black move. Here, white has a response that captures four stones. That branch evaluates to –4 for black. As soon as you evaluate that white response, you can discard this move for black entirely and skip the other possible white responses. It’s possible that white has an even better reply that you didn’t evaluate, but all you need to know is that playing at B is worse for black than playing at A.

Figure 4.13. An MCTS game tree. The top of the tree represents the current board position; you’re trying to find the next move for black. At this point in the algorithm, you’ve performed 70 random rollouts from various possible positions. Each node tracks statistics on all the rollouts that started from any of its children.

Figure 4.14. Adding a new node to an MCTS tree. Here you select the leftmost branch as the place to insert a new node. You then randomly select the next move from the position to create the new node in the tree.

Figure 4.15. Updating MCTS after a new rollout. In this scenario, the rollout is a win for white. You add that win into the new tree node and all its parent nodes.

Figure 4.16. The exploitation-exploration trade-off. In both game trees, you’ve visited seven board positions. On top, the search is biased more toward exploitation: the tree is deeper for the most promising moves. On the bottom, the search is biased more toward exploration: it tries more moves, but to less depth.

Figure 4.17. An example of a local tactical pattern. When you see the shape on the left, you should consider the response on the right. A policy that follows tactical patterns like this one won’t be especially strong, but will be much stronger than choosing moves completely at random.

Chapter 5. Getting started with neural networks

Figure 5.1. A few samples from the MNIST data set of handwritten digits, a well-studied benchmark in the field of optical character recognition

Figure 5.2. This is what an average handwritten 8 from the MNIST training set looks like. Averaging many hundreds of images usually results in an unrecognizable blob, but this average 8 still looks very much like an 8.

Figure 5.3. Plot of the sigmoid function. The sigmoid maps real values to the range [0, 1]. Around 0, the slope is rather steep, and the curve flattens out for both small and large values.
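
For reference, a minimal NumPy sketch of the sigmoid function plotted in figure 5.3; the printed values are approximate.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    print(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
    # ~[4.5e-05 0.269 0.5 0.731 0.99995] -- every value falls in (0, 1)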

Figure 5.4. An example of logistic regression, mapping an input vector x of length 4 to an output value y between 0 and 1. The schematic indicates how the output y depends on all four values in the input vector x.

Figure 5.5. In this simple network, four input neurons are connected to two output neurons by first multiplying by a two-by-four matrix, adding a two-dimensional bias term, and then applying the sigmoid function component-wise.

Figure 5.6. An artificial neural network with two layers. The input neurons x connect to an intermediate set of units z, which themselves connect to the output neurons y.

Figure 5.7. A neural network with three layers. When defining a neural network, you aren't limited in the number of layers or the number of neurons per layer.

Figure 5.8. An example of a loss function for two-dimensional input (a loss surface). This surface has a minimum around the dark area in the lower right, which can be found by setting the derivative of the loss function to zero and solving.

Figure 5.9. Iteratively following the gradients of a loss function will eventually lead you to a minimum.

Figure 5.10. Stochastic gradients are less precise, so when following them on a loss surface, you might take a few detours before closing in on a local minimum.

Figure 5.11. Forward and backward passes in a two-layer feed-forward neural network with sigmoid activations and MSE loss function

Figure 5.12. Class diagram for your Python implementation of a feed-forward network. A SequentialNetwork contains several Layer instances. Each Layer implements a mathematical function and its derivative. The forward and backward methods implement the forward and backward pass, respectively. A Loss instance calculates your loss function, the error between your prediction and your training data.

Chapter 6. Designing a neural network for Go data

Figure 6.1. How to predict the next move in a game of Go by using deep learning

Figure 6.2. An illustration of the Encoder class. It takes your GameState class and translates it into a mathematical form—a NumPy array.

Figure 6.3. A neural network can predict game moves. Having already encoded the game state as a matrix, you can feed that matrix to the move-prediction model. The model outputs a vector representing the probability of each possible move.

Figure 6.4. An example game position for testing your model. In this position, black can capture two stones by playing at A, or white can capture two stones by playing at B. Whoever plays first in that area has a huge advantage in the game.

Figure 6.5. In a simple convolution, you multiply two matrices of the same size element by element and then sum up all the values.
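
The simple convolution described in figure 6.5 amounts to an element-wise product followed by a sum, as this small NumPy sketch shows (the matrices here are arbitrary examples):

    import numpy as np

    patch = np.array([[1, 2, 0],
                      [0, 1, 3],
                      [2, 0, 1]])
    kernel = np.array([[ 1, 0, -1],
                       [ 1, 0, -1],
                       [ 1, 0, -1]])

    # Multiply element by element, then sum all the values.
    print(np.sum(patch * kernel))  # -1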

Figure 6.6. By passing a convolutional kernel over patches of an input image, you can compute a convolution of the image with the kernel. The kernel chosen in this example is a vertical edge detector.

Figure 6.7. In a convolutional layer, a number of input images are operated on by convolutional kernels to produce a specified number of feature maps.

Figure 6.8. Reducing an 8 × 8 image to a 4 × 4 image by applying a 2 × 2 max pooling kernel
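
A minimal NumPy sketch of the 2 × 2 max pooling in figure 6.8, assuming non-overlapping pooling windows on a toy 8 × 8 array:

    import numpy as np

    image = np.arange(64).reshape(8, 8)  # a toy 8 x 8 "image"

    # 2 x 2 max pooling: view the image as a 4 x 4 grid of 2 x 2 blocks,
    # then take the maximum of each block.
    pooled = image.reshape(4, 2, 4, 2).max(axis=(1, 3))
    print(pooled.shape)  # (4, 4)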

Figure 6.9. Plot of MSE vs. cross-entropy loss for the class labeled as 1. Cross-entropy assigns a higher loss than MSE for every predicted value in the range [0, 1].

Figure 6.10. A dropout layer with a rate of 50% will randomly drop half of the neurons from the computation for each mini-batch of data fed into the network.

Figure 6.11. The ReLU activation function sets negative inputs to 0 and leaves positive inputs unchanged.

Chapter 7. Learning from data: a deep-learning bot

Figure 7.1. Building a deep-learning Go bot, using real-world Go data for training. You can find game records from public Go servers to use for training a bot. In this chapter, you’ll learn how to find those records, transform them into a training set, and train a Keras model to imitate the human players’ decisions.

Figure 7.2. Replaying a game record from an SGF file. The original SGF file encodes game moves with strings such as B[ee]. The Sgf_game class decodes those strings and returns them as Python tuples. You can then apply these moves to your GameState object to reconstruct the game, as shown in the following listing.

Figure 7.3. The process_zip function. You iterate over a zip file that contains many SGF files. Each SGF file contains a sequence of game moves; you use those to reconstruct GameState objects. Then you use an Encoder object to convert each game state to a NumPy array.

Chapter 8. Deploying bots in the wild

Figure 8.1. Building a web frontend for your Go bot. The httpfrontend module starts a Flask web server that decodes HTTP requests and passes them to one or more Go-playing agents. In the browser, a client based on the jgoboard library communicates with the server over HTTP.

Figure 8.2. Running a Python web application to play against a Go bot in your browser

Figure 8.3. The training process for a deep-learning Go bot

Figure 8.4. A snapshot of how Pachi and your bot see and evaluate a game between them

Chapter 9. Learning by practice: reinforcement learning

Figure 9.1. The reinforcement-learning cycle. You can implement reinforcement learning in many ways, but the overall process has a common structure. First, a computer program attempts a task repeatedly. The records of these attempts are called experience data. Next, you modify the behavior to imitate the more successful attempts; this process is training. You then periodically evaluate the performance to confirm that the program is improving. Normally, you need to repeat this process for many cycles.

Figure 9.2. A game of 5 × 5 Go translated into the language of reinforcement learning. The agent that you want to train is the black player. It sees a sequence of states (board positions) and chooses actions (legal moves). At the end of an episode (a complete game), it gets a reward to indicate whether it achieved its goal. In this case, black wins the game, so the agent sees a reward of +1.

Figure 9.3. The move-selection process. First you encode a game state as a numerical tensor; then you can pass that tensor to your model to get move probabilities. You sample from all points on the board according to the move probabilities to get an order in which to try the moves.

Chapter 10. Reinforcement learning with policy gradients

Figure 10.1. This graph shows four sample games played by a random agent. The bars indicate how often the agent chose each of the five possible moves in each game. Although the agent used the same policy in all the games, the exact counts vary quite a bit from game to game.

Figure 10.2. This graph shows how a policy evolves under your simplified policy-learning scheme. Over the course of hundreds of games, the agent gradually becomes less likely to choose the worst move (playing a 1). Likewise, the agent gradually becomes more likely to choose the best move (playing a 5). Both curves are wobbly, because the policy sometimes takes a small step in the wrong direction.

Figure 10.3. A flowchart for policy gradient learning. You start with a collection of game records and their outcomes. For each move the agent chose, you want to either increase the probability of that move (if the agent won the game) or decrease the probability (if the agent lost the game). Gradient descent handles the mechanics of updating the policy weights. After a pass of gradient descent, the probabilities will be shifted in the desired direction.

Figure 10.4. This graph shows how a hypothetical objective function may vary with a learnable weight. You want to move the value of θ (the Greek letter theta) from its current location to the minimum. You can think of gradient descent as causing the weight to roll downhill.

Figure 10.5. In this example, the learning rate is too small, and you need many updates before the weight reaches the minimum.

Figure 10.6. Here, the learning rate is too large. The weight overshoots its target. On the next round of learning, the gradient will point in the opposite direction, but the weight is likely to overshoot again. This can cause the weight to bounce back and forth around the true minimum.

Figure 10.7. In this case, the learning rate is so large that the weight jumped all the way to the flat region on the right. The gradient is 0 in that region, so the optimizer no longer has a clue which direction to go. The weight may be stuck there permanently. This is a common problem in neural networks that use rectified linear units.

Figure 10.8. In policy gradient learning, you're trying to estimate the true gradient from a very noisy signal. Sometimes a single estimate will point in the wrong direction. If you move the weight too far in the wrong direction, it can jump all the way from the true minimum in the middle to the local minimum on the left, and it may get stuck there for a while.

Chapter 11. Reinforcement learning with value methods

Figure 11.1. An action-value function takes two inputs: a state (board position) and an action (proposed move). It returns an estimate of the expected return (chance of winning the game) if the agent chooses this action. The action-value function is traditionally called Q in mathematical notation.

Figure 11.2. In a greedy action-value policy, you loop over every possible move and estimate the action value. Then you choose the action with the highest estimated value. (Many legal moves have been omitted to save space.)

Figure 11.3. Flowchart for the ϵ-greedy action-value policy. This policy tries to balance playing the best move with exploring unknown moves. The value of ϵ controls that balance.

Figure 11.4. Setting up training data for deep Q-learning. The top shows how you created training data for the move-prediction networks used in chapters 6 and 7: the board position was the input, and the actual move was the output. The bottom shows the structure of the training data for Q-learning. Both the board position and the chosen move are inputs; the output is the game outcome: 1 for a win and –1 for a loss.

Figure 11.5. The Keras Concatenate layer joins two tensors into one.

Figure 11.6. A move-prediction network, as covered in chapters 6 and 7. Although there are many layers, you can conceptually think of it as two steps. The convolutional layers process the raw stones and organize them into logical groups and tactical shapes. From that representation, the dense layers can choose an action.

Figure 11.7. The tanh (hyperbolic tangent) function, which clamps its value between –1 and 1

Figure 11.8. The two-input neural network described in listing 11.4. The game board goes through several convolutional layers, just like the move-prediction network from chapter 7. The proposed move goes into a separate input. The proposed move is combined with the output of the convolutional layers and passed through another dense layer.

Figure 11.9. An illustration of the argsort function from the NumPy library. argsort takes a vector of values that you want to sort. Instead of sorting the values directly, it returns a vector of indices that will give you the values in sorted order. So the first value of the output vector is the index of the smallest value in the input, and the last value of the output vector is the index of the largest value in the input.
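
A quick NumPy example of the argsort behavior described in figure 11.9:

    import numpy as np

    values = np.array([0.1, 0.7, 0.3])
    order = np.argsort(values)
    print(order)          # [0 2 1] -- indices that put the values in sorted order
    print(values[order])  # [0.1 0.3 0.7]
    print(order[::-1])    # [1 2 0] -- reverse it to rank, e.g., best move first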

Chapter 12. Reinforcement learning with actor-critic methods

Figure 12.1. Estimated values over the course of a hypothetical game. This game lasted 200 moves. In the beginning, the learning agent pulled slightly ahead; then it fell far behind; then it suddenly reversed the game and came out with a win.

Figure 12.2. The advantages for each move in a hypothetical game. The learning agent won the game, so its final reward was 1. The moves that led to the comeback have an advantage close to 2, so they’ll be strongly reinforced during training. The moves near the end of the game, when the outcome was already decided, have an advantage close to 0, so they’ll be nearly ignored during training.

Figure 12.3. A neural network suitable for actor-critic learning for Go. This network has a single input, which takes a representation of the current board position. The network produces two outputs. One output indicates which moves it should play—this is the policy output, or the actor. The other output indicates which player is ahead in the game—this is the value output, or the critic. The critic isn’t used in playing a game but helps the training process.

Figure 12.4. Training setup for actor-critic learning. The neural network has two outputs: one for the policy and one for the value. Each gets its own training target. The policy output is trained against a vector the same size as the board. The cell in the vector corresponding to the chosen move is filled in with the advantage calculated for that move; the rest are zero. The value output is trained against the final outcome of the game.

Chapter 13. AlphaGo: Bringing it all together

Figure 13.1. The legendary shoulder hit that AlphaGo played against Lee Sedol in the second game of their series. This move stunned many professional players.

Figure 13.2. How to train the three neural networks that power the AlphaGo AI. Starting with a collection of human game records, you can train two neural networks to predict the next move: a small, fast network and a large, strong network. You can then further improve the playing strength of the large network through reinforcement learning. The self-play games also provide data to train a value network. AlphaGo then uses all three networks in a tree-search algorithm that can produce incredibly strong game play.

Figure 13.3. AlphaGo encoded many Go tactical concepts directly into its feature planes, including ladders. In the first example, a white stone has just one liberty—meaning black could capture on the next turn. The white player extends the white stone to gain an extra liberty. But black can again reduce the white stones to one liberty. This sequence continues until it hits the edge of the board, where white is captured. On the other hand, if there’s a white stone in the path of the ladder, white may be able to escape capture. AlphaGo included feature planes that indicated whether a ladder would be successful.

Figure 13.4. The supervised training process for AlphaGo’s policy networks is exactly the same as the flow covered in chapters 6 and 7. You replay human game records and reproduce the game states. Each game state is encoded as a tensor (this diagram shows a tensor with only two planes; AlphaGo used 48 planes). The training target is a vector the same size as the board, with a 1 where the human actually played.

Figure 13.5. To evaluate possible board positions, AlphaGo combines two independent evaluations. First, it feeds the board position to its value network, which directly returns an estimated chance of winning. Second, it uses the fast policy network to complete the game from that position and observes who won. The evaluation used in the tree search is a weighted sum of those two parts.

Chapter 14. AlphaGo Zero: Integrating tree search with reinforcement learning

Figure 14.1. Encoding possible moves as a vector. As in previous chapters, AGZ uses a vector where each element maps to a point on the game board. AGZ adds one last element that maps to the pass move. This example is on a 5 × 5 board, so the vector has dimension 26: 25 for the points on the board and 1 for passing.

Figure 14.2. A partial AGZ-style search tree. In this case, it’s black’s turn to move, and the search has explored three possible game states so far. The tree also contains branches that represent moves the search hasn’t yet visited; most of those have been omitted for space.

Figure 14.3. Choosing which branch to follow in the AGZ tree search. In this example, you’re considering three branches from the starting position. (In reality, there would be many more possible moves, which we omitted for space.) To choose a branch, you consider the number of times you’ve already visited the branch, your estimated value for the branch, and the prior probability of the move.

Figure 14.4. Expanding an AGZ-style search tree. First, you calculate a new game state. You create a new node from that game state and add it to the tree. The neural network then gives you a value estimate for that game state. Finally, you update all parents of the new node. You increment the visit count N by one and update the average value V. Here, T represents the total value across all visits through a node; that’s just bookkeeping to make it easy to recalculate the average.

Figure 14.5. Normalizing a vector. During self-play, you track the number of times you visit each move. For training, you must normalize the vector so it sums to 1.
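
A one-line NumPy sketch of the normalization described in figure 14.5, using hypothetical visit counts:

    import numpy as np

    visit_counts = np.array([12, 3, 0, 5], dtype=float)  # hypothetical per-move visit counts
    policy_target = visit_counts / visit_counts.sum()
    print(policy_target)        # [0.6 0.15 0. 0.25]
    print(policy_target.sum())  # 1.0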

Figure 14.6. A residual block. The output of the two inner layers is added to the output of the previous layer. The effect is that the inner layers learn the difference, or residual, between the objective and what the previous layer learned.

Appendix A. Mathematical foundations

Figure A.1. Representing a Go game board with a two-plane tensor. This is a 5 × 5 board. You use one channel for black stones and a separate channel for white stones. So you use a 2 × 5 × 5 tensor to represent the board.

Figure A.2. A function and its derivative. Where the derivative is positive, the function is increasing. Where the derivative is negative, the function is decreasing. When the derivative is exactly zero, the function is at a local minimum or maximum. With this logic, you can use the derivative to find local minima or maxima.

Appendix D. Training and deploying bots by using Amazon Web Services

Figure D.1. Signing up for an AWS account

Figure D.2. Selecting the Elastic Compute Cloud (EC2) service from the Services menu

Figure D.3. Launching a new AWS instance

Figure D.4. Selecting the AWS Marketplace

Figure D.5. Choosing an AMI suited for deep learning

Figure D.6. Pricing for your deep-learning AMI, depending on the instance you choose

Figure D.7. Selecting the right instance type for your needs

Figure D.8. Configuring security groups for your AWS instance

Figure D.9. Creating a new key pair to access your AWS instance

Figure D.10. Creating a new key pair to access your AWS instance

Appendix E. Submitting a bot to the Online Go Server

Figure E.1. Contacting an OGS moderator to activate your bot account

Figure E.2. Checking the profile page of your bot to confirm that it has been activated

Figure E.3. Your bot should now show up as a Computer opponent in the OGS match finder.
