Policy and value network

We can start with solving Go. Go is a simple game that is thousands of years old. It's a two-player game with full information, meaning that two players face each other and there is no hidden knowledge; everything is visible on the board (contrary to, say, card games like poker). At each turn, a player places one of their stones (black for the first player, white for the second) on the board, possibly capturing some of the opponent's stones in the process, and at the end of the game the winner is the player who controls the most territory.

The issue is that the board is quite big, 19 × 19 intersections, meaning that at the very first turn you already have 361 possible moves. Which one leads to winning the game?

For chess, this was solved without neural networks. Only a limited number of moves are possible at each turn, and a fast computer can analyze all of them up to a given depth. At the leaves of this analysis tree, a handcrafted evaluation function (counting material, piece activity, and so on) tells us whether we are closer to winning or on the verge of losing. For Go, this is not possible: the branching factor is far larger, and there is no simple formula to evaluate a position.

Enter deep learning. For Go, we still need to analyze different possible moves, but we won't try to be as exhaustive as in chess; we will instead use Monte Carlo Tree Search (MCTS). This means that instead of expanding every branch, we sample a move at random, play it, and repeat this for several moves in advance; we then assess whether the resulting position is closer to a win or not.
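To make this concrete, here is a minimal sketch of the random playout idea at the heart of MCTS. It assumes a hypothetical `Game` object exposing `legal_moves()`, `play(move)`, `is_over()`, `result()`, and `copy()`; any real Go engine will have a different interface.

```python
import random

def rollout(game, max_depth=200):
    """Play random moves from the current position (a 'playout')
    and return the final result: +1 for a win, -1 for a loss."""
    # `game` is a hypothetical object; see the lead-in above.
    for _ in range(max_depth):
        if game.is_over():
            break
        move = random.choice(game.legal_moves())  # uniform random move
        game.play(move)
    return game.result()

def evaluate_move(game, move, n_playouts=100):
    """Estimate a move's strength by averaging many random playouts."""
    total = 0.0
    for _ in range(n_playouts):
        sim = game.copy()      # simulate on a copy of the position
        sim.play(move)
        total += rollout(sim)
    return total / n_playouts  # average outcome in [-1, +1]
```

Purely random playouts are weak on a 19 × 19 board, which is exactly why the two networks described next are needed to guide the search.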

But as we saw before, we can't measure this in Go, so how do we select a move for the search, and how do we decide whether we are winning or losing? This is why we have two networks. The policy network provides the probabilities for the next move, and the value network provides a single value indicating whether it thinks we are winning or losing.
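The following is a minimal sketch of what these two networks could look like, written here in PyTorch with illustrative layer sizes and an assumed number of input feature planes; the actual AlphaGo networks are much deeper and use different features.

```python
import torch
import torch.nn as nn

BOARD = 19  # 19 x 19 Go board

class PolicyNet(nn.Module):
    """Maps a board encoding to a probability for each of the 361 points."""
    def __init__(self, planes=17):  # number of input planes is an assumption
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(planes, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(64 * BOARD * BOARD, BOARD * BOARD)

    def forward(self, x):
        x = self.conv(x).flatten(1)
        return torch.softmax(self.head(x), dim=1)  # move probabilities

class ValueNet(nn.Module):
    """Maps a board encoding to a single score in [-1, 1]."""
    def __init__(self, planes=17):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(planes, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(64 * BOARD * BOARD, 1)

    def forward(self, x):
        x = self.conv(x).flatten(1)
        return torch.tanh(self.head(x))  # -1 = losing, +1 = winning
```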

Combined, the two networks make it possible to generate a set of candidate moves with their odds of success, and we can play them. At the end of the game, we use the outcome to reinforce both networks.
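Roughly speaking, during the search each candidate move is scored by combining the value estimate with the policy network's prior. The sketch below uses a PUCT-style formula in the spirit of the AlphaGo papers; the variable names and the constant are illustrative choices, not the published implementation.

```python
import math

def puct_score(q_value, prior, parent_visits, child_visits, c_puct=1.5):
    """Score a candidate move: exploit good value estimates (q_value)
    while still exploring moves the policy network considers likely (prior)."""
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q_value + exploration
```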

Later, the new AlphaGo Zero merged these networks into a single network with two output heads, one for the policy and one for the value, and it played far better than the original AlphaGo. So we don't need such a dichotomy for these problems; it is possible to design one architecture that does both.
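A merged architecture in the spirit of AlphaGo Zero might look like the following sketch: one shared trunk feeding two heads. As before, the layer sizes and input planes are illustrative assumptions.

```python
import torch
import torch.nn as nn

BOARD = 19

class PolicyValueNet(nn.Module):
    """One shared trunk with a policy head and a value head."""
    def __init__(self, planes=17):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(planes, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.policy_head = nn.Linear(64 * BOARD * BOARD, BOARD * BOARD)
        self.value_head = nn.Linear(64 * BOARD * BOARD, 1)

    def forward(self, x):
        features = self.trunk(x).flatten(1)
        policy = torch.softmax(self.policy_head(features), dim=1)
        value = torch.tanh(self.value_head(features))
        return policy, value
```

Sharing the trunk means the features learned for predicting good moves also help judge who is winning, which is one reason the merged design works so well.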
