AlphaZero, a chess, Shogi, and Go playing entity by Google DeepMind, based on a general reinforcement learning algorithm of the same name. On December 5, 2017, the DeepMind team around David Silver, Thomas Hubert, and Julian Schrittwieser, along with former Giraffe author Matthew Lai, reported on their generalized algorithm, combining deep learning with Monte-Carlo Tree Search (MCTS) [1].
A 100-game match versus Stockfish 8, running with 64 threads and a transposition table size of 1 GiB, was won by AlphaZero with +28 =72 -0, playing on a single machine with 4 tensor processing units (TPUs). Despite AlphaZero's possible hardware advantage and the criticized playing conditions [2], this is nonetheless a tremendous achievement.
Description
Starting from random play, and given no domain knowledge except the game rules, AlphaZero achieved a superhuman level of play in chess and Shogi as well as in Go. The algorithm is a more generic version of the AlphaGo Zero algorithm that was first introduced in the domain of Go [5]. AlphaZero evaluates positions using non-linear function approximation based on a deep neural network, rather than the linear function approximation used in classical chess programs. This neural network takes the board position as input and outputs a vector of move probabilities. The MCTS consists of a series of simulated games of self-play whose move selection is controlled by the neural network. The search returns a vector representing a probability distribution over moves, and the move is chosen either proportionally or greedily with respect to the visit counts at the root state.
Network Architecture
The network is a deep residual convolutional neural network [6] [7] with many layers of spatial NxN planes - 8x8 board arrays for chess. The input describes the chess position from the side to move's point of view - that is, color-flipped for black to move. Each square cell consists of 12 piece-type and color bits, e.g. from the current bitboard board definition, and - to address graph history and path-dependency - times eight, that is, up to seven predecessor positions as well, so that en passant, immediate repetitions, and some sense of progress are implicit. Additional inputs, repeated inside each square cell to conform to the convolutional net, cover castling rights, the halfmove clock, the total move count, and the side to move.
The deep hidden layers connect the pieces on different squares to each other through consecutive 3x3 convolutions, where a cell of a layer is connected to the corresponding 3x3 receptive field of the previous layer, so that after 4 layers each square is connected to every other cell in the original input layer [8]. The output of the neural network is finally represented as an 8x8 board array as well, with up to 73 target square possibilities for every origin square (NRayDirs x MaxRayLength + NKnightDirs + NPawnDirs * NMinorPromotions), encoding a probability distribution over 64x73 = 4,672 possible moves, where illegal moves are masked out by setting their probabilities to zero and re-normalising the probabilities of the remaining moves.
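The masking and re-normalisation step over the 64x73 policy output can be sketched as follows. This is a minimal illustration, not DeepMind's implementation; the flat index layout (square * 73 + plane) and the function name are assumptions for the example.

```python
import numpy as np

def masked_policy(logits_8x8x73, legal_moves):
    """Mask illegal moves in the 8x8x73 policy output and re-normalise.

    logits_8x8x73 : raw network output, shape (8, 8, 73)
    legal_moves   : iterable of (from_square, plane) index pairs mapping
                    each legal move to its output cell (layout assumed here)
    """
    flat = logits_8x8x73.reshape(-1)          # 4,672 raw outputs
    exp = np.exp(flat - flat.max())           # softmax over all outputs
    soft = exp / exp.sum()
    probs = np.zeros(8 * 8 * 73)
    for from_sq, plane in legal_moves:        # keep only legal entries
        idx = from_sq * 73 + plane
        probs[idx] = soft[idx]
    return probs / probs.sum()                # re-normalise over legal moves
```

All probability mass ends up on the legal moves, so the result is a valid distribution over exactly the moves available in the position.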
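The final move selection from the root visit counts described above - proportional during self-play, greedy in competition play - can be sketched like this; the function and the temperature parameter are illustrative, not taken from the paper.

```python
import numpy as np

def select_move(visit_counts, greedy=False, temperature=1.0):
    """Pick a move index from MCTS root visit counts.

    Greedy selection takes the most-visited move; proportional selection
    samples with probability proportional to the visit counts, optionally
    sharpened by a temperature.
    """
    counts = np.asarray(visit_counts, dtype=np.float64)
    if greedy:
        return int(counts.argmax())           # deterministic, for match play
    weights = counts ** (1.0 / temperature)   # temperature < 1 sharpens
    probs = weights / weights.sum()
    return int(np.random.choice(len(probs), p=probs))
```

Sampling proportionally to visit counts keeps exploration in the self-play games that generate training data, while the greedy choice maximises playing strength.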
Training
AlphaZero was trained in 700,000 steps, or mini-batches of size 4,096 each, starting from randomly initialized parameters, using 5,000 first-generation TPUs [9] to generate self-play games and 64 second-generation TPUs [10] [11] [12] to train the neural networks [13].
See also
Publications
Forum Posts
2017
Re: AlphaZero is not like other chess programs by Rein Halbersma, CCC, December 09, 2017
2018
External Links
GitHub - suragnair/alpha-zero-general: A clean and simple implementation of a self-play learning algorithm based on AlphaGo Zero (any game, any framework!)
Reports
Stockfish Match
Misc
lineup: Irmin Schmidt, Michael Karoli, Holger Czukay, Damo Suzuki, Jaki Liebezeit
References