On October 18, DeepMind published an article in the journal Nature about the latest achievements of AlphaGo. The new version of the program is called Zero, because it was trained from scratch, without any human data beyond the rules of Go itself. The previous version, which beat human champions, was first trained with supervised learning and only then with reinforcement learning: it began by studying human games, learning to predict human moves, and only afterward improved by playing against versions of itself. AlphaGo Zero became its own teacher: its neural network was trained to predict AlphaGo's own move choices and the winner of its own games.
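To make the contrast concrete, the supervised stage of the older pipeline amounted to training a classifier to predict which move a human would play in a given position. Below is a minimal, purely illustrative sketch of that idea: a tiny softmax classifier fit to made-up (position, human move) pairs. The feature vectors, move count, and "game records" are all invented for the example and bear no relation to DeepMind's actual networks.

```python
import math, random

random.seed(0)

MOVES = 3   # toy: 3 candidate moves per position (real Go has up to 361)
FEATS = 4   # toy feature vector describing a position

# hypothetical "human game records": (position features, move a human chose);
# entirely made up for this illustration
games = [
    ([1, 0, 0, 1], 0),
    ([0, 1, 1, 0], 1),
    ([1, 1, 0, 0], 0),
    ([0, 0, 1, 1], 2),
]

W = [[0.0] * FEATS for _ in range(MOVES)]  # one weight row per move

def predict(x):
    """Softmax over moves: probability a human would play each move."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    z = max(logits)
    exps = [math.exp(l - z) for l in logits]
    return [e / sum(exps) for e in exps]

def train(epochs=1000, lr=0.5):
    for _ in range(epochs):
        for x, move in games:
            p = predict(x)
            for k in range(MOVES):
                grad = p[k] - (1.0 if k == move else 0.0)  # cross-entropy gradient
                for i in range(FEATS):
                    W[k][i] -= lr * grad * x[i]

train()
# after training, the classifier should reproduce the "human" move
# on each of the recorded positions
```

Zero skips exactly this stage: no human move is ever used as a training target.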
The creators of the program claim, with good reason, that Zero is currently the strongest Go player in history.
Previous versions of AlphaGo first learned to play Go from thousands of human games, from amateur to professional level. Zero was free of this human bias: it skipped that stage and learned by playing against itself, starting from completely random moves. The program soon surpassed human level and defeated the champion version.
But removing the influence of human experience is not the only change. The official site mentions a new form of reinforcement learning, the details of which are not fully disclosed. What is clear is that a neural network is combined with a powerful search algorithm. During self-play games, the network's weights are adjusted and updated; the updated network is then recombined with the search algorithm to produce a stronger version of AlphaGo Zero. Iteration after iteration, the system develops, and its level of play grows with it.
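This loop, alternating self-play guided by the current evaluator with updates to that evaluator, can be illustrated with a toy stand-in. In the sketch below, everything is invented for illustration: a lookup table replaces the deep network, one-step greedy lookahead replaces the Monte Carlo tree search, and Go is replaced by a trivial counting game (players alternately add 1 or 2; whoever reaches exactly 10 wins).

```python
import random

random.seed(0)

TARGET = 10          # toy game: players alternately add 1 or 2;
MOVES = (1, 2)       # whoever reaches exactly TARGET wins (not Go!)

# tabular "value network": state -> estimated win probability
# for the player about to move (stand-in for the deep network)
V = {}

def value(state):
    return V.get(state, 0.5)

def legal_moves(state):
    return [m for m in MOVES if state + m <= TARGET]

def choose_move(state, explore=0.1):
    """Greedy one-step lookahead (crude stand-in for tree search)."""
    if random.random() < explore:
        return random.choice(legal_moves(state))
    # pick the move that leaves the opponent the worst position
    return min(legal_moves(state), key=lambda m: value(state + m))

def self_play_game():
    """One game against itself; returns (state, outcome-for-mover) pairs."""
    history, state, player = [], 0, 0
    while state < TARGET:
        history.append((state, player))
        state += choose_move(state)
        player ^= 1
    winner = player ^ 1  # the player who made the final move wins
    return [(s, 1.0 if p == winner else 0.0) for s, p in history]

def train(games=2000, lr=0.2):
    """Zero-style loop: self-play with the current evaluator, then update it."""
    for _ in range(games):
        for state, outcome in self_play_game():
            V[state] = value(state) + lr * (outcome - value(state))

train()
# the learned table now reflects this game's known theory:
# states 1, 4 and 7 are lost for the player to move; 8 and 9 are won
```

The point of the sketch is only the data flow: play against yourself with the current evaluator, use the game outcomes to improve the evaluator, repeat.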
After this rather vague description, the authors again stress that the main advantage of the new method is that AlphaGo is no longer constrained by the limits of human knowledge. Instead, it can learn from scratch by playing against the strongest player in the world: AlphaGo itself.
Several other differences are also mentioned:
- Zero takes only the black and white stones on the board as input, while previous versions were also fed a small number of hand-crafted features.
- Previous versions used two separate networks: a “policy network” (to select the next move) and a “value network” (to predict the likely winner from each position). In Zero they are combined into a single network, which makes training more efficient.
- AlphaGo Zero also no longer uses “rollouts”: fast, largely random playouts to the end of the game, used to estimate which player will win from the current position. Instead, it relies on the high quality of its value network.
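The second point, merging the policy and value networks, means a single shared trunk with two output heads: one producing move probabilities, the other a scalar evaluation of the position. Here is a hypothetical pure-Python forward pass showing that shape. The real network is a deep residual convolutional net over 19x19 board planes, so every size and weight below is an assumption made for illustration only.

```python
import math, random

random.seed(1)

BOARD = 9      # toy 3x3 "board", flattened (real Go is 19x19)
HIDDEN = 16    # made-up hidden width

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

# one shared trunk plus two output heads (policy and value)
W_trunk  = rand_matrix(HIDDEN, BOARD)
W_policy = rand_matrix(BOARD, HIDDEN)
W_value  = rand_matrix(1, HIDDEN)

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(board):
    """One forward pass: board -> (move probabilities, value in [-1, 1])."""
    h = [max(0.0, a) for a in matvec(W_trunk, board)]   # shared ReLU trunk
    logits = matvec(W_policy, h)                        # policy head
    z = max(logits)
    exps = [math.exp(l - z) for l in logits]
    policy = [e / sum(exps) for e in exps]              # softmax over moves
    value = math.tanh(matvec(W_value, h)[0])            # value head
    return policy, value

board = [0.0] * BOARD            # +1 black stone, -1 white stone, 0 empty
board[0], board[4] = 1.0, -1.0   # an arbitrary position
policy, value = forward(board)
```

Because both heads share the trunk, every training signal (move choices and game outcomes alike) shapes the same internal representation, which is one plausible reading of why the authors call the combined network more efficient to train.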
All these changes, according to the authors, improved the system's performance, power, and efficiency, while also making it more general. If the system can learn on its own from scratch, it can in principle be “transplanted” from Go to other domains of human knowledge. DeepMind has long stated that its mission is to create general-purpose artificial intelligence: a single system that could solve a variety of tasks out of the box.
An important discovery is that AlphaGo not only learned to play like humans, but developed fundamentally new and extremely effective approaches to Go: strategies that people, having played the game for thousands of years, never found. Not only did it master in a short time the knowledge that took humanity millennia to accumulate, it also produced fundamentally new knowledge. And since the system has shown such effectiveness in a task as complex as Go, its creators see the next step as finding applications for it in other fields.