AlphaGo: Defeating the Best Go Players


In the Introduction, we discussed the Go match between Lee Sedol and AlphaGo. DeepMind developed AlphaGo with the goal of playing Go against the Grandmasters. October 2015 was the first time that software beat a professional human player at Go, a game with around 10^170 possible positions, more than the number of possible moves in chess or even the total number of atoms in the universe (around 10^80). In fact, if every atom in the universe were a universe itself, there would still be fewer atoms than the number of positions in a Go game.

In many countries such as South Korea and China, Go is considered a national game, like football and basketball are in the US, and these countries have many professional Go players, who train from the age of 6.* If these players show promise in the game, they switch from a normal school to a special Go school where they play and study Go for 12 hours a day, 7 days a week. They live with their Go Master and other prodigy children. So, it is a serious matter for a computer program to challenge these players.

There are around 2,000 professional Go players in the world, along with roughly 40 million casual players. In an interview at the Google Campus,* Hassabis shared, “We knew that Go was much harder than chess.” He describes how he initially thought of building AlphaGo the same way Deep Blue was built, that is, as a system that performed a brute-force search guided by a handcrafted set of rules.

But he realized that this technique would never work, because Go is a deeply contextual game: with such a huge number of possible states, there was no way to write rules that captured how one part of the board affects the others. At the same time, he realized that if he created an algorithm that could beat the Master players at Go, then he would probably have made a significant advance in AI, one more meaningful than Deep Blue.

Go is hard not only because the game has an astronomical number of possibilities. To play well, a program needs to determine the best next move, and to figure that out, it needs to judge whether a position is good or bad. It cannot play out every possibility to the end of the game because there are far too many. Conventional wisdom held that it was impossible to determine the value of a given game state in Go. Doing this is much simpler in chess. For example, you can codify things like pawn structure and piece mobility, which are techniques Grandmasters use to determine if a position is good or bad. In Go, on the other hand, all pieces are the same. Even a single stone can modify the outcome of the game, so each one has a profound impact.
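To make the contrast concrete, here is a minimal sketch of the kind of handcrafted chess evaluation described above, combining a material count with a piece-mobility bonus. The helpers `position.pieces` and `position.legal_moves` are hypothetical and used only for illustration; real engines use far richer hand-tuned features.

```python
# A toy handcrafted chess evaluation: material plus a small mobility bonus.
# Piece values follow the common textbook convention; the 0.1 weight is arbitrary.
PIECE_VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9}

def evaluate(position):
    """Score a position from white's point of view (positive favors white).

    `position.pieces(color)` and `position.legal_moves(color)` are assumed,
    hypothetical helpers, not part of any real chess library."""
    material = (sum(PIECE_VALUES.get(p, 0) for p in position.pieces("white"))
                - sum(PIECE_VALUES.get(p, 0) for p in position.pieces("black")))
    mobility = 0.1 * (len(position.legal_moves("white"))
                      - len(position.legal_moves("black")))
    return material + mobility
```

No comparably simple formula exists for Go, which is exactly why AlphaGo had to learn its evaluation from data instead.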

What makes Go even harder is that it is a constructive game as opposed to a destructive one. In chess, you start with all the pieces on the board and take them away as you play, making the game simpler: the more you play, the fewer possibilities there are for the next moves. Go, however, begins with an empty board, and you add pieces, making it harder to analyze. In chess, if you analyze a complicated middlegame, evaluating the current situation tells you everything you need to know. In Go, analyzing a middlegame requires projecting into the future, which makes it much harder. In reality, Go is more about intuition and instinct than about the kind of calculation that drives chess.

When describing the algorithm that AlphaGo produced, Hassabis said that it does not merely regurgitate human ideas and copy them. It genuinely comes up with original ideas. According to him, Go is an objective art because anyone can come up with an original move, but you can measure if that move or idea was pivotal for winning the game. Even though Go has been played at a professional level for 3,000 years, AlphaGo created new techniques and directly influenced how people played because AlphaGo was strategic and seemed human.

The Workings of AlphaGo

Figure: The Policy Network. Probability distribution over moves. The darker the green squares, the higher the probability.


DeepMind developed AlphaGo mainly with two neural networks.* The first, the Policy Network, calculates the probability of a professional player’s next move given the state of the board. Instead of considering all of the roughly 200 legal moves at each turn, the Policy Network narrows the search to the five to ten most promising ones. By doing that, it reduces the number of positions AlphaGo has to search in order to find the best next move.
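As a rough illustration of what such a network looks like, here is a minimal sketch in PyTorch (the book does not specify a framework, and the real network was much deeper, with dozens of handcrafted input feature planes): a convolutional network that maps an encoded board to a probability for each of the 19 × 19 points.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNetwork(nn.Module):
    """Maps an encoded board state to a probability distribution over moves."""
    def __init__(self, board_size=19, input_planes=48):
        super().__init__()
        # Each input plane encodes one board feature (own stones, opponent stones, ...).
        self.conv1 = nn.Conv2d(input_planes, 192, kernel_size=5, padding=2)
        self.conv2 = nn.Conv2d(192, 192, kernel_size=3, padding=1)
        self.head = nn.Conv2d(192, 1, kernel_size=1)   # one logit per board point
        self.board_size = board_size

    def forward(self, board_planes):
        x = F.relu(self.conv1(board_planes))
        x = F.relu(self.conv2(x))
        logits = self.head(x).flatten(1)               # shape: (batch, 19 * 19)
        return F.softmax(logits, dim=1)

# Keep only the handful of most probable moves, as described above.
net = PolicyNetwork()
board = torch.zeros(1, 48, 19, 19)                     # placeholder board encoding
move_probs = net(board)
top10 = torch.topk(move_probs, k=10)                   # the ten most promising moves
```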

Figure: The Value Network. How the position evaluator sees the board. Darker blue represents places where the next stone leads to a more likely win for the player.

Initially, AlphaGo was trained using around 100,000 Go games from the internet. But once it could narrow down the search for the next move, AlphaGo played against itself millions of times and improved through reinforcement learning. The program learned from its own mistakes: if it won a game, the moves it had played became more likely to be chosen in future games. With that information, it created its own database of millions of games of the system playing against itself.
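The self-play update can be sketched as a simple policy-gradient (REINFORCE-style) step, continuing the PyTorch example above. This is only illustrative of the idea of reinforcing winning moves; DeepMind’s actual training pipeline was considerably more involved.

```python
import torch

def self_play_update(optimizer, move_log_probs, won):
    """One policy-gradient step after a finished self-play game.

    move_log_probs: log-probabilities of the moves the agent actually played.
    won: True if the agent won the game, False otherwise.
    Winning pushes the played moves' probabilities up; losing pushes them down.
    """
    reward = 1.0 if won else -1.0
    loss = -reward * torch.stack(move_log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```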

Then, DeepMind trained a second neural network, the Value Network, which calculates the probability of a player winning based on the state of the board. It outputs 1 for black winning, 0 for white, and 0.5 for a tie. Together, these two networks turned what seemed an intractable problem into a manageable one: they transformed the problem of playing Go into something resembling solving a chess game. The Policy Network provides the ten most probable next moves, and the Value Network scores any board state. Given these two networks, the algorithm that finds the best moves, Monte Carlo Tree Search (a relative of the min-max search described earlier for chess), uses those probabilities to explore the space of possibilities.
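The sketch below shows, in simplified form, how the two networks plug into a Monte Carlo Tree Search. The `play` and `top_moves` helpers are hypothetical placeholders, and the perspective flip between black’s and white’s turns is omitted for brevity; the point is only the division of labor: the Policy Network proposes candidate moves, the Value Network scores leaf positions, and the search decides where to spend its simulations.

```python
import math

class Node:
    def __init__(self, prior):
        self.prior = prior          # P(move) from the Policy Network
        self.visits = 0
        self.value_sum = 0.0
        self.children = {}          # move -> Node

    def mean_value(self):
        return self.value_sum / self.visits if self.visits else 0.0

def puct(parent, child, c=1.5):
    # Favor moves the policy likes that the search has not explored much yet.
    return child.mean_value() + c * child.prior * math.sqrt(parent.visits) / (1 + child.visits)

def search(root_board, policy_net, value_net, simulations=200):
    root = Node(prior=1.0)
    for _ in range(simulations):
        node, board, path = root, root_board, [root]
        # 1. Selection: descend the tree along the highest-scoring moves.
        while node.children:
            parent = node
            move, node = max(node.children.items(), key=lambda kv: puct(parent, kv[1]))
            board = play(board, move)                              # hypothetical helper
            path.append(node)
        # 2. Expansion: the Policy Network proposes a handful of candidate moves.
        for move, prior in top_moves(policy_net, board, k=10):     # hypothetical helper
            node.children[move] = Node(prior)
        # 3. Evaluation: the Value Network scores the leaf instead of playing it out.
        leaf_value = value_net(board)
        # 4. Backup: propagate the result along the path taken.
        for visited in path:
            visited.visits += 1
            visited.value_sum += leaf_value
    # Play the move the search visited most often.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```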

The Match

AlphaGo played the best player in the world, Lee Sedol, a legend of the game who had won 18 international titles. At stake was a $1M prize.

Figure: AlphaGo’s first 99 moves in game 2. AlphaGo plays the black stones.

AlphaGo played unimaginable moves, changing the way people would play Go in the years ahead. The European champion, Fan Hui, told Hassabis that seeing AlphaGo’s innovative plays had freed his mind from the shackles of tradition; he now considered moves he had once thought unthinkable and was no longer constrained by the received wisdom that had bound him for decades.

“‘I was deeply impressed,’ Ke Jie said through an interpreter after the game. Referring to a type of move that involves dividing an opponent’s stones, he added: ‘There was a cut that quite shocked me, because it was a move that would never happen in a human-to-human Go match.’”* Ke Jie, then the number one Go player in the world, stated, “Humanity has played Go for thousands of years, and yet, as AI has shown us, we have not even scratched the surface.” He continued, “The union of human and computer players will usher in a new era. Together, man and AI can find the truth of Go.”*

AlphaGo Zero

The DeepMind team did not stop after winning against the best player in the world. They decided to improve the software by merging the two deep neural networks, the Policy and Value Networks, creating AlphaGo Zero. “Merging these functions into a single neural network made the algorithm both stronger and much more efficient,” said David Silver, the lead researcher on AlphaGo. AlphaGo Zero removed a lot of redundancy between the two neural networks. The probability of adding a piece to a position (the Policy Network) already contains information about who might win the game (the Value Network). If there is only one place where the player can put a piece, it might mean that the player is cornered and has a low chance of winning. Or, if the player can place a piece anywhere on the board, that probably means the player is winning, because it does not matter where the next piece goes.

Because merging the networks removed these redundancies, the resulting neural network was smaller, less complex, and therefore more efficient: it had fewer weights and required less training. “It still required a huge amount of computing power—four of the specialized chips called tensor processing units (TPUs), which Hassabis estimated to be US$25M of hardware. But its predecessors used ten times that number. It also trained itself in days rather than months. The implication is that ‘algorithms matter much more than either computing or data available’, said Silver.”*
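The merged architecture can be sketched as one shared convolutional trunk with two heads, continuing the PyTorch examples above. The real AlphaGo Zero network used deep residual blocks, many more filters, and an extra policy output for passing; this stripped-down version only shows the idea of two heads sharing one body.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadedNet(nn.Module):
    """One shared trunk with a policy head (where to move) and a value head (who wins)."""
    def __init__(self, board_size=19, input_planes=17, filters=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(input_planes, filters, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(filters, filters, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.policy_head = nn.Conv2d(filters, 1, kernel_size=1)
        self.value_head = nn.Linear(filters * board_size * board_size, 1)

    def forward(self, board_planes):
        x = self.trunk(board_planes)
        policy = F.softmax(self.policy_head(x).flatten(1), dim=1)   # move probabilities
        value = torch.tanh(self.value_head(x.flatten(1)))           # -1 (loss) to +1 (win)
        return policy, value
```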

Critics point out that what AlphaGo does is not actually learning, because if you made the agent play the game with the same rules but with the colors of the pieces swapped, the program would be confused and would perform terribly. Humans can still play well after minor changes to a game. But that can be solved by training the software on different games as well as differently colored pieces. DeepMind made its algorithms generalize to other games, like Atari, by using a technique inspired by synaptic consolidation in the brain (the published algorithm is called elastic weight consolidation). Without this method, when an agent first learns to play a game, it saturates its neural connections with the knowledge of how to play that game. Then, when the agent starts learning a variation of the game or a different game altogether, those connections are overwritten in order to learn the second game, producing catastrophic forgetting.

If a simple neural network is trained to play Pong controlling the paddle on the left, then after it learns the game, the agent will always win, with a score of 20 to 0. If you then change the color of its paddle from green to black, the agent is still controlling the same paddle, but it will miss the ball every time and lose 20 to 0: a catastrophic failure. DeepMind solved this by drawing inspiration from how a mouse’s brain, and presumably the human brain, works. In the brain, a process called synaptic consolidation protects the neural connections that form when a particular new skill is learned.

Figure: AlphaGo Zero Elo rating over time. At 0 days, AlphaGo Zero has no prior knowledge of the game and only the basic rules as an input. At 3 days, AlphaGo Zero surpasses the abilities of AlphaGo Lee, the version that beat world champion Lee Sedol in 4 out of 5 games in 2016. At 21 days, AlphaGo Zero reaches the level of AlphaGo Master, the version that defeated 60 top professionals online and world champion Ke Jie in 3 out of 3 games in 2017. At 40 days, AlphaGo Zero surpasses all other versions of AlphaGo and, arguably, becomes the best Go player in the world. It does this entirely from self-play, with no human intervention and using no historical data.

Taking inspiration from biology, DeepMind developed an algorithm that does the same. After it plays a game, it identifies and protects the neural connections that are most important for that game. Whenever the agent starts playing a new game, the spare neurons can be used to learn it, while the knowledge used to play the old game is preserved, eliminating catastrophic forgetting. With this technique, DeepMind showed that its agents could learn to play ten Atari games at a superhuman level without forgetting how to play the earlier ones. The same technique could be used to make AlphaGo learn different variations of the Go game.
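In the published elastic weight consolidation algorithm, “protecting” a connection means penalizing changes to weights that were important for the previous game, with importance estimated from the Fisher information. Below is a minimal sketch of that penalty; the variable names are illustrative, and the computation of the importance estimates is omitted.

```python
import torch

def consolidation_penalty(model, old_params, importance, strength=1000.0):
    """Quadratic penalty that pulls important weights back toward the values
    they had after the previous game, while leaving unimportant weights free.

    old_params: weights saved after training on the previous game.
    importance: per-weight importance estimates (e.g., diagonal Fisher information).
    """
    penalty = 0.0
    for name, param in model.named_parameters():
        penalty = penalty + (importance[name] * (param - old_params[name]) ** 2).sum()
    return (strength / 2.0) * penalty

# When training on a new game (illustrative):
# loss = new_game_loss + consolidation_penalty(model, old_params, importance)
```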

OpenAI

OpenAI, a research institute started by tech billionaires including Elon Musk, Sam Altman, Peter Thiel, and Reid Hoffman, had a new challenge. OpenAI was started to advance artificial intelligence and prevent the technology from becoming dangerous. In 2016, led by CTO and co-founder Greg Brockman, the team searched Twitch, an internet gaming community, for the most popular games that had an interface a software program could interact with and that ran on the Linux operating system. They selected Dota 2. The idea was to build a program that could beat the best human player as well as the best teams in the world—a lofty goal. By a wide margin, Dota 2 would be the hardest game at which AI would beat humans.

At first glance, Dota 2 may look less cerebral than Go and chess because of its orcs and creatures. The game, however, is much more difficult than those strategy games: the board itself is larger, and the number of possible moves is much greater. Not only that, but there are around 110 heroes, and each one has at least four moves. The average number of possible moves per turn is around 20 for chess and about 200 for Go. Dota 2 has, on average, 1,000 possible moves for every eighth of a second, and the average match lasts around 45 minutes. Dota 2 is no joke.
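A quick back-of-envelope calculation, using only the figures quoted above, shows why that branching factor is so daunting:

```python
# Rough arithmetic based on the numbers in the text (not exact game statistics).
moves_per_decision = 1_000        # possible moves at each decision point
decisions_per_second = 8          # one decision every eighth of a second
match_seconds = 45 * 60           # a 45-minute match

decision_points = decisions_per_second * match_seconds
print(decision_points)            # 21600 decision points per match
print("vs. roughly 200 options per move in Go and 20 in chess")
```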
