DeepMind: Learning from Experience


Demis Hassabis was a child prodigy in chess, reaching the Master standard at age 13 as the second highest-rated player in the World Under-14 category, and he also "cashed at the World Series of Poker six times including in the Main Event."* In 1994, at age 18, he began his computer games career by co-designing and programming the classic game Theme Park, which sold millions of copies.* He then became the head of AI development for an iconic game called Black & White at Lionhead Studios. Hassabis earned his PhD in cognitive neuroscience from University College London in 2009.

Figure: Demis Hassabis, CEO of DeepMind.

In 2010, Hassabis co-founded DeepMind in London with the mission of “solving intelligence” and then using that intelligence to “solve everything else.” Early in its development, DeepMind focused on algorithms that mastered games, starting with games developed for Atari.* Google acquired DeepMind in 2014 for $525M.

DeepMind Plays Atari

Figure: Breakout game.

To teach the program to play these games, the team at DeepMind developed a new algorithm, Deep Q-Network (DQN), that learned from experience. It started with games like the famous Breakout, taking the screen's video frames as input and producing joystick commands as output. If a command led to an action that scored points, the software reinforced that action, making it more likely to choose the same action the next time it played. This is reinforcement learning, but with a deep neural network estimating the quality of each state-action combination. The network helps determine which action to take given the current state of the game, and over many games the algorithm learns the best action to take at each point.
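To make the idea concrete, below is a minimal sketch of a DQN-style update in PyTorch. It is not DeepMind's code: the state size, network shape, and hyperparameters are placeholder assumptions, and a small fully connected network stands in for the convolutional network DeepMind applied to raw Atari frames.

```python
# Minimal sketch of a DQN-style update (illustrative, not DeepMind's implementation).
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, NUM_ACTIONS, GAMMA = 4, 2, 0.99  # placeholder sizes, not Atari's

# Online Q-network and a periodically synced target network.
q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_ACTIONS))
target_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_ACTIONS))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)  # experience replay buffer

def act(state, epsilon=0.1):
    """Epsilon-greedy: usually pick the highest-value action, sometimes explore."""
    if random.random() < epsilon:
        return random.randrange(NUM_ACTIONS)
    with torch.no_grad():
        return int(q_net(state).argmax())

def train_step(batch_size=32):
    """One gradient step toward the TD target r + gamma * max_a' Q_target(s', a')."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    states, actions, rewards, next_states, dones = map(torch.stack, zip(*batch))
    q_values = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(1).values
        targets = rewards + GAMMA * next_q * (1 - dones)
    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Example: store one fake transition and run a training step.
s = torch.randn(STATE_DIM)
a = act(s)
replay.append((s, torch.tensor(a), torch.tensor(1.0), torch.randn(STATE_DIM), torch.tensor(0.0)))
train_step(batch_size=1)
```

In the real system, scoring in the game provides the reward, and repeated play fills the replay buffer, so actions that earn points gradually receive higher Q-values.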

Figure: Games that DeepMind’s software played on Atari.* The AI performed better than human level at the ones above the line.

For example, in the case of Breakout,* after playing a hundred games, the software was still quite bad and often missed the ball. But it kept playing, and after a few hours—around 300 games—it improved to human level, returning the ball and keeping it in play for long stretches. After a few more hours—around 500 games—it became better than the average human, learning a trick called tunneling: systematically sending the ball to the side walls so that it bounces around above the bricks, requiring less work and earning more reward. The same learning algorithm worked not only on Breakout but on most of the 57 Atari games DeepMind tried, achieving superhuman performance on most of them.

Figure: Montezuma’s Revenge.

The learning algorithm, however, did not perform well on all games. At the bottom of the list, the software scored zero on Montezuma's Revenge. DeepMind's DQN fails at this game because the player needs to understand high-level concepts that people learn throughout their lifetime. For example, if you look at the game, you know that you control the character, that ladders are for climbing, that ropes are for swinging, that keys are probably good, and that the skull is probably bad.

Figure: Montezuma’s Revenge (left) and the teacher and student neural networks (right).

DeepMind improved the system by breaking the problem into simpler tasks. If the software could solve subtasks like "jump across the gap," "get to the ladder," and "get past the skull and pick up the key," then it could solve the game and perform well at it. To attack this problem, DeepMind created two neural networks: a teacher and a student. The teacher is responsible for learning to produce these subproblems and sends them to the student. The student takes actions in the game, trying both to maximize the score and to do what the teacher tells it. Even though the pair was trained on the same data as the old algorithm, plus some additional information, the communication between the teacher and the student allowed a strategy to emerge over time, helping the agent learn how to play the game.
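A rough sketch of that division of labor might look like the following. Everything here—the dimensions, the one-hot subgoal encoding, and the bonus term in the student's reward—is an illustrative assumption, not DeepMind's published architecture.

```python
# Hedged sketch of the teacher/student idea: the teacher proposes a subgoal,
# and the student is rewarded for game score plus progress toward that subgoal.
import torch
import torch.nn as nn

STATE_DIM, NUM_SUBGOALS, NUM_ACTIONS = 128, 8, 18  # illustrative sizes only

# Teacher: looks at the game state and picks a subgoal such as "get to the ladder".
teacher = nn.Sequential(nn.Linear(STATE_DIM, 256), nn.ReLU(), nn.Linear(256, NUM_SUBGOALS))

# Student: conditions on both the state and the teacher's subgoal to choose an action.
student = nn.Sequential(
    nn.Linear(STATE_DIM + NUM_SUBGOALS, 256), nn.ReLU(), nn.Linear(256, NUM_ACTIONS)
)

def student_reward(env_reward, reached_subgoal, bonus=0.5):
    """Combined signal: the game score plus a bonus when the teacher's subgoal is met."""
    return env_reward + (bonus if reached_subgoal else 0.0)

# One step: the teacher emits a subgoal, and the student acts conditioned on it.
state = torch.randn(STATE_DIM)
subgoal = torch.zeros(NUM_SUBGOALS)
subgoal[teacher(state).argmax()] = 1.0            # one-hot subgoal chosen by the teacher
action = int(student(torch.cat([state, subgoal])).argmax())
reward = student_reward(env_reward=0.0, reached_subgoal=False)
```

The key design choice is that the student's learning signal no longer depends only on the sparse game score: reaching intermediate subgoals provides reward along the way, which is what makes progress possible in a game as unforgiving as Montezuma's Revenge.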

AlphaGo: Defeating the Best Go Players

In the Introduction, we discussed the Go competition between Lee Sedol and AlphaGo. DeepMind developed AlphaGo with the goal of playing Go against the Grandmasters. October 2015 was the first time that software beat a professional human player at Go, a game with around 10^170 possible positions, more than the number of moves in chess or even the total number of atoms in the universe (around 10^80). In fact, if every atom in the universe were a universe itself, there would be fewer atoms than the number of positions in a Go game.
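That last claim is easy to sanity-check with widely cited order-of-magnitude estimates—roughly 10^170 legal Go positions and roughly 10^80 atoms in the observable universe (both are approximations, not exact figures).

```python
# Rough sanity check of the comparison, using order-of-magnitude estimates.
go_positions_exp = 170   # legal Go positions ~ 10^170 (approximate)
atoms_exp = 80           # atoms in the observable universe ~ 10^80 (approximate)

# If every atom were itself a universe of 10^80 atoms, the total would be
# 10^80 * 10^80 = 10^160 atoms, still fewer than ~10^170 Go positions.
assert atoms_exp + atoms_exp < go_positions_exp
print(f"10^{atoms_exp + atoms_exp} atoms < 10^{go_positions_exp} Go positions")
```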

In many countries, such as South Korea and China, Go is considered a national game, much as football and basketball are in the US, and these countries have many professional Go players, who train from the age of 6.* Players who show promise switch from a normal school to a special Go school, where they play and study Go for 12 hours a day, 7 days a week, living with their Go Master and other child prodigies. So it is a serious matter for a computer program to challenge these players.

There are around 2,000 professional Go players in the world, along with roughly 40 million casual players. In an interview at the Google Campus,* Hassabis shared, "We knew that Go was much harder than chess." He describes how he initially thought of building AlphaGo the same way Deep Blue was built, that is, as a system that performed a brute-force search guided by a handcrafted set of rules.
