A Brief Overview of Deep Learning



Updated November 2, 2022

You’re reading an excerpt of Making Things Think: How AI and Deep Learning Power the Products We Use, by Giuliano Giacaglia.

The fundamental shift in solving problems that probabilistic reasoning brought to AI from 1993 to 2011 was a big step forward, but probability and statistics only took developers so far. Geoffrey Hinton created a breakthrough technique called backpropagation to usher in the next era of artificial intelligence: deep learning. His work with multilayer neural networks is the basis of modern-day AI development.

Deep learning is a class of machine learning methods that uses multilayer neural networks that are trained through techniques such as supervised, unsupervised, and reinforcement learning.

In 2012, Geoffrey Hinton and students in his lab showed that deep neural networks, trained using backpropagation, beat the best algorithms in image recognition by a wide margin.

Right after that, deep learning took off, unlocking a ton of potential. Its first large-scale use was Google Brain: a team led by Andrew Ng fed 10 million YouTube videos to 1,000 computers, and the system learned to recognize cats and detect faces without hard-coded rules.

Then DeepMind used deep learning to create the first machine learning system to defeat the best Go players in the world. It combined Monte Carlo tree search, a technique used by earlier Go programs, with two deep neural networks: one computed the probability of each possible next move, and the other estimated the chances of winning from each resulting position. This was a breakthrough because of how hard Go is compared to games such as chess: the number of possible board states is many orders of magnitude larger.

The same team used deep neural networks to determine the 3D structure of proteins from their amino acid sequences, creating what is known as AlphaFold. This was a solution to a 50-year-old grand challenge in biology.*

New techniques emerged, including generative adversarial networks (GANs), in which two neural networks play a cat-and-mouse game: one creates fake images that resemble the real ones fed into it, and the other decides whether the images it sees are real or fake. GANs have been used to generate images that look strikingly real.

While deep learning required new software systems, such as TensorFlow, and new hardware, such as graphics processing units (GPUs) and tensor processing units (TPUs), what was most needed was a way to train these deep neural networks. Deep learning is only as successful as its training data. Fei-Fei Li’s work created an instrumental dataset, ImageNet, used not only to train the algorithms but also as a benchmark for the field. Without that data, deep learning would not be where it is today.


Unfortunately, with data collection at such a massive scale, privacy becomes a concern. While we have come to expect data breaches, it does not have to be this way. Apple and Google continue to fine-tune approaches that do not require collecting personal data. Differential privacy and federated learning are two such techniques in use today: they allow models to update and learn without leaking individual information about the data used for training.

The Main Approaches to Machine Learning

I learned very early the difference between knowing the name of something and knowing something.
Richard Feynman*

Machine learning algorithms usually learn either by analyzing data and inferring what kind of model, or what parameters of a model, best explain it, or by interacting with an environment and getting feedback from it. The data may or may not be annotated by humans, and the environment can be simulated or the real world.

The three main categories of machine learning algorithms are supervised learning, unsupervised learning, and reinforcement learning. Other techniques exist, such as evolution strategies and semi-supervised learning, but they are not as widely used or as successful as these three.

Supervised Learning

Supervised learning has been widely used in training computers to tag objects in images and to translate speech to text. Let’s say you own a real estate business and one of the most important aspects of being successful is to figure out the price for a house when it enters the market. Determining that price is extremely important for completing a sale, making both the buyer and seller happy. You, as an experienced realtor, can figure out the pricing for a house based on your previous knowledge.

But as your business grows, you need help, so you hire new realtors. To be successful, they also need to determine the price of a house in the market. In the interest of helping these inexperienced people, you write down the value of the houses that the company already bought and sold, based on size, neighborhood, and various details, including the number of bathrooms and bedrooms and the final sale price.

Bedrooms | Sq. Feet | Neighborhood | Sale Price
2        | 1,000    | Potrero Hill | $300K
1        | 800      | Potrero Hill | $200K

Table: Sample data for a supervised learning algorithm.

This information is called the training data: example data containing the factors, or features, that may influence the price of a house, together with the final sale price. New hires study this data to learn which factors influence the final price of a house. For example, the number of bedrooms might be a great indicator of price, while the size of the house may matter less. When inexperienced realtors have to price a new house entering the market, they simply look for the most similar past sale and use its price as a reference.

Bedrooms | Sq. Feet | Neighborhood | Sale Price

Table: Missing information that the algorithm will determine.

That is precisely how algorithms learn from training data with a method called supervised learning. The algorithm knows the price of some of the houses in the market, and it needs to figure out how to predict the new price of a house that is entering the market. In supervised learning, the computer, instead of the realtors, figures out the relationship between the data points. The value the computer needs to predict is called the label. In the training data, the labels are provided. When there is a new data point whose value, the label, is not defined, the computer estimates the missing value by comparing it to the ones it has already seen.
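To make the idea concrete, here is a minimal sketch in Python of this find-the-most-similar-house approach, a nearest-neighbor prediction. The listings, features, and the distance scaling are all illustrative, not real market data:

```python
# A toy nearest-neighbor price estimator, mirroring how a new realtor
# might price a house by finding the most similar past sale.
# The listings below are illustrative, not real market data.

training_data = [
    # (bedrooms, square feet, sale price)
    (2, 1000, 300_000),
    (1, 800, 200_000),
    (3, 1500, 450_000),
]

def predict_price(bedrooms, sq_feet):
    """Return the sale price of the most similar house in the training data."""
    def distance(listing):
        b, s, _ = listing
        # Scale square feet down so both features contribute comparably.
        return abs(b - bedrooms) + abs(s - sq_feet) / 500
    closest = min(training_data, key=distance)
    return closest[2]

print(predict_price(2, 1100))  # closest listing is (2, 1000) -> 300000
```

Real systems weigh many more features and learn how much each one matters, rather than using a hand-picked distance, but the principle of comparing a new data point to labeled examples is the same.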

Unsupervised Learning

Unsupervised learning is a machine learning technique that learns patterns with unlabeled data.

In our example, unsupervised learning is similar to supervised learning, but the price of each house is not part of the information included in the training data. The data is unlabeled.

Bedrooms | Sq. Feet | Neighborhood
2        | 800      | Potrero Hill
4        | 2,000    | Potrero Hill

Table: Sample training data for an unsupervised learning algorithm.

Even without the price of the houses, you can discover patterns in the data. For example, it can tell you that there is an abundance of two-bedroom houses and that the average house on the market is around 1,200 square feet. Other information that might be extracted is that very few houses in a certain neighborhood have four bedrooms, or that five major styles of houses exist. With that information, when a new house enters the market, you can find the most similar houses by comparing features, or identify the new house as an outlier. This is what unsupervised learning algorithms do.
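As a minimal sketch, here is how some of those patterns could be extracted in Python from unlabeled listings. The data and the crude outlier threshold are illustrative:

```python
# Unsupervised pattern-finding on unlabeled listings: no prices, just
# summary statistics and a crude outlier check. Data is illustrative.

listings = [
    # (bedrooms, square feet)
    (2, 800),
    (4, 2000),
    (2, 1100),
    (2, 900),
]

# Count how common each bedroom count is.
bedroom_counts = {}
for bedrooms, _ in listings:
    bedroom_counts[bedrooms] = bedroom_counts.get(bedrooms, 0) + 1

# Average size of a house on the market.
avg_size = sum(sq for _, sq in listings) / len(listings)

def is_outlier(bedrooms, sq_feet):
    """Flag a listing whose size is far from the market average."""
    return abs(sq_feet - avg_size) > avg_size  # crude threshold

print(bedroom_counts)        # {2: 3, 4: 1} -- most houses have two bedrooms
print(avg_size)              # 1200.0
print(is_outlier(10, 9000))  # True -- far larger than anything seen
```

Production systems use clustering and density-estimation algorithms for this, but the spirit is the same: structure is inferred from the data alone, with no labels.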

Reinforcement Learning

The previous two ways of learning are based solely on data given to the algorithm. The process of reinforcement learning is different: the algorithm learns by interacting with the environment. It receives feedback from the environment either by rewarding good behavior or punishing bad. Let’s look at an example of reinforcement learning.

Say you have a dog, Spot, who you want to train to sit on command. Where do you start? One way is to show Spot what “sit” means by putting her bottom on the floor. The other way is to reward Spot with a treat whenever she puts her bottom on the floor. Over time, Spot learns that whenever she sits on command she receives a treat and that this is a rewarded behavior.

Reinforcement learning works in the same way. It is a framework built on the insight that you can teach intelligent agents, such as a dog or a deep neural network, to achieve a task by rewarding them when they perform it correctly. Whenever the agent achieves the desired outcome, the reward increases its chance of repeating that action. An agent is anything that processes input from its environment and acts on it; Spot is the agent in our example.

Reinforcement learning is a machine learning training method based on rewarding desired behaviors and/or punishing undesired ones.

Reinforcement learning is interesting as a framework, but the associated algorithms are its most important aspect. They work by defining the reward the agent receives once it achieves a state, like sitting. Their goal is to find a policy, a mapping from states to the actions to be taken, that maximizes the expected reward, so that the agent learns the behavior that earns the treat.

In the reinforcement learning formulation, the environment gives the reward: the agent does not figure out the reward itself but only receives it by interacting with the environment and hitting on the expected behavior. One problem with this is that the agent sometimes takes a long time to receive a reward. For example, if Spot never sits, then she never receives a treat and does not learn to sit. Or, let’s say you want an agent to learn how to navigate a maze and the reward is only given when the agent exits the maze. If the agent takes too long before leaving, then it is hard to say which actions the agent took that helped it get out of the maze. Another problem is that the agent only learns from its own successes and failures. That is not necessarily the case with humans in the real world. No one needs to drive off a cliff thousands of times to learn how to drive. People can figure out rewards from observation.

Reinforcement Algorithms

The following two steps define reinforcement algorithms:

  • Add randomness to the agent’s actions so it tries something different, and

  • If the result was better than expected, do more of the same in the future.

Adding randomness to the actions ensures that the agent searches for the correct actions to take. And if the result is the one expected, then the agent tries to do more of the same in the future. The agent does not necessarily repeat the exact same actions, however, because it still tries to improve by exploring potentially better actions. Even though reinforcement algorithms can be explained easily, they do not necessarily work for all problems. For reinforcement learning to work, the situation must have a reward, and it is not always easy to define what should or should not be rewarded.
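The two steps above can be sketched as a simple epsilon-greedy agent in Python. This is a toy illustration rather than a full reinforcement learning algorithm: the actions, the reward values, and the 10% exploration rate are invented for the Spot example:

```python
import random

# An epsilon-greedy agent: most of the time it repeats the action with
# the best reward estimate (exploitation), but a fraction of the time it
# tries something random (exploration). Actions and rewards are made up.

random.seed(0)

ACTIONS = ["sit", "roll over"]
TRUE_REWARD = {"sit": 1.0, "roll over": 0.0}  # only sitting earns a treat

estimates = {a: 0.0 for a in ACTIONS}  # agent's value estimate per action
counts = {a: 0 for a in ACTIONS}
EPSILON = 0.1  # fraction of the time the agent tries something different

for step in range(1000):
    if random.random() < EPSILON:
        action = random.choice(ACTIONS)           # step 1: add randomness
    else:
        action = max(ACTIONS, key=estimates.get)  # step 2: do more of what worked
    reward = TRUE_REWARD[action]
    counts[action] += 1
    # Move the estimate toward the observed reward (incremental average).
    estimates[action] += (reward - estimates[action]) / counts[action]

best = max(ACTIONS, key=estimates.get)
print(best)  # the agent learns that "sit" is the rewarded behavior
```

Note that the reward comes from the environment (the `TRUE_REWARD` table here); the agent never sees it directly, only the feedback from each action it tries.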

Reinforcement algorithms can also backfire. Let’s say an agent is rewarded by the number of paper clips it makes. If it learns to transform anything into paper clips, it may end up turning everything into paper clips.* If the reward does not penalize the agent for making too many paper clips, the agent can misbehave. Reinforcement learning algorithms are also mostly inefficient: they spend a lot of time searching for the correct solution, adding randomized actions in the hunt for the right behavior. Even with these limitations, they can accomplish an impressive variety of tasks, such as playing Go at a superhuman level and making robotic arms grasp objects.

Another way of learning, particularly useful with games, is to have multiple agents play against each other. Two classic examples are chess and Go, where two agents compete and learn which actions to take by being rewarded when they win. This technique is called self-play, and it can be used not only with a reinforcement learning algorithm but also to generate data. In Go, for example, it can reveal which plays are more likely to lead to a win. Self-play generates data from computing power, that is, from the computer playing against itself.

The three learning categories are each useful in different situations. Use supervised learning when there is a lot of data labeled by people, such as photos where users have tagged their friends on Facebook. Unsupervised learning is used primarily when there is little information about the data points the system needs to analyze, such as in detecting cyber attacks: one can infer an attack by looking at the data and spotting odd behaviors that were not there before. Reinforcement learning is mainly used when there is not much data about the task the agent needs to achieve but there are clear goals, like winning a chess game. Machine learning algorithms, and more specifically deep learning algorithms, are trained with these three modes of learning.

Deep Neural Networks

I have always been convinced that the only way to get artificial intelligence to work is to do the computation in a way similar to the human brain. That is the goal I have been pursuing. We are making progress, though we still have lots to learn about how the brain actually works.
Geoffrey Hinton*

The Breakthrough

Deep learning is a type of machine learning algorithm that uses multilayer neural networks and backpropagation as a technique to train the neural networks. The field was created by Geoffrey Hinton, the great-great-grandson of George Boole, whose Boolean algebra is a keystone of digital computing.*

The evolution of deep learning was a long process, so we must go back in time to understand it. The technique first arose in the field of control theory in the 1950s. One of the first applications involved optimizing the thrusts of the Apollo spaceships as they headed to the moon.

The earliest neural networks, called perceptrons, were the first step toward human-like intelligence. However, a 1969 book, Perceptrons: An Introduction to Computational Geometry by Marvin Minsky and Seymour Papert, demonstrated the extreme limitations of the technology by showing that a shallow, single-layer network could perform only the most basic computational functions. At the time, their book was a huge setback to the field of neural networks and AI.

Getting past the limitations pointed out in Minsky and Papert’s book requires multilayer neural networks. To create a multilayer neural network to perform a certain task, researchers first determine how the neural network will look by determining which neurons connect to which others. But to finish creating such a neural network, the researchers need to find the weights between each of the neurons—how much one neuron’s output affects the next neuron. The training step in deep learning usually does that. In that step, the neural network is presented with examples of data and the training software figures out the correct weights for each connection in the neural network so that it produces the intended results; for example, if the neural network is trained to classify images, then when presented with images that contain cats, it says there is a cat there.

Backpropagation is an algorithm that adjusts the weights so that each change moves the neural network closer to the right output faster than was previously possible. It works backward through the network: the error is measured at the output, the neurons closest to the output are adjusted first, then the error is propagated to the layer before it, and so on, until the first layer of neurons is the one adjusted.

In 1986, Hinton published the seminal paper on deep neural networks (DNNs), “Learning representations by back-propagating errors,” with his colleagues David Rumelhart and Ronald Williams.* The article introduced the idea of backpropagation, a simple mathematical technique that led to huge advances in deep learning.

The backpropagation technique developed by Hinton finds the weights for each neuron in a multilayer neural network far more efficiently. Before this technique, finding the weights, also known as coefficients, took an exponential amount of time, making it extremely hard to find the correct coefficient for each neuron; training a neural network to fit its inputs could take months or years. With backpropagation, it took significantly less time.

Hinton’s breakthrough also showed that backpropagation made it practical to train neural networks with more than two or three layers, breaking through the limitation of shallow networks. It made it possible to find good weights for a multilayer neural network so that it produces the desired output. This development allowed scientists to train more powerful neural networks, making them much more relevant. For comparison, one of the most performant neural networks in vision, called Inception, has approximately 22 layers of neurons.
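To see the technique in action, the following sketch trains a tiny two-input, two-hidden-neuron, one-output network with backpropagation to learn XOR, the kind of function a single-layer perceptron cannot represent. The network size, learning rate, and number of passes are arbitrary choices for this toy example:

```python
import math
import random

# A 2-2-1 network learning XOR via backpropagation. Toy illustration:
# sizes, learning rate, and epoch count are arbitrary.

random.seed(1)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Random initial weights: two hidden neurons (2 inputs + bias each),
# one output neuron (2 hidden inputs + bias).
w_h = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w_o = [random.uniform(-1, 1) for _ in range(3)]

DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # XOR truth table
LR = 0.5  # learning rate

def forward(x):
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_h]
    o = sigmoid(w_o[0] * h[0] + w_o[1] * h[1] + w_o[2])
    return h, o

def total_error():
    return sum((forward(x)[1] - target) ** 2 for x, target in DATA)

err_before = total_error()

for epoch in range(20000):
    for x, target in DATA:
        h, o = forward(x)
        # Backpropagation: adjust the output neuron first, then push the
        # error back to the hidden layer, as described above.
        d_o = (o - target) * o * (1 - o)
        d_h = [d_o * w_o[i] * h[i] * (1 - h[i]) for i in range(2)]
        for i in range(2):
            w_o[i] -= LR * d_o * h[i]
        w_o[2] -= LR * d_o
        for i in range(2):
            for j in range(2):
                w_h[i][j] -= LR * d_h[i] * x[j]
            w_h[i][2] -= LR * d_h[i]

err_after = total_error()
predictions = [round(forward(x)[1]) for x, _ in DATA]
print(predictions)  # typically recovers XOR, depending on the random start
```

The error after training is far below the error at the random initialization; without backpropagation's layer-by-layer gradient, finding these weights would require searching an exponential space.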

What is a Neural Network?

The figure below shows an example of both a simple neural network (SNN) and a deep learning neural network (DLNN). On the left of each network is the input layer, represented by the red dots. These receive the input data. In the SNN, the hidden layer neurons are then used to make the adjustments needed to reach the output (blue dots) on the right side. In contrast, the use of more than one layer characterizes the DLNN, allowing for far more complex behavior that can handle more involved input.

Figure: A simple neural network and a multilayer neural network.

The way researchers usually develop a neural network is first by defining its architecture: the number of neurons and how they are arranged. The parameters of the neurons inside the network then need to be determined. To do that, researchers initialize the weights with random numbers, feed in the input data, and check whether the output resembles the one they want. If it does not, they update the weights of the neurons until the output is as close as possible to what the training data shows.

For example, let’s say you want to classify some images as containing a hot dog and others as not containing a hot dog. To do that, you feed the neural network images containing hot dogs and others that do not, which is the training data. Following the initial training, the neural network is then fed new images and needs to determine if they contain a hot dog or not.

These input images are composed of a matrix of numbers, representing each pixel. The neural network goes through the image, and each neuron applies matrix multiplication, using the internal weights, to the numbers in the image, generating a new image. The outputs of the neurons are a stack of lower resolution images, which are then multiplied by the neurons on the next layer. On the final layer, a number comes out representing the solution. In this case, if it is positive, it means that the image contains a hot dog, and if it is negative, it means that it does not contain a hot dog.
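A minimal sketch of that forward pass: a single "neuron", here a 2x2 filter of weights, slides across a tiny image of pixel values, producing a smaller feature map, and the sign of the final number gives the classification. The image, the filter weights, and the hot dog framing are all illustrative:

```python
# One "neuron" as a 2x2 filter sliding over a tiny grayscale image.
# The image, weights, and labels are illustrative.

image = [
    [255, 255, 0, 0],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
]

filter_weights = [
    [1, -1],
    [1, -1],
]

def convolve(img, w):
    """Slide the 2x2 filter over the image, multiplying and summing at each spot."""
    rows, cols = len(img), len(img[0])
    out = []
    for r in range(rows - 1):
        row = []
        for c in range(cols - 1):
            total = sum(img[r + i][c + j] * w[i][j]
                        for i in range(2) for j in range(2))
            row.append(total)
        out.append(row)
    return out

feature_map = convolve(image, filter_weights)  # smaller than the input image
score = sum(sum(row) for row in feature_map)   # collapse to a single number
label = "hot dog" if score > 0 else "not a hot dog"
print(score, label)  # 1530 hot dog
```

A real network stacks many such filters across many layers, and the weights are learned rather than hand-picked, but each layer's core operation is this multiply-and-sum over the previous layer's output.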

The problem is that the weights are not defined at the beginning. The process of finding the weights that produce a positive number for images containing a hot dog and a negative number for the rest, known as training the network, is non-trivial. Because a neural network has so many weights, it takes a long time to find values that classify all the images correctly; simply too many possibilities exist. Additionally, depending on the input set, the network can become overtrained, or overfit, to the specific dataset, meaning it focuses too narrowly on that data and cannot generalize to recognize images outside of it.

The complete process of training the network relies on passing the input data through the network multiple times. Each pass’s output is used, through backpropagation, as feedback to adjust the weights for future passes.

One of the reasons why backpropagation took so long to be developed was that the function required computers to perform multiplication, which they were pretty bad at in the 1960s and 1970s. At the end of the 1970s, one of the most powerful processors, the Intel 8086, could compute less than one million instructions per second.* For comparison,* the processor running on the iPhone 12 is more than one million times more powerful than that.*

Figure: Geoffrey Hinton, who founded the field of deep learning.

Widespread Application

Deep learning only really took off in 2012, when Hinton and two of his Toronto students showed that deep neural networks, trained using backpropagation, beat state-of-the-art systems in image recognition by almost halving the previous error rate. Because of his work and dedication to the field, Hinton’s name became almost synonymous with the field of deep learning. He now has more citations than the next top three deep learning researchers combined.

After this breakthrough, deep learning started being applied everywhere, in applications including image classification, language translation, and speech recognition, as used by Siri, for example. Deep learning models can improve on any task previously addressed with heuristics, techniques defined by human experience or intuition, including games like Go, chess, and poker, as well as activities like driving cars. Deep learning will be used more and more to improve the performance of computer systems themselves, with tasks like figuring out the order in which processes should run or which data should remain in a cache. These tasks can all become much more efficient with deep learning models. Storage will be a big application, and in my opinion, the use of deep learning will only continue to grow.

It is not a coincidence that deep learning took off and performed better than most of the state-of-the-art algorithms: multilayer neural networks have two very important qualities.*

First, they can express the very complicated functions needed to solve the problems we want to address. For example, to understand what is going on in an image, you need a function that takes the pixels and translates them into text or some other human-readable representation. Second, deep learning can learn from simply processing data, rather than needing an explicit feedback response. These two qualities make it extremely powerful, since many problems, like image classification, come with a lot of data.

The reason why deep neural networks are as good as they are is that they are equivalent to circuits, and a neuron can easily implement a Boolean function. For that reason, a deep enough network can simulate a computer given a sufficient number of steps. Each part of a neural network simulates the simplest part of a processor. That means that deep neural networks are as powerful as computers and, when trained correctly, can simulate any computer program.

Currently, deep learning is a battleground between Google, Apple, Facebook, and other technology companies that aim to serve individuals’ needs in the consumer market. For example, Apple uses deep learning to improve its models for Siri, and Google for its recommendation engine on YouTube. Since 2013, Hinton has worked part-time at Google in Mountain View, California, and Toronto, Canada. And, as of this writing in 2021, he is the lead scientist on the Google Brain team, arguably one of the most important AI research organizations in the world.

Convolutional Neural Networks

There are many types of deep neural networks, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory networks (LSTMs), and each has different properties. For example, recurrent neural networks are deep neural networks in which neurons in higher layers connect back to the neurons in lower layers. Here, we’ll focus on convolutional neural networks, which are computationally more efficient and faster than most other architectures.* They are extremely relevant as they are used for state-of-the-art text translation, image recognition, and many other tasks.

Figure: A recurrent neural network, where one of the neurons feeds back into a previous layer.

The first time Yann LeCun revolutionized artificial intelligence, it was a false start.* By 1995, he had dedicated almost a decade to what many computer scientists considered a bad idea: that mimicking some features of the brain would be the best way to improve artificial intelligence algorithms. But LeCun finally demonstrated that his approach could produce something strikingly smart and useful.

At Bell Labs, LeCun worked on software that simulated how the brain works, more specifically, how the visual cortex functions. Bell Labs, a research facility owned by the then gigantic AT&T, employed some of the most eminent computer scientists of the era. Unix, the operating system that became the basis for Linux, macOS, and Android, was developed there. Not only that, but the transistor, the basis of all modern computer chips, as well as the laser and two of the most widely used programming languages to date, C and C++, were also developed there. It was a hub of innovation, so it was no coincidence that one of the most important deep learning architectures was born in the same lab.

Figure: An image of the primary visual cortex.

LeCun based his work on research done by Kunihiko Fukushima, a Japanese computer researcher.* Fukushima had created a model of artificial neural networks based on how vision works in the human brain. The architecture was built around two types of neurons found in the primary visual cortex, the part of the brain that processes visual information: simple cells and complex cells.

Simple cells are responsible for detecting local features, like edges. Complex cells pool the results that simple cells produce within an area. For example, a simple cell may detect an edge that could be part of a chair; complex cells aggregate that information, passing what the simple cells detected in the layer below up to the next higher level.

The architecture of a CNN is based on a cascading model of these two types of cells, and it is mainly used for pattern recognition tasks. LeCun produced the first piece of software that could read handwritten text by looking at many different examples using this CNN model. With this work, AT&T started selling the first machines capable of reading handwriting on checks. For LeCun, this marked the beginning of a new era where neural networks would be used in other fields of AI. Unfortunately, it was not to be.

Figure: Yann LeCun, head of Facebook AI Research.

The same day that LeCun celebrated the launch of bank machines that could read thousands of checks per hour, AT&T announced it was splitting into three different companies, the result of an antitrust lawsuit by the US government. At that point, LeCun became the head of research at a much smaller AT&T and was directed to work on other things. In 2002, he left and eventually became head of the Facebook AI Research group.

LeCun continued working in neural networks, especially in convolutional neural networks, and slowly the rest of the machine learning world came around to the technology. In 2012, some of his students published a paper that demonstrated using CNNs to classify real-world house numbers better than all previous algorithms had been able to do. Since then, deep neural networks have exploded in use, and now most of the research developed in machine learning focuses on deep learning. Convolutional neural networks spread widely and have been used to beat most of the other algorithms for many applications, including natural language processing and image recognition.

The efforts of the team paid off. In 2017, every photo uploaded to Facebook was processed by multiple CNNs. One of them identified which people were in the picture, and another determined if there were objects in the picture. At that time, around 800 million photos were uploaded per day, so the throughput of the CNNs was impressive.

How Convolutional Neural Networks Work

A convolutional neural network (or CNN) is a multilayer neural network, named for the hidden layers that perform convolutions. A convolution is a mathematical function that is the integral of the product of two functions after one is reversed and shifted. For images, it amounts to running filters across the whole image and producing new, filtered images.

Most notably, the inputs to CNNs usually consist of images.*

On the layers performing convolution, each neuron walks through the image, multiplying the number representing each pixel by the corresponding weight in the neuron, generating a new image as the output.

Let’s examine how a convolutional neural network classifies images. First, we need to turn an image into something a neural network can work with. An image is just data: we represent each pixel of the image as a number; in a black-and-white image, this number indicates how dark the pixel is. The figure below represents the number 8. In this representation, 0 is white and 255 is completely black; the closer the number is to 255, the darker the pixel.

Figure: An image of eight (left) and the representation of eight in numbers (right).

Figure: The image on the top represents a regular neural network, while the bottom image represents a CNN. Every layer of a CNN transforms the 3D input volume into a 3D output volume.

Think of each neuron as a filter that goes through the entire image. Each layer may have multiple neurons. The figure below shows two neurons walking through the entire image: the red neuron goes first, then the green neuron does the same, each producing a new resulting image.

The resulting images can go directly to the next layer of the neural network and are processed by those neurons. The processed images in one layer can also be processed by a method called pooling before going to the next layer. The function of pooling is to simplify the results from the previous layers. This may consist of getting the maximum number that the pixels represent in a certain region (or neighborhood) or summing up the numbers in a neighborhood. This is done in multiple layers. When a neuron runs through an image, the next layer produces a smaller image by truncating data. This process is repeated over and over through successive layers. In the end, the CNN produces a list of numbers or a single number, depending on the application.

Figure: The image on the left shows what pooling looks like. The image on the right represents how one of the neurons filters and generates new images based on the input, that is, the convolution operation.
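Max pooling, the most common form of the simplification step described above, can be sketched in a few lines; the feature map values are illustrative:

```python
# 2x2 max pooling: each 2x2 neighborhood of a feature map is reduced to
# its largest value, halving the resolution. Values are illustrative.

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 3],
]

def max_pool(fm):
    """Reduce each non-overlapping 2x2 neighborhood to its maximum."""
    out = []
    for r in range(0, len(fm), 2):
        row = []
        for c in range(0, len(fm[0]), 2):
            row.append(max(fm[r][c], fm[r][c + 1],
                           fm[r + 1][c], fm[r + 1][c + 1]))
        out.append(row)
    return out

pooled = max_pool(feature_map)
print(pooled)  # [[4, 2], [2, 7]]
```

Summing the neighborhood instead of taking its maximum gives the other pooling variant mentioned above; either way, the next layer receives a smaller image that keeps the strongest responses.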

Based on the result, the image can then be classified based on what the system is looking for. For example, if the resulting number is positive, the image can be classified as containing a hot dog, and if the resulting number is negative, then the image is classified as not containing a hot dog. But this assumes that we know what each neuron looks like, that is, what the filter looks like for every layer. In the beginning, the neurons are completely random, and by using the backpropagation technique, the neurons are updated in such a way that they produce the desired result.

Figure: An image of a cat that goes through a multilayer neural network. In the last step of this neural network, a number comes out. If it is positive, then the neural network classifies the image as a cat. If it is negative, it classifies the image as a dog.

A CNN is trained by showing it many images tagged with their results—the labels. This set is called the training data. The neural network updates its weights, based on whether it classifies the images properly or not, using the backpropagation algorithm. After the training stage, the resulting neural network is the one used to classify new images. Even though CNNs were created based on how the visual cortex works, they can also be used on text, for example. To do that, the inputs are translated into a matrix to match the format of an image.

There is a misconception that deep neural networks are a black box, that is, that there is no way of knowing what they are doing. It is true that there is no way of determining, for every possible input, whether image, sound, or text, what the resulting output will be, or whether the network will classify it correctly. But that does not mean there is no way of determining what each layer of a neural network does.

Figure: How the filters look (gray) for a CNN classifying objects, and the corresponding images that activate these filters. The visualizations show Layers 1 and 2: the filters in Layer 1 detect edges, and those in Layer 2 detect waves and other patterns. Each layer illustrates two pictures, one showing the filters themselves and one showing the parts of the images that are most strongly activated by the given filter. For example, in the space labeled Layer 2, the 16 different filters are represented on the left.

In fact, for CNNs, you can see what the filters look like and what kinds of images activate each layer. The weights in each neuron can be interpreted as pictures. The figure above shows the filters at different layers as well as some examples of images that activate those layers. For example, in the first layer of a multilayer CNN, the filters, or weights, for the neurons look like edges. That means the filters will activate when edges are found. The second layer of filters shows that the types of images that activate them are a little more complex, with eyes, curves, and other shapes. The third layer activates with images such as wheels, profiles of people, birds, and faces. That means that at each layer of the neural network, more complex images are filtered. The first layer filters the image and passes on information about whether an area contains edges or not. The next layer uses that information and, from the detected edges, tries to find wheels and so forth. The last layer identifies the categories that humans want to know about: for example, whether the image contains a cat, hot dog, or human.

Google Brain: The First Large-Scale Neural Network

The brain sure as hell doesn’t work by somebody programming in rules.
Geoffrey Hinton*

Google Brain started as a research project between Google employees Jeff Dean and Greg Corrado and Stanford Professor Andrew Ng in 2011.* But Google Brain turned into much more than simply a project. By acquiring companies such as DeepMind and key AI personnel like Geoffrey Hinton, Google has become a formidable player in advancing this field.

One of the early key milestones of deep neural networks resulted from the initial research led by Ng when he decided to process YouTube videos and feed them to a deep neural network.* Over the course of three days, he fed 10 million YouTube videos* to 1,000 computers with 16 cores each, using the 16,000 processor cores as a neural network to learn the common features in these videos. After being presented with a list of 20,000 different objects, the system recognized pictures of cats and around 3,000 other objects, identifying 16% of the objects without any input from humans.

The same software that recognized cats was able to detect faces with 81.7% accuracy and human body parts with 76.7% accuracy.* With only the data, the neural network learned to recognize images. It was the first time that such a massive amount of data was used to train a neural network. This would become the standard practice for years to come. The researchers made an interesting observation: “It is worth noting that our network is still tiny compared to the human visual cortex, which is 10^6 times larger in terms of the number of neurons and synapses.”*

DeepMind: Learning from Experience

Demis Hassabis was a child prodigy in chess, reaching the Master standard at age 13, when he was the second highest-rated player in the World Under-14 category, and he also “cashed at the World Series of Poker six times including in the Main Event.”* In 1994, at age 18, he began his computer games career co-designing and programming the classic game Theme Park, which sold millions of copies.* He then became the head of AI development for an iconic game called Black & White at Lionhead Studios. Hassabis earned his PhD in cognitive neuroscience from University College London in 2009.

Figure: Demis Hassabis, CEO of DeepMind.

In 2010, Hassabis co-founded DeepMind in London with the mission of “solving intelligence” and then using that intelligence to “solve everything else.” Early in its development, DeepMind focused on algorithms that mastered games, starting with games developed for Atari.* Google acquired DeepMind in 2014 for $525M.

DeepMind Plays Atari

Figure: Breakout game.

To help the program play the games, the team at DeepMind developed a new algorithm, the Deep Q-Network (DQN), that learned from experience. It started playing games like the famous Breakout game, interpreting the video and producing commands on the joystick. If a command produced an action where the player scored, then the learning software reinforced that action, and the next time it played the game, it would likely do the same thing. This is reinforcement learning, but with a deep neural network used to determine the quality of a state-action combination. The network helps determine which action to take given the state of the game, and the algorithm learns over time, after playing many games, the best actions to take at each point.
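The update rule at the heart of DQN can be sketched without the deep network. The toy example below uses a plain table of Q-values and a made-up one-dimensional "game" in which moving right eventually earns a reward; DQN replaces the table with a deep neural network, but the learning step is the same:

```python
import random

random.seed(0)
N_STATES = 5                      # positions 0..4 on a one-dimensional track
ACTIONS = (-1, +1)                # move left or move right
alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(200):
    s = 0
    while s < N_STATES - 1:
        if random.random() < epsilon:                      # explore sometimes
            a = random.choice(ACTIONS)
        else:                                              # exploit best known action
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), N_STATES - 1)
        reward = 1.0 if s_next == N_STATES - 1 else 0.0    # score at the goal
        # Q-learning update: nudge Q toward reward + discounted future value.
        best_next = max(Q[(s_next, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
        s = s_next

# The learned greedy policy moves right (+1) from every state.
policy = {s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES - 1)}
print(policy)
```

Actions that led to scoring get their Q-values reinforced, so over many episodes the agent settles on the best action at each point, just as described above.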

Figure: Games that DeepMind’s software played on Atari.* The AI performed better than human level at the ones above the line.

For example, in the case of Breakout,* after playing a hundred games, the software was still pretty bad and missed the ball often. But it kept playing, and after a few hours—300 games—the software improved and played at human ability. It could return the ball and keep it alive for a long time. After they let it play for a few more hours—500 games—it became better than the average human, learning to do a trick called tunneling, which involves systematically sending the ball to the side walls so that it bounces around on top, requiring less work and earning more reward. The same learning algorithm worked not only on Breakout but also for most of the 57 games that DeepMind tried the technique on, achieving superhuman level for most of them.

Figure: Montezuma’s Revenge.

The learning algorithm, however, did not perform well for all games. Looking at the bottom of the list, the software got a score of zero on Montezuma’s Revenge. DeepMind’s DQN software does not succeed in this game because the player needs to understand high-level concepts that people learn throughout their lifetime. For example, if you look at the game, you know that you are controlling the character and that ladders are for climbing, ropes are for swinging, keys are probably good, and the skull is probably bad.

Figure: Montezuma’s Revenge (left) and the teacher and student neural networks (right)

DeepMind improved the system by breaking the problem into simpler tasks. If the software could solve subproblems like “jump across the gap,” “get to the ladder,” and “get past the skull and pick up the key,” then it could solve the game and perform well at the task. To attack this problem, DeepMind created two neural networks: the teacher and the student. The teacher is responsible for learning and producing these subproblems, which it sends to the other neural network, the student. The student takes actions in the game and tries to maximize the score while also trying to do what the teacher tells it. Even though they were trained with the same data as the old algorithm, plus some additional information, the communication between the teacher and the student allowed strategy to emerge over time, helping the agent learn how to play the game.

AlphaGo: Defeating the Best Go Players

In the Introduction, we discussed the Go competition between Lee Sedol and AlphaGo. DeepMind developed AlphaGo with the goal of playing Go against the Grandmasters. October 2015 was the first time that software beat a professional human player at Go, a game with around 10^170 possible positions, more than the number of possible moves in chess or even the total number of atoms in the universe (around 10^80). In fact, if every atom in the universe were a universe itself, there would be fewer atoms than the number of positions in a Go game.

In many countries such as South Korea and China, Go is considered a national game, like football and basketball are in the US, and these countries have many professional Go players, who train from the age of 6.* If these players show promise in the game, they switch from a normal school to a special Go school where they play and study Go for 12 hours a day, 7 days a week. They live with their Go Master and other prodigy children. So, it is a serious matter for a computer program to challenge these players.

There are around 2,000 professional Go players in the world, along with roughly 40 million casual players. In an interview at the Google Campus,* Hassabis shared, “We knew that Go was much harder than chess.” He describes how he initially thought of building AlphaGo the same way that Deep Blue was built, that is, by building a system that did a brute-force search with a handcrafted set of rules.

But he realized that this technique would never work since the game is very contextual, meaning there was no way to create a program that could determine how one part of the board would affect other parts because of the huge number of possible states. At the same time, he realized that if he created an algorithm to beat the Master players in Go, then he would probably have made a significant advance in AI, more meaningful than Deep Blue.

Go is hard not only because the game has an astronomical number of possibilities. For a program to be good at playing Go, it needs to determine the best next move, and to figure that out, it needs to determine whether a position is good or not. It cannot play out all possibilities until the end because there are too many. Conventional wisdom held it impossible to determine the value of a given game state in Go. It is much simpler to do this for chess: for example, you can codify things like pawn structure and piece mobility, which are techniques Grandmasters use to determine if a position is good or bad. In Go, on the other hand, all pieces are the same. Even a single stone can modify the outcome of the game, so each one has a profound impact on the game.

What makes Go even harder is that it is a constructive game as opposed to a destructive one. In chess, you start with all the pieces on the board and remove them as you play, making the game simpler: the more you play, the fewer possibilities there are for the next moves. Go, however, begins with an empty board, and you add pieces, making it harder to analyze. In chess, if you analyze a complicated middle game, you can evaluate the current situation, and that tells you everything you need to know. To analyze a middle game in Go, you have to project into the future to evaluate the current state of the board, which makes it much harder to analyze. In reality, Go is more about intuition and instinct than calculation, unlike chess.

When describing the algorithm that AlphaGo produced, Hassabis said that it does not merely regurgitate human ideas and copy them. It genuinely comes up with original ideas. According to him, Go is an objective art because anyone can come up with an original move, but you can measure if that move or idea was pivotal for winning the game. Even though Go has been played at a professional level for 3,000 years, AlphaGo created new techniques and directly influenced how people played because AlphaGo was strategic and seemed human.

The Workings of AlphaGo

Figure: The Policy Network. Probability Distribution over moves. The darker the green squares, the higher the probability.

DeepMind developed AlphaGo mainly with two neural networks.* The first, the Policy Network, calculates the probability of each possible next move for a professional player given the state of the board. Instead of considering all of the roughly 200 possible next moves, the Policy Network looks only at the five to ten most promising ones. By doing that, it reduces the number of positions AlphaGo has to search in order to find the next best move.

Figure: The Value Network. How the position evaluator sees the board. Darker blue represents places where the next stone leads to a more likely win for the player.

Initially, AlphaGo was trained using around 100,000 Go games from the internet. But once it could narrow down the search for the next move, AlphaGo played against itself millions of times and improved through reinforcement learning, learning from its own mistakes. If it won, it became more likely to make those moves in the next game. With that information, it created its own database of millions of games of the system playing against itself.

Then, DeepMind trained a second neural network, the Value Network, which calculates the probability of a player winning based on the state of the board. It outputs 1 for black winning, 0 for white, and 0.5 for a tie. Together, these two networks turned what seemed an intractable problem into a manageable one, transforming the problem of playing Go into something like the problem of solving a chess game. The Policy Network provides the ten most probable next moves, and the Value Network gives the score of a board state. Given these two networks, the algorithm that finds the best moves, Monte Carlo tree search, which plays a role similar to the min-max search used in chess and described earlier, uses the probabilities to explore the space of possibilities.
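The division of labor between the two networks can be sketched as follows. The "networks" here are hypothetical stand-in functions with made-up numbers, not real Go evaluators; the point is how the policy prunes the search and the value scores the survivors:

```python
def policy_network(board):
    """Pretend policy: probability of each legal move (sums to 1)."""
    return {"A": 0.40, "B": 0.25, "C": 0.20, "D": 0.10, "E": 0.05}

def value_network(board, move):
    """Pretend value: probability of winning after playing `move`."""
    return {"A": 0.55, "B": 0.70, "C": 0.52, "D": 0.30, "E": 0.20}[move]

def choose_move(board, top_k=3):
    probs = policy_network(board)
    # Step 1: the policy network prunes the search to the top-k moves.
    candidates = sorted(probs, key=probs.get, reverse=True)[:top_k]
    # Step 2: the value network evaluates only those candidates.
    return max(candidates, key=lambda m: value_network(board, m))

print(choose_move(board=None))  # "B": plausible by the policy, best by the value
```

In the real system a Monte Carlo tree search repeats this prune-then-evaluate step many moves deep rather than just once, but the interplay is the same.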

The Match

AlphaGo played the best player in the world, Lee Sedol, a legend of the game who had won 18 world titles. At stake was a $1M prize.

Figure: AlphaGo’s first 99 moves in game 2. AlphaGo controls the black piece.

AlphaGo played unimaginable moves, changing the way people would play Go in the years ahead. The European champion, Fan Hui, told Hassabis that his mind was freed from the shackles of tradition after seeing the innovative plays of AlphaGo; he now considered unthinkable thoughts and was not constrained by the received wisdom that had bound him for decades.

“‘I was deeply impressed,’ Ke Jie said through an interpreter after the game. Referring to a type of move that involves dividing an opponent’s stones, he added: ‘There was a cut that quite shocked me, because it was a move that would never happen in a human-to-human Go match.’”* Jie, the current number one Go player in the world, stated, “Humanity has played Go for thousands of years, and yet, as AI has shown us, we have not even scratched the surface.” He continued, “The union of human and computer players will usher in a new era. Together, man and AI can find the truth of Go.”*

AlphaGo Zero

The DeepMind team did not stop after winning against the best player in the world. They decided to improve the software, merging the two deep neural networks, the Policy and Value Networks, to create AlphaGo Zero. “Merging these functions into a single neural network made the algorithm both stronger and much more efficient,” said David Silver, the lead researcher on AlphaGo. AlphaGo Zero removed a lot of redundancy between the two neural networks. The probability of adding a piece to a position (Policy Network) contains information about who might win the game (Value Network). If there is only one position where the player can place a piece, it might mean that the player is cornered and has a low chance of winning. Or, if the player can place a piece anywhere on the board, that probably means the player is winning, because it does not matter where the next piece goes. And because merging the networks removed these redundancies, the resulting neural network was smaller, much less complex and, therefore, more efficient: it had fewer weights and required less training. “It still required a huge amount of computing power—four of the specialized chips called tensor processing units (TPUs), which Hassabis estimated to be US$25M of hardware. But its predecessors used ten times that number. It also trained itself in days rather than months. The implication is that ‘algorithms matter much more than either computing or data available’, said Silver.”*
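The merged design can be sketched as one shared trunk feeding two output heads. The layer sizes and random weights below are made up for illustration; the point is that the policy and value computations reuse the same learned features instead of duplicating them:

```python
import numpy as np

rng = np.random.default_rng(0)
BOARD_FEATURES, HIDDEN, MOVES = 19 * 19, 64, 19 * 19   # toy sizes

W_trunk = rng.normal(0.0, 0.1, (BOARD_FEATURES, HIDDEN))   # shared trunk
W_policy = rng.normal(0.0, 0.1, (HIDDEN, MOVES))           # policy head
w_value = rng.normal(0.0, 0.1, (HIDDEN, 1))                # value head

def forward(board_vec):
    """One forward pass: shared features feed both heads."""
    shared = np.tanh(board_vec @ W_trunk)            # features computed once
    logits = shared @ W_policy
    policy = np.exp(logits) / np.exp(logits).sum()   # softmax over 361 moves
    value = float(np.tanh(shared @ w_value)[0])      # win estimate in [-1, 1]
    return policy, value

policy, value = forward(rng.normal(size=BOARD_FEATURES))
print(policy.shape, round(policy.sum(), 6))
```

Two separate networks would each need their own trunk; sharing it roughly halves the weights that must be trained, which is the efficiency gain Silver describes.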

Critics point out that what AlphaGo does is not actually learning because if you made the agent play the game with the same rules but changed the color of the pieces, the program would be confused and perform terribly. Humans can play well despite minor changes to the game. But that can be solved by training the software on different games as well as different colored pieces. DeepMind made its algorithms generalize to other games, like Atari, by using a technique called synaptic consolidation. Without this new method, when an agent first learns to play a game, it saturates the neural connections with the knowledge of how to play that game. Then, when the agent starts learning a variation of the game or a different game, all those connections are overwritten in order to learn the second game, producing catastrophic forgetting.

If a simple neural network is trained to play Pong controlling the paddle on the left, then after it learns how to play the game, the agent will always win, obtaining a score of 20 to 0. If you change the color of the paddle from green to black, the agent is still controlling the same paddle, but it will miss the ball every time. It ends up losing 20 to 0, showing a catastrophic failure. DeepMind solved this using inspiration from how a mouse’s brain works, and presumably how the human brain works as well. In the brain, there is a process called synaptic consolidation. It is the process in which the brain protects only neural connections that form when a particular new skill is learned.

Figure: AlphaGo Zero Elo rating over time. At 0 days, AlphaGo Zero has no prior knowledge of the game and only the basic rules as an input. At 3 days, AlphaGo Zero surpasses the abilities of AlphaGo Lee, the version that beat world champion Lee Sedol in 4 out of 5 games in 2016. At 21 days, AlphaGo Zero reaches the level of AlphaGo Master, the version that defeated 60 top professionals online and world champion Ke Jie in 3 out of 3 games in 2017. At 40 days, AlphaGo Zero surpasses all other versions of AlphaGo and, arguably, becomes the best Go player in the world. It does this entirely from self-play, with no human intervention and using no historical data.

Taking inspiration from biology, DeepMind developed an algorithm that does the same. After it plays a game, it identifies and protects only the neural connections that are the most important for that game. That means that whenever the agent starts playing a new game, spare neurons can be used to learn a new game, and at the same time, the knowledge that was used to play the old game is preserved, eliminating catastrophic forgetting. With this technique, DeepMind showed that the agents could learn how to play ten Atari games without forgetting how to play the old ones at superhuman level. The same technique can be used to make AlphaGo learn to play different variations of the Go game.
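DeepMind published this idea as elastic weight consolidation. Its core can be sketched as a penalty term added to the loss for the new game: weights that were important for the old game are anchored near their old values, while unimportant weights stay free to change. The weight and importance numbers below are made up for illustration:

```python
import numpy as np

def consolidation_penalty(weights, old_weights, importance, lam=100.0):
    """Quadratic penalty added to the new game's loss: moving an
    important weight away from its old value is expensive."""
    return 0.5 * lam * np.sum(importance * (weights - old_weights) ** 2)

old_w = np.array([1.0, -2.0, 0.5])        # weights after learning game A
importance = np.array([10.0, 0.0, 0.0])   # only the first weight mattered
new_w = np.array([1.0, 3.0, -4.0])        # game B reused the spare weights

# Changing unimportant weights is free; moving the protected one is not.
print(consolidation_penalty(new_w, old_w, importance))                        # 0.0
print(consolidation_penalty(np.array([2.0, 3.0, -4.0]), old_w, importance))   # 500.0
```

Training on the second game minimizes its own loss plus this penalty, so the optimizer naturally routes new knowledge into the spare, unprotected weights.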


OpenAI, a research institute started by tech billionaires including Elon Musk, Sam Altman, Peter Thiel, and Reid Hoffman, took on a new challenge. OpenAI was started to advance artificial intelligence and prevent the technology from turning dangerous. In 2016, led by CTO and co-founder Greg Brockman, the team looked on Twitch, an internet gaming community, for the most popular games that had an interface a software program could interact with and that ran on the Linux operating system. They selected Dota 2. The idea was to make a software program that could beat the best human player as well as the best teams in the world—a lofty goal. By a wide margin, Dota 2 would be the hardest game at which AI would beat humans.

At first glance, Dota 2 may look less cerebral than Go and chess because of its orcs and creatures. The game, however, is much more difficult than those strategy games because the board itself is larger and the number of possible moves is much greater. Not only that, but there are around 110 heroes, and each one has at least four moves. The average number of possible moves per turn in chess is around 20, and in Go, about 200. Dota 2 has on average 1,000 possible moves for every eighth of a second, and the average match lasts around 45 minutes. Dota 2 is no joke.

Figure: OpenAI team gathered at its headquarters in San Francisco, California.

Its first challenge was to beat the best humans in a one-on-one match, with the machine playing against a single human controlling their own creature. That is a much easier feat than playing against a team of five humans (5v5), because one-on-one is a more purely strategic game and a much tougher challenge for a lone human. With a team, one player can defend while another attacks at the same time, for example.

To beat the best humans at this game, OpenAI developed a system called Rapid. In this system, AI agents played against themselves millions of times per day, using reinforcement learning to train a multilayer neural network, specifically a long short-term memory (LSTM) network, so that it could learn the best moves over time. With all this training, by August of the following year, OpenAI agents defeated the best Dota 2 players in one-on-one matches and remained undefeated against them.

Figure: OpenAI Matchmaking rating showing the program’s skill level for Dota 2 over time.

OpenAI then focused on the harder version of the game: 5v5. The new agent was aptly named OpenAI Five. To train it, OpenAI again used its Rapid system, playing the equivalent of 180 years of Dota 2 per day. It ran its simulations using the equivalent of 128,000 computer processors and 256 GPUs. Training focused on updating an enormous long short-term memory network with the games and applying reinforcement learning. OpenAI Five takes an impressive amount of input: 20,000 numbers representing the board state and what the players are doing at any given moment.*

Figure: The Casper Match at which OpenAI beat the Casper team.

In January of 2018, OpenAI started testing its software against bots, and its AI agent was already winning against some of them. By August of the same year, OpenAI Five beat some of the best human teams in a 5v5 match, albeit with some limitations to the rules.

So, the OpenAI team decided to play against the best team in the world at the biggest Dota 2 competition, The International, in Vancouver, Canada. At the end of August, the AI agents played against the best team in the world in front of a huge audience in a stadium, while hundreds of thousands of people watched the games via streaming. In a very competitive match, OpenAI Five lost to the human team. But less than a year later, in April 2019, OpenAI Five beat the best team in the world in back-to-back games.*

Software 2.0

Computer languages of the future will be more concerned with goals and less with procedures specified by the programmer.
Marvin Minsky*

The Software 2.0 paradigm took off with the development of deep learning frameworks, most notably TensorFlow.

When creating deep neural networks, programmers write only a few lines of code and make the neural network learn the program itself instead of writing code by hand. This new coding style is called Software 2.0 because before the rise of deep learning, most of the AI programs were handwritten in programming languages such as Python and JavaScript. Humans wrote each line of code and also determined all of the program’s rules. In contrast, with the emergence of deep learning techniques, the new way that coders program these systems is by either specifying the goal of the program, like winning a Go game, or by providing data with the proper input and output, such as feeding the algorithm pictures of cats with “cat” tags and other pictures without cats with “not cat” tags.

Based on the goal, the programmer writes the skeleton of the program by defining the neural network architecture(s). Then, the programmer feeds the network data and uses computer hardware to find the exact neural network that best performs the specified goal. With traditional software, Software 1.0, most programs are stored as programmer-written code that can span thousands to billions of lines. For example, Google’s entire codebase has around two billion lines of code.* In the new paradigm, the program is stored in memory as the weights of the neural architecture, with only a few lines of code written by programmers. There are disadvantages to this new approach: software developers sometimes have to choose between software that they understand but that only works 90% of the time, and a program that performs well in 99% of cases but is not as well understood.
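The contrast between the two styles can be sketched with a deliberately tiny example. In Software 1.0 the programmer writes the rule; in Software 2.0 the programmer provides input/output examples plus a parameterized skeleton, and optimization finds the "program" (here, recovering the Celsius-to-Fahrenheit rule from data):

```python
# Software 1.0: the programmer hand-writes the rule.
def fahrenheit_v1(celsius):
    return celsius * 1.8 + 32

# Software 2.0: only (input, output) examples plus a model skeleton.
data = [(0.0, 32.0), (100.0, 212.0), (37.0, 98.6), (-40.0, -40.0)]
w, b = 0.0, 0.0            # the "program" is these two weights
lr = 0.0001

for _ in range(200_000):   # gradient descent on squared error
    for c, f in data:
        err = (w * c + b) - f
        w -= lr * err * c
        b -= lr * err

print(round(w, 2), round(b, 2))   # the learned rule approaches 1.8 and 32.0
```

Nobody wrote `1.8` or `32` into the second version; the optimizer found them, which is the sense in which the weights are the program.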


Some frameworks were created specifically for writing Software 2.0, that is, for building, training, and running these neural networks. The most well-known and widely used one is TensorFlow. Developed internally by Google and released as open source in 2015, it now powers Google products like Smart Reply and Google Photos, and it was also made available for external developers to use. By some metrics, it is now more popular than the Linux operating system. It became widely used by developers, startups, and other big companies for all types of machine learning tasks, including translation from English into Chinese and reading handwritten text. TensorFlow is used to create, train, and deploy a neural network to perform different tasks. But to train the resulting network, the developer must feed it data and define the goal the neural network optimizes for. This is as important as defining the neural network itself.

The Importance of Good Training Data

Because a large part of the program is the data fed to it, there is growing concern about whether the datasets used for these networks represent all the scenarios the program may run into. The data has become essential for the software to work as expected. One problem is that the data might not represent all the use cases a programmer wants to cover when developing the neural network, or it might not represent the most important scenarios. So, the size and variety of the dataset have become more and more important for building neural networks that perform as expected.

For example, let’s say that you want a neural network that creates a bounding box around cars on the road. The data needs to cover all cases. If there is a reflection of a car on a bus, then the data should not have it labeled as a car on the road. For the neural network to learn that, the programmer needs to have enough data representing this use case. Or, let’s say that five cars are in a car carrier. Should the software create a bounding box for each of the automobiles or just for the car carrier? Either way, the programmer needs enough examples of these cases in the dataset.

Another example is when a car's training data is mostly gathered in certain lighting conditions or with specific kinds of vehicles. If those same algorithms then encounter a vehicle with a different shape or in different lighting, they may behave differently. One example happened to Tesla when the self-driving software was engaged and did not notice the trailer in front of the car: the white side of the tractor trailer against a brightly lit sky was hard to detect. The crash resulted in the death of the driver.*

Labeling, that is, creating the dataset and annotating it with the correct information, is an important iterative process that takes time and experience to make work correctly. The data needs to be captured and cleaned. It is not something done once and then it is complete. Rather, it is something that evolves.

The Japanese Cucumber Farmer

TensorFlow has been used not only by developers, startups, and large corporations but also by individuals. One surprising story is that of a Japanese cucumber farmer. Makoto Koike, an automotive engineer, helped his parents sort cucumbers by size, shape, color, and other attributes on their small family farm in the city of Kosai. For years, they sorted their cucumbers manually. Cucumbers in Japan fetch different prices depending on their characteristics: for example, more colorful cucumbers and ones with many prickles are more expensive than others. Farmers pull aside the more expensive cucumbers so that they are paid fairly for their crop.

The problem is that it is hard to find workers to sort cucumbers during harvest season, and no machines are sold to small farmers to help with the sorting: they are either too expensive or do not provide the capabilities small farms need. Makoto’s parents separated the cucumbers by hand, which is as hard as growing them and takes months. Makoto’s mother used to spend eight hours per day sorting them. So, in 2015, after seeing how AlphaGo defeated the best Go players, Makoto had an idea: he decided to use the same framework, TensorFlow, to develop a cucumber-sorting machine.

To do that, he snapped 7,000 pictures of cucumbers harvested on his family’s farm. Then, he tagged the pictures with the properties that each cucumber had, adding information regarding their color, shape, size, and whether they contained prickles. He used a popular neural network architecture and trained it with the pictures that he took. At the time, Makoto did not train the neural network with the computer servers that Google offered because they charged by time used. Instead, he trained the network using his low-power desktop computer. Therefore, to train his tool in a timely manner, he converted the pictures to a smaller size of 80x80 pixels. The smaller the size of the images, the faster it is to train the neural network because the neural network is smaller as well. But even with the low resolution, it took him three days to train his neural network.

After all the work, “when I did a validation with the test images, the recognition accuracy exceeded 95%. But if you apply the system with real-use cases, the accuracy drops to about 70%. I suspect the neural network model has the issue of ‘overfitting,’” Makoto stated.*

Overfitting, also called overtraining, is the phenomenon in which a machine learning model performs well on the data it was trained on but fails to generalize to new data.
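A minimal sketch of the phenomenon: fit a high-capacity model to a handful of noisy points, then compare its error on the training data against fresh data from the same source. The data here is synthetic, standing in for Makoto's cucumber photos:

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    x = rng.uniform(-1, 1, n)
    y = x ** 2 + rng.normal(0, 0.2, n)   # true signal plus noise
    return x, y

x_train, y_train = make_data(10)
x_test, y_test = make_data(100)

# A degree-9 polynomial has enough capacity to memorize 10 training points.
coeffs = np.polyfit(x_train, y_train, deg=9)
train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)

print(f"train error {train_err:.4f}  vs  test error {test_err:.4f}")
# Training error is near zero; error on fresh data is much larger.
```

This is the same gap Makoto saw between his 95% validation accuracy and 70% real-world accuracy, produced here at toy scale.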

Makoto created a machine that was able to help his parents sort the cucumbers by shape, color, length, and level of distortion. It was not able to figure out whether the cucumbers had many prickles or not because of the low-resolution images used for training. But the resulting machine turned out to be very helpful for his family and cut down the time they spent manually sorting their produce.

The same technology built by one of the largest companies in the world and used to power its many products was also used by a small farmer on the other side of the globe. TensorFlow democratized access so many people could develop their own deep learning models. It will not be surprising to find many more “Makotos” out there.

Bugs 2.0

In Software 1.0, problems—called bugs—happened mostly because a person wrote logic that did not account for edge cases or handle all the possible scenarios. But in the Software 2.0 stack, bugs look very different: they arise when the data confuses the neural network.

One example of such a bug was when the autocorrect for iOS started using a weird character "# ?" to replace the word "I" when sending a message. The operating system mistakenly autocorrected the spelling of I because, at some point, the data it received taught it to do so. The model learned that "# ?" was the correct spelling according to the data. As soon as someone typed "I," the model thought it was important to fix and replaced it everywhere it could. The bug spread like a virus, reaching millions of iPhones. Given how quickly these bugs can spread and how consequential they can be, it is extremely important that the data, as well as the programs, are well tested to make sure such edge cases do not cause programs to fail.

Dueling Neural Networks

What I cannot create, I do not understand.
Richard Feynman*

One of the past decade’s most important developments in deep learning is generative adversarial networks, developed by Ian Goodfellow. This new technology can also be used for ill intent, such as for generating fake images and videos.

Generative Adversarial Networks

Figure: Faces generated by a GAN. The above images look real, but more than that, they look familiar.* They resemble a famous actress or actor that you may have seen on television or in the movies. They are not real, however. A new type of neural network created them.

A GAN, or generative adversarial network, is a class of machine learning frameworks in which two neural networks play a cat-and-mouse game: one creates fake images that look like the real ones fed into it, and the other decides whether they are real.

Generative adversarial networks (GANs), sometimes called generative networks, created these fake images. The Nvidia research team used this new technique by feeding thousands of photos of celebrities to a neural network. The network, in turn, produced thousands of pictures, like the ones above, that resemble the famous faces. They look real, but machines created them. GANs allow researchers to build images that look like the real ones by sharing many features of the images the neural network was fed. It can be fed photographs of objects from tables to animals, and after being trained, it produces pictures that resemble the originals.

Figure: Of the two images above, can you tell the real from the fake?*

For the Nvidia team to generate these images, it set up two neural networks: one that produced the pictures and another that determined whether they were real or fake. Combining these two neural networks produced a GAN, or generative adversarial network. They play a cat-and-mouse game, where one creates fake images that look like the real ones fed into it, and the other decides if they are real. Does this remind you of anything? The Turing test. Think of the networks as playing the guessing game of whether the images are real or fake.
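The cat-and-mouse game can be shown in miniature. The sketch below is a deliberately tiny GAN in plain NumPy: the generator is just a pair of numbers turning noise into samples, the discriminator is a logistic regression, and the data is a 1D Gaussian. Real GANs use deep networks and images, but the alternating update structure is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Generator G(z) = a*z + b tries to turn noise into samples from N(4, 1).
# Discriminator D(x) = sigmoid(w*x + c) tries to tell real samples from fakes.
a, b = 1.0, 0.0
w, c = 0.0, 0.0
lr = 0.05

for step in range(2000):
    x = rng.normal(4.0, 1.0, 64)         # real data batch
    z = rng.normal(0.0, 1.0, 64)
    g = a * z + b                         # fake batch

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    d_real, d_fake = sigmoid(w * x + c), sigmoid(w * g + c)
    grad_w = (-(1 - d_real) * x).mean() + (d_fake * g).mean()
    grad_c = (-(1 - d_real)).mean() + d_fake.mean()
    w, c = w - lr * grad_w, c - lr * grad_c

    # Generator step (non-saturating loss): make D call the fakes real.
    z = rng.normal(0.0, 1.0, 64)
    g = a * z + b
    d_fake = sigmoid(w * g + c)
    dg = -(1 - d_fake) * w                # gradient of -log D(g) w.r.t. g
    a, b = a - lr * (dg * z).mean(), b - lr * dg.mean()

# After training, the generator's offset b should have drifted toward the
# real mean of 4, so its fakes overlap the real distribution.
```

The generator never sees the real data directly; it only improves by listening to the discriminator's verdicts, which is the essence of the adversarial setup.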

After the GAN has been trained, one of the neural networks creates fake images that look like the real ones used in training. The resulting pictures look exactly like pictures of real people. This technique can generate large amounts of fake data that can help researchers predict the future or even construct simulated worlds. That is why, for Yann LeCun, Director of Facebook AI Research, "Generative Adversarial Networks is the most interesting idea in the last ten years in machine learning."* GANs will be helpful for creating images and maybe for creating software simulations of the real world, where developers can train and test other types of software. For example, companies writing self-driving software for cars can train and check their software in simulated worlds. I discuss this in detail later in this book.

These simulated worlds and situations are now handcrafted by developers, but some believe that these scenarios will all be created by GANs in the future. GANs generate new images and videos from very compressed data. Thus, you could use a GAN's two neural networks to save data and then reinstate it. Instead of zipping your files, you could use one neural network to compress them and the other to regenerate close approximations of the original videos or images. It is no coincidence that in the human brain, some of the apparatus used for imagination is the same as that used for memory recall. Demis Hassabis, the founder of DeepMind, published a paper* that "showed systematically for the first time that patients with damage to their hippocampus, known to cause amnesia, were also unable to imagine themselves in new experiences."* The finding established a link between the constructive process of imagination* and the reconstructive process of episodic memory recall.* There are more details regarding this later in this book.

Figure: Increasingly realistic synthetic faces generated by variations on generative adversarial networks through the years.

The Creator of GANs

Ian Goodfellow, the creator of GANs, came up with the idea at a bar in Montreal when he was with fellow researchers discussing what goes into creating photographs. The initial plan was to understand the statistics that determined what created photos and then feed them to a machine so that it could produce the pictures. Goodfellow thought that the idea would never work because far too many statistics were needed. So, he thought of using a different tool: a neural network. He could teach neural networks to figure out the underlying characteristics of the pictures fed to the machine and then generate new ones.

Figure: Ian Goodfellow, creator of generative adversarial networks.

Goodfellow then added two neural networks so that they could together build realistic photographs. One created fake images, and the other determined if they were real. The idea was that one of the adversary networks would teach the other how to produce images that could not be distinguished from the real ones.

On the same night that he came up with the idea, he went home, a little bit drunk, and stayed up that night coding the initial concept of a GAN on his laptop. It worked on the first try. A few months later, he and a few other researchers published the seminal paper on GANs at a conference.* That first GAN was trained on handwritten digits from a well-known training image set called MNIST.*

In the following years, hundreds of papers were published using the idea of GANs to produce not only images but also videos and other data. Now at Google Brain, Goodfellow leads a group that is making the training of these two neural networks very reliable. The result from this work is services that are far better at generating images and learning sounds, among other things. “The models learn to understand the structure of the world,” Goodfellow says. “And that can help systems learn without being explicitly told as much.”

Figure: Synthetically generated word images.*

GANs could eventually help neural networks learn with less data, generating synthetic images that are then used to train better neural networks. Recently, a group of researchers at Dropbox improved their mobile document scanner by using synthetically generated images. GANs produced new word images that, in turn, were used to train the neural network.

And, that is just the start. Researchers believe that the same technique can be applied to develop artificial data that can be shared openly on the internet while not revealing the primary source, making sure that the original data stays private. This would allow researchers to create and share healthcare information without sharing sensitive data about patients.

GANs also show promise for predicting the future. It may sound like science fiction now, but that might change over time. LeCun is working on writing software that can generate video of future situations based on current video. He believes that human intelligence lies in the fact that we can predict the future, and therefore, GANs will be a powerful force for artificial intelligence systems in the future.*

The Birthday Paradox Test

Even though GANs are generating new images and sounds, some people ask if GANs generate new information. Once a GAN is trained on a collection of data, can it produce data that contains information outside of its training data? Can it create images that are entirely different from the ones fed to it?

A way of analyzing that is by what is called the Birthday Paradox Test. This test derives its name from the implication that if you put 23—two soccer teams plus a referee—random people in a room, the chance that two of them have the same birthday is more than 50%.

This effect happens because, with 365 days in a year, you need only around the square root of that many people to likely see a duplicate birthday. The Birthday Paradox says that for a discrete distribution with support of size N, a random sample of about √N draws is likely to contain a duplicate. What does that mean? Let me break it down.

If there are 365 days in a year, then you need around the square root of 365—√365—people to have a probable chance of two sharing a birthday, which means about 19 people. But this also works in the other direction. If you do not know the number of days in a year, then you can select a fixed number of people and ask them for their birthdays. If two people share a birthday, you can infer the number of days in a year with high probability based on the number of people. If you have 22 people in the room, then the number of days in a year is approximately the square of 22, which is 484, a rough approximation of the actual number of days in a year.

The same test can check the size of the original distribution behind a GAN's generated images. If a set of K images contains duplicates with reasonable probability, then you can suspect that the number of original images is about K². So, if a test shows that it is very likely to find a duplicate in a set of 20 images, then the size of the original set of images is approximately 400. This test can be run by selecting subsets of images and checking how often duplicates appear in these subsets. If we find duplicates in more than 50% of the subsets of a given size, we can use that size for our approximation.
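The test is straightforward to simulate. The sketch below first reproduces the classic 23-people result, then reverses it: it grows the sample size until duplicates appear in most samples drawn from a hidden support of 400 items, and squares that size as the estimate. The 50% threshold and trial counts are illustrative choices, and the k² estimate is only order-of-magnitude accurate.

```python
import numpy as np

rng = np.random.default_rng(0)

def duplicate_rate(support, k, trials=1500):
    """Fraction of random size-k samples (with replacement) that contain a duplicate."""
    hits = 0
    for _ in range(trials):
        sample = rng.integers(0, support, k)
        hits += np.unique(sample).size < k
    return hits / trials

# Classic check: 23 people, 365 birthdays -> just over a 50% chance of a shared one.
p23 = duplicate_rate(365, 23)

# Reverse direction: grow k until duplicates appear in most samples,
# then estimate the hidden support size as roughly k squared.
hidden_support = 400
k = 2
while duplicate_rate(hidden_support, k) < 0.5:
    k += 1
estimate = k * k   # should land in the right ballpark of 400
```

Applied to a GAN, "support" is the number of genuinely distinct images the model can produce, and "duplicates" are near-identical generated samples.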

With that test in hand, researchers have shown that images generated by famous GANs do not generalize beyond what the training data provides. Now, what is left to prove is whether GANs can be improved to generalize beyond the training data or if there are ways of generalizing beyond the original images by using other methods to improve the training dataset.

GANs and Bad Intent

There are concerns that people can use the GAN technique with ill intent.* With so much attention on fake media, we could face an even broader range of attacks with fake data. “The concern is that these methods will rise to the point where it becomes very difficult to discern truth from falsity,” said Tim Hwang, who previously oversaw AI policy at Google and is now director of the Ethics and Governance of Artificial Intelligence Fund, an organization supporting ethical AI research. “You might believe that accelerates problems we already have.”*

Even though this technique could not initially create high-quality still images, researchers believe that the same technology could produce videos, games, and virtual reality. The work to start generating videos has already begun. Researchers are also using a wide range of other machine learning methods to generate faux data. In August of 2017, a group of researchers at the University of Washington made headlines when they built a system that could put words in Barack Obama's mouth in a video. An app with this technique is already on Apple's App Store.* The results were not completely convincing, but the rapid progress in the area of GANs and other techniques points to a future where it becomes tough for people to differentiate between real videos and generated ones. Some researchers claim that GANs are just another tool that can be used for good or evil and that there will be more technology to figure out whether newly created videos and images are real.

Not only that, but researchers have uncovered ways of using GANs to generate audio that sounds like one thing to humans but something else to machines.* For example, you can develop audio that sounds to humans like “Hi, how are you?” and to machines, “Alexa, buy me a drink.” Or, audio that sounds like a Bach symphony to a human, but for the machine, it sounds like “Alexa, go to this website.”

The future has unlimited potential. Digital media may surpass analog media by the end of this decade. We are starting to see examples of this with companies like Synthesia.* Another example is Imma, an Instagram model who is completely computer generated and has around 350k followers.* It won't be surprising to see more and more digital media in the world as the cost of creating such content goes down and it can be anything its creators want.

Data Is the New Oil

Data is the new oil. It's valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value.
Clive Humby*

Data is key to deep learning, and one of the most important datasets, ImageNet, created by Fei-Fei Li, marked the beginning of the field. It is used for training neural networks as well as to benchmark them against others.

Deep learning is a revolutionary field, but for it to work as intended, it requires data.* The term for these large datasets and the work around them is Big Data, which refers to the abundance of digital data. Data is as important for deep learning algorithms as the architecture of the network itself, the software. Acquiring and cleaning the data is one of the most valuable aspects of the work. Without data, neural networks cannot learn.*

Most of the time, researchers can use the data given to them directly, but there are many instances where the data is not clean. That means it cannot be used directly to train the neural network because it contains data that is not representative of what the algorithm is meant to classify. Perhaps it contains bad data, like black-and-white images when you want to create a neural network to locate cats in colored images. Another problem is inappropriate data: for example, when you want to classify images of people as male or female, some pictures might be missing the needed tag, or the information might be corrupted with misspelled words like "ale" instead of "male." Even though these might seem like unlikely scenarios, they happen all the time. Handling these problems and cleaning up the data is known as data wrangling.
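A minimal data-wrangling pass over labels like the ones described might look like this; the record layout, fix table, and valid label set are all hypothetical.

```python
# Hypothetical raw records: some tags missing, some misspelled or inconsistently cased.
records = [
    {"file": "img1.jpg", "label": "male"},
    {"file": "img2.jpg", "label": "ale"},      # corrupted tag
    {"file": "img3.jpg", "label": None},       # missing tag
    {"file": "img4.jpg", "label": "Female "},  # stray case and whitespace
]

FIXES = {"ale": "male", "emale": "female"}     # known corruptions and their repairs
VALID = {"male", "female"}

def clean(records):
    """Normalize labels, repair known misspellings, and drop unusable records."""
    out = []
    for r in records:
        label = (r["label"] or "").strip().lower()
        label = FIXES.get(label, label)
        if label in VALID:
            out.append({"file": r["file"], "label": label})
    return out

cleaned = clean(records)   # img3.jpg is dropped; the other three are normalized
```

Real pipelines add logging and review steps, since silently "fixing" a corrupted tag (was "ale" really "male"?) can itself introduce label errors.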

Researchers also sometimes have to fix problems with how data is represented. In some places, the data might be expressed one way, and in others, the same data can be described in a completely different way. For example, a disease like diabetes might be coded with one number (3) in one database and another (5) in a second. This is one reason for the considerable effort in industries to create standards for sharing data more easily. For example, Fast Healthcare Interoperability Resources (FHIR) was developed by the international health standards organization Health Level Seven International to define standards for exchanging electronic health records.

Standardizing data is essential, but selecting the correct input is also important because the algorithm is created based on the data.* And, choosing that data is not easy. One of the problems that can occur when selecting data is that it can be biased in some way, creating a problem known as selection bias. That means that the data used to train the algorithm does not necessarily represent the entire space of possibilities. The saying in the industry is, “Garbage in, garbage out.” That means that if the data entered into the system is not correct, then the model will not be accurate.


Fei-Fei Li, who was the director of the Stanford Artificial Intelligence Laboratory and also the Chief Scientist of AI/ML at Google Cloud, could see data was essential to the development of machine learning algorithms early on,* before many of her colleagues.

Figure: Professor Fei-Fei Li.

Li realized that to make better algorithms and more performant neural networks, more and better data was needed and that better algorithms would not come without that data. At the time, the best algorithms could perform well with the data that they were trained and tested with, which was very limited and did not represent the real world. She realized that for the algorithms to perform well, data needed to resemble actuality. “We decided we wanted to do something that was completely historically unprecedented,” Li said, referring to a small team initially working with her. “We’re going to map out the entire world of objects.”

To solve the problem, Li constructed one of the most extensive datasets for deep learning to date, ImageNet. The dataset was created, and the paper describing the work was published in 2009 at one of the key computer vision conferences, Computer Vision and Pattern Recognition (CVPR), in Miami, Florida. The dataset proved so useful to researchers that it became more and more famous, providing the benchmark for one of the most important annual deep learning competitions, which tested and trained algorithms to identify objects with the lowest error rate. ImageNet became the most significant dataset in the computer vision field for a decade and also helped boost the accuracy of algorithms that classified objects in the real world. In only seven years, the winning algorithms' accuracy in classifying objects in images increased from 72% to nearly 98%, overtaking the average human's ability.

But ImageNet was not the overnight success many imagine. It required a lot of sweat from Li, beginning when she taught at the University of Illinois Urbana-Champaign. She was dealing with problems that many other researchers shared. Most of the algorithms were overtraining to the dataset given to them, making them unable to generalize beyond it. The problem was that most of the data presented to these algorithms did not contain many examples, so they did not have enough information about all the use cases for the models to work in the real world. She, however, figured out that if she generated a dataset that was as complex as reality, then the models should perform better.

It is easier to identify a dog if you see a thousand pictures of different dogs, at different camera angles and in different lighting conditions, than if you only see five dog pictures. In fact, it is a well-known rule of thumb that algorithms can extract the right features from images if there are around 1,000 images for a certain type of object.

Li started looking for other attempts to create a representation of the real world, and she came across a project, WordNet, created by Professor George Miller. WordNet was a dataset with a hierarchical structure of the English language. It resembled a dictionary, but instead of having an explanation for each word, it had a relation to other words. For example, the word “monkey” is underneath the word “primate,” which is in turn underneath the word “mammal.” In this way, the dataset contained the relation of all the words among others.
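WordNet's hierarchy can be pictured as a parent map. The toy snippet below walks a word up its chain of hypernyms; the tiny dictionary here is illustrative, not WordNet's actual data.

```python
# Toy fragment of a WordNet-style hierarchy: each word points to its parent term.
PARENT = {"monkey": "primate", "primate": "mammal", "mammal": "animal"}

def hypernym_chain(word):
    """Walk a word up the hierarchy until it has no recorded parent."""
    chain = [word]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

# hypernym_chain("monkey") walks monkey -> primate -> mammal -> animal.
```

ImageNet kept exactly this structure but attached a large set of example images to each node instead of a definition.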

After studying and learning about WordNet, Li met with Professor Christiane Fellbaum, who worked with Miller on WordNet. She gave Li the idea to add an image and associate it to each word, creating a new hierarchical dataset based on images instead of words. Li expanded on the idea—instead of adding one image per word, she added many images per word.

As an assistant professor at Princeton, she built a team to tackle the ImageNet project. Li’s first idea was to hire students to find images and add them to ImageNet manually. But she realized that it would become too expensive and take too much time for them to finish the project. From her estimates, it would take a century to complete the work, so she changed strategies. Instead, she decided to get the images from the internet. She could write algorithms to find the pictures, and humans would choose the correct ones. After months working on this idea, she found that the problem with this strategy was that the images chosen were constrained to the algorithms that picked the images. Unexpectedly, the solution came when Li was talking to one of her graduate students, who mentioned a service that allows humans anywhere in the world to complete small online tasks very cheaply. With Amazon Mechanical Turk, she found a way to scale and have thousands of people find the right images for not too much money.

Amazon Mechanical Turk was the solution, but a problem still existed. Not all the workers spoke English as their first language, so there were issues with specific images and the words associated with them. Some words were harder for these remote workers to identify. Not only that, but there were words like “babuin” that confused workers—they did not exactly know which images represented the word. So, her team created a simple algorithm to figure out how many people had to look at each image for a given word. Words that were more complex like “babuin” required more people to check images, and simpler words like “cat” needed only a few people.

With Mechanical Turk, creating ImageNet took less than three years, much less than the initial estimate with only undergraduates. The resulting dataset had around 3 million images separated into about 5,000 “words.” People were not impressed with her paper or dataset, however, because they did not believe that more and more refined data led to better algorithms. But most of these researchers’ opinions were about to change.

The ImageNet Challenge

To prove her point, Li had to show that her dataset led to better algorithms. To achieve that, she had the idea of creating a challenge based on the dataset to show that the algorithms using it would perform better overall. That is, she had to make others train their algorithms with her dataset to show that they could indeed perform better than models that did not use her dataset.

The same year she published the paper in CVPR, she contacted a researcher named Alex Berg and suggested that they work together to publish papers showing that algorithms using the dataset could figure out whether images contained particular objects or animals and where they were located in the picture. In 2010 and 2011, they published five papers using ImageNet.* The first became the benchmark of how algorithms would perform on these images. To make it the benchmark for other algorithms, Li reached out to the team supporting one of the most well-known image recognition datasets and benchmark standards, PASCAL VOC. They agreed to work with Li and added ImageNet as a benchmark for their competition. The PASCAL dataset the competition used had only 20 classes of images. By comparison, ImageNet had around 5,000 classes.

As Li predicted, the algorithms that were trained using the ImageNet dataset performed better and better as the competition continued. Researchers learned that algorithms started performing better for other datasets when the models were first trained using ImageNet and then fine-tuned for another task. A detailed discussion on how this worked for skin cancer is in a later section.

A major breakthrough occurred in 2012. The creator of deep learning, Geoffrey Hinton, together with Ilya Sutskever and Alex Krizhevsky submitted a deep convolutional neural network architecture called AlexNet—still used in research to this day—“which beat the field by a whopping 10.8 percentage point margin.”* That marked the beginning of deep learning’s boom, which would not have happened without ImageNet.

ImageNet became the go-to dataset for the deep learning revolution and, more specifically, for the convolutional neural networks (CNNs) led by Hinton. ImageNet not only led the deep learning revolution but also set a precedent for other datasets. Since its creation, dozens of new datasets have been introduced with more abundant data and more precise classification. Now, they allow researchers to create better models. Not only that, but research labs have focused on releasing and maintaining new datasets for other fields, like the translation of texts and medical data.

Figure: Inception Module included in GoogleNet.

In 2015, Google released a new convolutional neural network called Inception, or GoogleNet.* It contained far fewer parameters than the top-performing neural networks of the time, yet it performed better. Instead of adding one filter per layer, Google added an Inception Module, which runs several filters of different sizes in parallel and concatenates their outputs. It showed once again that the architecture of neural networks is important.
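The idea of an Inception Module, several filters of different sizes running in parallel with their outputs concatenated along the channel axis, can be sketched with fixed box filters in NumPy. The real module uses learned 1x1, 3x3, and 5x5 convolutions plus a pooling branch; this toy keeps only the branching structure.

```python
import numpy as np

def mean_filter(img, k):
    """Naive k x k box filter with zero 'same' padding (stand-in for a learned conv)."""
    p = k // 2
    padded = np.pad(img, p)
    out = np.empty_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

def inception_like(img):
    branches = [
        img.astype(float),     # 1x1 branch (identity here; learned in the real module)
        mean_filter(img, 3),   # 3x3 branch
        mean_filter(img, 5),   # 5x5 branch
    ]
    return np.stack(branches, axis=-1)   # concatenate branch outputs along channels

rng = np.random.default_rng(0)
feature_map = rng.random((8, 8))
out = inception_like(feature_map)        # shape (8, 8, 3): one channel per branch
```

Because each branch sees the same input at a different scale, the next layer can pick whichever receptive field is most useful, rather than the architect committing to one filter size per layer.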

Figure: ImageNet Top-5 accuracy over time. Top-5 accuracy asks whether the correct label is in at least the classifier’s top five predictions.

ImageNet is considered solved, reaching an error rate lower than the average human's and achieving superhuman performance at figuring out whether an image contains an object and what kind of object it is. After nearly a decade, the competition to train and test models on ImageNet ended. Li tried to remove the dataset from the internet, but big companies like Facebook pushed back since they used it as their benchmark.

But since the end of the ImageNet competition, many other datasets have been created from the millions of images, voice clips, and text snippets entered and shared on tech companies' platforms every day. People sometimes take for granted that these datasets, which are intensive to collect, assemble, and vet, are free. Being open and free to use was an original tenet of ImageNet that will outlive the challenge and likely even the dataset. "One thing ImageNet changed in the field of AI is suddenly people realized the thankless work of making a dataset was at the core of AI research," Li said. "People really recognize the importance: the dataset is front and center in the research as much as algorithms."

Data Privacy

Arguing that you don’t care about the right to privacy because you have nothing to hide is no different than saying you don’t care about free speech because you have nothing to say.
Edward Snowden*

In 2014, Tim received a request on his Facebook app to take a personality quiz called “This Is Your Digital Life.” He was offered a small amount of money and had to answer just a few questions about his personality. Tim was very excited to get money for this seemingly easy and harmless task, so he quickly accepted the invitation. Within five minutes of receiving the request on his phone, Tim logged in to the app, giving the company in charge of the quiz access to his public profile and all his friends’ public profiles. He completed the quiz within 10 minutes. A UK research facility collected the data, and Tim continued with his mundane day as a law clerk in one of the biggest law firms in Pennsylvania.

What Tim did not know was that he had just shared his and all of his friends’ data with Cambridge Analytica. This company used Tim’s data and data from 50 million other people to target political ads based on their psychographic profiles. Unlike demographic information such as age, income, and gender, psychographic profiles explain why people make purchases. The use of personal data on such a scale made this scheme, which Tim passively participated in, one of the biggest political scandals to date.

Data has become an essential part of deep learning algorithms.* Large corporations now store a lot of data from their users because that has become such a central part of building better models for their algorithms and, in turn, improving their products. For Google, it is essential to have users’ data in order to develop the best search algorithms. But as companies gather and keep all this data, it becomes a liability for them. If a person has pictures on their phone that they do not want anyone else to see, and if Apple or Google collects those pictures, their employees could have access to them and abuse the data. Even if these companies protect against their own employees having access to the data, a privacy breach could occur, allowing hackers access to people’s private data.

Hacks resulting in users’ data being released are very common. Every year, it seems the number of people affected by a given hack increases. One Yahoo hack compromised 3 billion people’s accounts.* So, all the data these companies have about their users becomes a burden. At other times, data is given to researchers on the assumption that they have the best intentions. But researchers are not always sensitive when handling data. That was the case with the Cambridge Analytica scandal. In that instance, Facebook provided researchers access to information about users and their friends that was mainly in their public profiles, including people’s names, birthdays, and interests.* This private company then used the data and sold it to political campaigns to target people with personalized ads based on their information.

Differential Privacy

Differential privacy is a way of obtaining statistics from a pool of data from many people without revealing what data each person provided.

Keeping sensitive data or giving data directly to researchers for creating better algorithms is dangerous. Personal data should be private and stay that way. As far back as 2006, researchers at Microsoft were concerned about users’ data privacy and created a breakthrough technique called differential privacy, but they never used it in their products. Ten years later, Apple released products on the iPhone using this same method.

Figure: How differential privacy works.

Apple implements one of the most private versions of differential privacy, called the local model. It adds noise to the data directly on the user’s device before sending it to Apple’s servers. In that way, Apple never touches the user’s true data, preventing anyone other than the user from having access to it. Researchers can analyze trends of people’s data but are never able to access the details.*

Differential privacy does not merely try to make users’ data anonymous. It allows companies to collect data from large datasets, with a mathematical proof that no one can learn about a single individual.*

Imagine that a company wanted to collect the average height of their users. Anna is 5 feet 6 inches, Bob is 5 feet 8 inches, and Clark is 5 feet 5 inches. Instead of collecting the height individually from each user, Apple collects the height plus or minus a random number. So, it would collect 5 feet 6 inches plus 1 inch for Anna, 5 feet 8 inches plus 2 inches for Bob, and 5 feet 5 inches minus 3 inches for Clark, which equals 5 feet 7 inches, 5 feet 10 inches, and 5 feet 2 inches, respectively. Apple averages these heights without the names of the users.

The average height of its users would be roughly the same before and after adding the noise: about 5 feet 6 inches. In this small example, the random offsets happen to cancel out exactly; across many users, they average out to nearly zero. But Apple would not be collecting anyone’s actual height, and their individual information remains secret. That allows Apple and other companies to create smart models without collecting personal information from their users, thus protecting their privacy. The same technique could produce models about images on people’s phones and any other information.
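The local model is easy to simulate. In the sketch below, each simulated user adds Laplace noise to their height before reporting it; individual reports are off by several inches, but the population mean survives almost untouched. The noise scale here is arbitrary; real deployments calibrate it to a privacy budget (the parameter usually called epsilon).

```python
import numpy as np

rng = np.random.default_rng(0)

true_heights = rng.normal(66.0, 3.0, 10_000)   # inches; never leaves the devices
noise = rng.laplace(0.0, 4.0, 10_000)          # each device draws its own noise
reported = true_heights + noise                 # the only values the server sees

# Any single report is unreliable (off by about 4 inches on average)...
individual_error = float(np.abs(noise).mean())

# ...but the aggregate statistic is nearly exact, because the noise averages out.
mean_error = abs(float(reported.mean()) - float(true_heights.mean()))
```

This is the trade-off at the heart of local differential privacy: the server learns accurate population trends while any individual's reported value is plausibly deniable.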

Differential privacy, which keeps users’ data private, is very different from anonymization. Anonymization does not guarantee that the data itself, like a picture, stays unleaked, or that the individual cannot be traced back from it. One example is transmitting a pseudonym for a person’s name but still transmitting their true height. Anonymization tends to fail. In 2006, Netflix released 100 million movie ratings from its viewers so that researchers could build a better recommendation algorithm, publishing only the ratings and removing all identifying details.* Researchers, however, matched this data with public data on the Internet Movie Database (IMDb).* By matching patterns of recommendations, they added the names back to the supposedly anonymous data. That is why differential privacy is essential: it prevents users’ data from being leaked in any way.

Figure: Emoji usage across different languages.

This figure shows each emoji’s share of total emoji usage in English-speaking and French-speaking countries. The data was collected using differential privacy. The distribution of emoji usage in English-speaking countries differs from that of French-speaking nations, which may reveal underlying cultural differences in how each group uses language.

Apple started using differential privacy to improve its predictive keyboard,* the Spotlight search, and the Photos app. It was able to advance these products without obtaining any specific user’s data. For Apple, privacy is a core principle. Tim Cook, Apple’s CEO, has time and time again called for better data privacy regulation.* The same data and algorithms that can be used to enhance people’s lives can be used as a weapon by bad actors.

Apple’s predictive keyboard, built with data collected via differential privacy, helps users by suggesting the next word they are likely to type. Apple has also been able to build models of what is inside people’s photos without having the actual images, so users can search their pictures for specific items like “mountains,” “chairs,” and “cars.” All of that is served by models developed using differential privacy. Apple is not the only company using differential privacy in its products: in 2014, Google released a system for its Chrome web browser that learns users’ preferences without invading their privacy.

But Google has also been working with other technologies to produce better models while continuing to keep users’ data private.

Federated Learning

Google developed another technique called federated learning.* Instead of collecting statistics from users, Google builds a model in-house and deploys it to each user’s computer, phone, or application. The model is then trained locally, on data the user generates or that is already present on the device.

For example, if Google wants to create a neural network to identify objects in pictures and has a model of how “cats” look but not how “dogs” look, then the neural network is sent to a user’s phone that contains many pictures of dogs. From that, it learns what dogs look like, updating its weights. Then, it summarizes all of the changes in the model as a small, focused update. The update is sent to the cloud, where it is averaged with other users’ updates to improve the shared model. Everyone’s data advances the model.
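One round of this process can be sketched as follows. The linear model, loss, and client data here are hypothetical toys; real federated learning involves far larger models and many engineering details, but the shape of a round is the same: local training produces a small delta, and the server averages the deltas into the shared model.

```python
import numpy as np

def local_update(global_w, local_data, lr=0.1):
    """Runs on each device: train on local data, return only the weight delta."""
    w = global_w.copy()
    for x, y in local_data:            # one pass of gradient descent on a
        grad = 2 * (w @ x - y) * x     # toy linear model with squared error
        w -= lr * grad
    return w - global_w                # the small, focused update sent up

def federated_round(global_w, clients):
    """Runs on the server: average the clients' updates into the shared model."""
    updates = [local_update(global_w, data) for data in clients]
    return global_w + np.mean(updates, axis=0)

# Two hypothetical devices, each holding private (x, y) examples of y = 2x:
clients = [
    [(np.array([1.0]), 2.0)],
    [(np.array([2.0]), 4.0)],
]
w = np.zeros(1)
for _ in range(50):
    w = federated_round(w, clients)
# w converges toward 2.0 without any raw example leaving the "devices"
```

Only the weight deltas cross the network; the (x, y) pairs stay where they were generated, which is the privacy property federated learning is built around.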

Federated learning* works without the need to store user data in the cloud, but Google is not stopping there. They have developed a secure aggregation protocol that uses cryptographic techniques so that they can only decrypt the average update if hundreds or thousands of users have participated.* That guarantees that no individual phone’s update can be inspected before averaging it with other users’ data, thus guarding people’s privacy. Google already uses this technique in some of its products, including the Google keyboard that predicts what users will type. The product is well-suited for this method since users type a lot of sensitive information into their phone. The technique keeps that data private.
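The core idea behind secure aggregation can be illustrated with additive masking. This is a toy sketch, not Google’s actual protocol, which uses cryptographic key agreement and handles users dropping out mid-round, but it shows the property that matters: each masked update looks random on its own, yet the masks cancel exactly when the server sums them.

```python
import random

def mask_updates(updates, seed=42):
    """Each pair of users agrees on a random mask; one adds it, the other
    subtracts it, so every mask cancels in the final sum."""
    rng = random.Random(seed)  # stands in for pairwise key agreement
    n = len(updates)
    masked = list(updates)
    for i in range(n):
        for j in range(i + 1, n):
            m = rng.uniform(-100, 100)
            masked[i] += m
            masked[j] -= m
    return masked

updates = [0.4, 0.1, 0.7]   # individual model updates (private)
masked = mask_updates(updates)
# Each masked value is dominated by noise and reveals nothing on its own,
# but the sum is preserved: sum(masked) == sum(updates), up to
# floating-point error.
```

The server can therefore compute the average update it needs while never seeing any individual contribution in the clear.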

This field is relatively new, but it is already clear that companies do not need to keep users’ data to build better, more refined deep learning algorithms. In the years to come, more hacks will happen, and user data stored to improve these models will be exposed to hackers and other parties. But that does not need to be the norm: privacy does not have to be traded away for better machine learning models. Both can coexist.

Deep Learning Processors

What’s not fully realized is that Moore’s Law was not the first but the fifth paradigm to bring exponential growth to computers. We had electromechanical calculators, relay-based computers, vacuum tubes, and transistors. Every time one paradigm ran out of steam, another took over.

Ray Kurzweil*

The power of deep learning depends on the design as well as the training of the underlying neural networks. In recent years, neural networks have become complicated, often containing hundreds of layers. This imposes higher computational requirements, causing an investment boom in new microprocessors specialized for the field. The industry leader, Nvidia, earns at least $600M per quarter from selling its processors to data centers and companies like Amazon, Facebook, and Microsoft.

Facebook alone runs convolutional neural networks at least 2 billion times each day. That is just one example of how intensive the computing needs are for these processors. Tesla cars with Autopilot enabled also need enough computational power to run their software. To do so, Tesla cars need a super processor: a graphics processing unit (GPU).

Most of the computers that people use today, including smartphones, contain a central processing unit (CPU). This is the part of the machine where computation happens: the brain of the computer. A GPU is also made of electronic circuits, but it specializes in highly parallel work, originally accelerating the rendering of images in video games and other applications. The same kinds of operations that games need to draw frames on screen, mostly matrix arithmetic, are also what it takes to train neural networks and run them in the real world, so GPUs are much more efficient at these tasks than CPUs. Because most of the computation a self-driving system needs comes in the form of neural networks, Tesla added GPUs to its cars so that they can drive themselves through the streets.

Nvidia, a company started by Taiwanese immigrant Jensen Huang,* produces most of the GPUs that companies use, including Tesla, Mercedes, and Audi.* Tesla uses the Nvidia Drive PX2, which is designed for self-driving cars.* The Nvidia Drive processor has a specialized instruction set that accelerates neural networks’ performance at runtime and can compute 8 TFLOPS, meaning 8 trillion floating-point math operations per second.

A TFLOPS (trillion floating-point operations per second) is a unit of processor performance, commonly used to compare how much neural network computation a given chip can handle.
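As a back-of-the-envelope illustration of the unit: multiplying two n-by-n matrices, the core operation in neural networks, takes about 2n³ floating-point operations (n multiplies and n adds per output entry). The figures below assume the Drive PX2’s rated 8 TFLOPS; real workloads rarely sustain a chip’s peak throughput.

```python
def matmul_flops(n):
    # An n x n matrix multiply: n^2 outputs, each needing n multiplies
    # and n additions, for roughly 2 * n^3 operations total.
    return 2 * n ** 3

def seconds_at(flops, tflops):
    # Time to execute `flops` operations at a rate of `tflops` * 10^12 ops/s.
    return flops / (tflops * 1e12)

ops = matmul_flops(4096)       # ~1.4e11 operations
t = seconds_at(ops, 8.0)       # a few hundredths of a second at peak 8 TFLOPS
```

A single large matrix multiply is cheap at these rates; the challenge is that a modern network chains thousands of them per input, many times per second.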

Booming demand for Nvidia’s products has supercharged the company’s growth.* From January 2016 to August 2021, its stock soared from $7 to $220.* Most of Nvidia’s revenue today comes from the gaming industry, but even though automotive applications are a new field for the company, they already represent $576M annually, or 2% of its revenue. And the self-driving industry is just beginning.*

Video games were the flywheel for the company, the killer app as it’s called in Silicon Valley: an application with enormous potential sales volume that is at the same time one of the most computationally demanding workloads. Video games got Nvidia into the GPU market and funded the R&D for ever more powerful processors.

The amount of computation that GPUs, like CPUs, can handle has followed an exponential curve over the years. Moore’s Law is an observation that the number of transistors—the basic element of a CPU—doubles roughly every two years.*

Gordon Moore, co-founder of Intel, one of the most important microprocessor companies, formulated this observation. The computational power of CPUs has increased exponentially ever since, and the number of operations (TFLOPS) that GPUs can process has followed the same exponential curve, adhering to Moore’s Law.

But even with the growing capacity of GPUs, there was a need for more specialized hardware built specifically for deep learning. As deep learning became more widely used, the demand for processing units tailored to the technique outgrew what GPUs could provide, so large corporations started developing equipment designed specifically for deep learning. Google was one of them. When Google concluded that it would need twice as many CPUs as its data centers already contained to support its deep learning models for speech recognition, it created an internal group to develop hardware that could process neural networks more efficiently. To deploy the company’s models, it needed a specialized processor.

Tensor Processing Unit

In its quest to make a more efficient processor for neural networks, Google developed what is called a tensor processing unit (TPU). The name comes from the fact that the software uses the TensorFlow framework, which we discussed previously. The matrix multiplications and other linear algebra that TPUs handle do not need as much numerical precision as the graphics processing that GPUs do, which means TPUs can get by with fewer resources and perform many more calculations per second.
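One way to see the precision trade-off is quantization: representing weights as 8-bit integers instead of 32-bit floats. This is a generic sketch of symmetric linear quantization, not the TPU’s exact scheme, but it shows why lower precision is usually good enough: the reconstructed values stay close to the originals while needing a quarter of the bits and much simpler multiplier circuits.

```python
import numpy as np

def quantize(x, bits=8):
    """Map float32 values onto low-precision signed integers."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)  # e.g. 127 levels for int8
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float values from the integers and the shared scale.
    return q.astype(np.float32) * scale

weights = np.array([0.12, -0.5, 0.33, 0.08], dtype=np.float32)
q, s = quantize(weights)
restored = dequantize(q, s)
# `restored` differs from `weights` by at most half a quantization step,
# small enough that a network's predictions barely change.
```

The integer arithmetic this enables is the main reason inference-only chips can pack so many more operations per second into the same silicon and power budget.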

Google released its first TPU in 2016. This version of its deep learning processor was targeted solely at inference, meaning it only ran networks that had already been trained. Inference is the easier case: a trained model can run on a single chip. Training a model, by contrast, requires multiple chips to get a fast turnaround, so that programmers are not left waiting a long time to see if it works.

That is a much harder problem to solve because the chips must be interconnected, kept in sync, and able to exchange the appropriate messages. So, a year later, Google released a second version of TPUs on which developers could also train their models. And a year after that, Google released its third generation of TPUs, which could process eight times more than the previous version and used liquid cooling to cope with their intense power consumption.*

To get an idea of how powerful these chips are, a single second-generation TPU can run around 120 TFLOPS, or 200 times the calculations of a single iPhone.* Companies are battling to produce the fastest hardware for processing neural networks. After Google announced its second-generation TPUs, Nvidia announced its newest GPU, the Nvidia Volta, which delivers around 100 TFLOPS.*

But still, TPUs are around 15 to 30 times faster than GPUs, allowing developers to train their models much faster than on the older processors. TPUs are also much more energy-efficient than GPUs, saving Google a great deal of money on electricity. Google is investing heavily in deep learning and in the related compilers, the software that translates human-readable code into machine-readable instructions. That means it needs improvements in both the physical (hardware) and digital (software) space. This research and development field is so big that Google has entire divisions dedicated to improving different parts of the development pipeline.

Google is not the only giant working on its own specialized hardware for deep learning. The iPhone 12’s processor, the A14 Bionic chip, also contains a specialized neural unit* that can process up to 0.6 TFLOPS, or 600 billion floating-point operations per second. Some of that processing power is used for the facial recognition that unlocks the phone, powering FaceID. Tesla has also developed its own processing chips to run the neural networks behind its self-driving software.* The latest chip Tesla has developed and released can process up to 36 TFLOPS.*

The size of neural networks has been growing, and the processing power required to create and run the models has grown with it. OpenAI released a study showing that the amount of compute used in the largest AI training runs has been increasing exponentially, doubling every 3.5 months, and they expect the same growth to continue over the next five years. From 2012 to 2018, the amount of compute used to train these models increased 300,000x.*
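The arithmetic behind those figures is easy to check. Assuming the 300,000x increase spans roughly the 64 months between late 2012 and early 2018 (an estimate of the study’s measurement window, not a figure from the text), it implies a doubling time of about 3.5 months:

```python
import math

factor = 300_000               # growth in training compute, 2012 to 2018
months = 64                    # approximate measurement window (assumed)

doublings = math.log2(factor)  # ~18 doublings to reach a 300,000x increase
period = months / doublings    # ~3.5 months per doubling
```

For comparison, Moore’s Law doubles transistor counts roughly every 24 months, which is part of why specialized chips, rather than general-purpose CPU improvements, have had to absorb this demand.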

Figure: The amount of compute, in petaflop/s-days, used to train the largest neural networks. A petaflop/s is a computing speed of 10^15 floating-point operations per second, and a petaflop/s-day represents that rate sustained for a day, or about 10^20 operations.

This growth has parallels in the biological world, where there is a clear correlation between an animal’s cognitive capacity and its number of pallial or cortical neurons. It should follow that the number of neurons in an artificial neural network likewise affects the performance of these models.

As time passes and the amount of compute used for training deep learning models increases, more and more companies will develop specialized chips to handle the processing, and an increasing number of applications will use deep learning to achieve all types of tasks.

What Can AI Learn from Animal Brains?

To understand how AI algorithms develop and improve as they learn over time, it is important to step back from artificial intelligence systems and focus on how brains function. As it turns out, AI systems work in much the same way as human brains. So, I must first explain, at least at a high level, how animal, and specifically human, brains work.

The most important piece is the theory of Learning to Learn, which describes how the brain acquires the techniques for learning new topics. The human brain learns and encodes information during sleep, or at least in restful waking moments, converting short-term memory to long-term through replay in the hippocampus, visual cortex, and amygdala. The brain also uses the same circuitry that decodes the information stored in those regions to predict the future. Much like the human brain, AI systems decode previous information to create future scenes, like what may happen next in a video.

Biological Inspirations for Deep Learning
