There are many types of deep neural networks, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory networks (LSTMs), and each has different properties. For example, recurrent neural networks are deep neural networks in which neurons in higher layers connect back to the neurons in lower layers. Here, we’ll focus on convolutional neural networks, which are computationally more efficient and faster than most other architectures.* They are extremely relevant as they are used for state-of-the-art text translation, image recognition, and many other tasks.
Figure: A recurrent neural network, where one of the neurons feeds back into a previous layer.
The first time Yann LeCun revolutionized artificial intelligence was a false start.* By 1995, he had dedicated almost a decade to what was considered a bad idea according to many computer scientists: that mimicking some features of the brain would be the best way to make artificial intelligence algorithms better. But LeCun finally demonstrated that his approach could produce something strikingly smart and useful.
At Bell Labs, LeCun worked on software that simulated how the brain works, more specifically, how the visual cortex functions. Bell Labs, a research facility owned by the then gigantic AT&T, employed some of the eminent computer scientists of the era. One of the Unix operating systems, which became the basis for Linux, macOS, and Android, was developed there. Not only that, but the transistor, the base of all modern computer chips, as well as the laser and two of the most widely used programming languages to date, C and C++, were also developed there. It was a hub of innovation, so it was not a coincidence that one of the most important deep learning architectures was born in the same lab.
Figure: An image of the primary visual cortex.
LeCun based his work on research done by Kunihiko Fukushima, a Japanese computer researcher.* Kunihiko created a model of artificial neural networks based on how vision works in the human brain. The architecture was based on two types of neuron cells in the human brain called simple cells and complex cells. They are found in the primary visual cortex, the part of the brain that processes visual information.
Simple cells are responsible for detecting local features, like edges. Complex cells pool the results that simple cells produce within an area. For example, a simple cell may detect an edge that may represent a chair. Complex cells aggregate that information by informing the next higher level what the simple cells detected in the layer below.
The architecture of a CNN is based on a cascading model of these two types of cells, and it is mainly used for pattern recognition tasks. LeCun produced the first piece of software that could read handwritten text by looking at many different examples using this CNN model. With this work, AT&T started selling the first machines capable of reading handwriting on checks. For LeCun, this marked the beginning of a new era where neural networks would be used in other fields of AI. Unfortunately, it was not to be.
Unlock expert knowledge.
Learn in depth. Get instant, lifetime access to the entire book. Plus online resources and future updates.
Figure: Yann LeCun, head of Facebook AI Research.
The same day that LeCun celebrated the launch of bank machines that could read thousands of checks per hour, AT&T announced it was splitting into three different companies, the result of an antitrust lawsuit by the US government. At that point, LeCun became the head of research at a much smaller AT&T and was directed to work on other things. In 2002, he left and eventually became head of the Facebook AI Research group.
LeCun continued working in neural networks, especially in convolutional neural networks, and slowly the rest of the machine learning world came around to the technology. In 2012, some of his students published a paper that demonstrated using CNNs to classify real-world house numbers better than all previous algorithms had been able to do. Since then, deep neural networks have exploded in use, and now most of the research developed in machine learning focuses on deep learning. Convolutional neural networks spread widely and have been used to beat most of the other algorithms for many applications, including natural language processing and image recognition.
The efforts of the team paid off. In 2017, every photo uploaded to Facebook was processed by multiple CNNs. One of them identified which people were in the picture, and another determined if there were objects in the picture. At that time, around 800 million photos were uploaded per day, so the throughput of the CNNs was impressive.
How Convolutional Neural Networks Work
A convolutional neural network (or CNN), is a multilayer neural network. It is named as such because it contains hidden layers that perform convolutions. A convolution is a mathematical function that is the integral of the product of the two functions after one is reversed and shifted. For images, it means that you are running filters on the whole image and producing images with those filters.
Most notably, most inputs to CNNs consist of images.*
On the layers performing convolution, each neuron walks through the image, multiplying the number representing each pixel by the corresponding weight in the neuron, generating a new image as the output.
Let’s examine how a convolutional neural network classifies images. First, we need to make an image of something that a neural network can work with. An image is just data. We represent each pixel of the image as a number; in a black-and-white image, this can indicate how black that pixel is. The figure below represents the number 8. In this representation, 0 is white, and 255 is completely black. The closer the number is to 255, the darker the pixel is.
Figure: An image of eight (left) and the representation of eight in numbers (right).
Figure: The image on the top represents a regular neural network, while the bottom image represents a CNN. Every layer of a CNN transforms the 3D input volume into a 3D output volume.
Think of each neuron as a filter that goes through the entire image. Each layer may have multiple neurons. The figure below shows two neurons walking through the entire image. The red neuron first walks through the image, and the green neuron does the same producing a new resulting image.
The resulting images can go directly to the next layer of the neural network and are processed by those neurons. The processed images in one layer can also be processed by a method called pooling before going to the next layer. The function of pooling is to simplify the results from the previous layers. This may consist of getting the maximum number that the pixels represent in a certain region (or neighborhood) or summing up the numbers in a neighborhood. This is done in multiple layers. When a neuron runs through an image, the next layer produces a smaller image by truncating data. This process is repeated over and over through successive layers. In the end, the CNN produces a list of numbers or a single number, depending on the application.
Figure: The image on the left shows what pooling looks like. The image on the right represents how one of the neurons filters and generates new images based on the input, that is, the convolution operation.
Based on the result, the image can then be classified based on what the system is looking for. For example, if the resulting number is positive, the image can be classified as containing a hot dog, and if the resulting number is negative, then the image is classified as not containing a hot dog. But this assumes that we know what each neuron looks like, that is, what the filter looks like for every layer. In the beginning, the neurons are completely random, and by using the backpropagation technique, the neurons are updated in such a way that they produce the desired result.
Figure: An image of a cat that goes through a multilayer neural network. In the last step of this neural network, a number comes out. If it is positive, then the neural network classifies the image as a cat. If it is negative, it classifies the image as a dog.
A CNN is trained by showing it many images tagged with their results—the labels. This set is called the training data. The neural network updates its weights, based on whether it classifies the images properly or not, using the backpropagation algorithms. After the training stage, the resulting neural network is the one used to classify new images. Even though CNNs were created based on how the visual cortex works, they can also be used in text, for example. To do that, the inputs are translated to a matrix to match the format of an image.
There is a misconception that deep neural networks are a black box, that is, that there is no way of knowing what they are doing. The thing is that there is no way of determining for every input, either image, sound, or text, what the resulting output is, or if the network is going to classify it correctly. But that does not mean that there is no way of determining what each layer does in a neural network.
Figure: How the filters look (gray) for a CNN classifying objects and the corresponding images that activate these filters. The filters in these images in Layer 1 detect edges and, in Layer 2, detect waves and other patterns. Visualizations of Layer 1 and 2. Each layer illustrates 2 pictures, one which shows the filters themselves and one that shows the parts of the image that are most strongly activated by the given filter. For example, in the space labeled Layer 2, we have representations of the 16 different filters (on the left).
In fact, for CNNs, you see what the filters look like and what kind of images activate each layer. The weights in each neuron can be interpreted as pictures. The figure above shows the filters at different layers and also some examples of images that activate these layers. For example, in the first layer of a multilayer CNN, the filters, or weights, for the neurons look like edges. That means that the filters will activate when edges are found. The second layer of filters shows that the types of images that they activate are a little more complex, with eyes, curves, and other shapes. The third layer activates with images such as wheels, profiles of people, birds, and faces. That means that at each layer of the neural network, more complex images are filtered. The first layer filters and passes the information to the next layer that says if an area contains edges or not. Then, the next layer uses that information, and from the detected edges, it will try to find wheels and so forth. The last layer will identify the categories that humans want to know about: it will identify, for example, whether the image contains a cat, hot dog, or human.