What’s not fully realized is that Moore’s Law was not the first but the fifth paradigm to bring exponential growth to computers. We had electromechanical calculators, relay-based computers, vacuum tubes, and transistors. Every time one paradigm ran out of steam, another took over.
Ray Kurzweil*
The power of deep learning depends on the design as well as the training of the underlying neural networks. In recent years, neural networks have become complicated, often containing hundreds of layers. This imposes higher computational requirements, causing an investment boom in new microprocessors specialized for this field. The industry leader Nvidia earns at least $600M per quarter from selling its processors to data centers and companies like Amazon, Facebook, and Microsoft.
Facebook alone runs convolutional neural networks at least 2 billion times each day. That is just one example of how intensive the computing needs are for these processors. Tesla cars with Autopilot enabled also need enough computational power to run their software. To do so, Tesla cars need a super processor: a graphics processing unit (GPU).
Most of the computers that people use today, including smartphones, contain a central processing unit (CPU). This is the brain of the computer, the part of the machine where most of the computation happens. A GPU is similar to a CPU because it is also made of electronic circuits, but it specializes in accelerating the creation of images in video games and other applications. The same operations that games need in order to draw images on people’s screens are also used to train neural networks and run them in the real world, so GPUs are much more efficient for these tasks than CPUs. Because most of the computation a self-driving car needs takes the form of neural networks, Tesla added GPUs to its cars so that they can drive themselves through the streets.
Nvidia, a company started by Taiwanese immigrant Jensen Huang,* produces most of the GPUs used by companies, including Tesla, Mercedes, and Audi.* Tesla uses the Nvidia Drive PX2, which is designed for self-driving cars.* The Nvidia Drive processor has a specialized instruction set that accelerates neural networks’ performance at runtime and can compute 8 TFLOPS, meaning 8 trillion floating-point math operations per second.
TFLOPS (trillion floating-point math operations per second) is a unit for measuring the performance of chips. It is commonly used to compare how much power different chips have for processing neural networks.
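As a rough illustration of what the unit means in practice, the sketch below converts a TFLOPS rating into the time a chip would need for a given amount of arithmetic. The function name and the workload size are made up for this example, and the result is an ideal lower bound, since real chips rarely sustain their peak rating.

```python
def seconds_to_run(total_operations: float, tflops: float) -> float:
    """Ideal time to finish a workload on a chip with the given peak TFLOPS."""
    ops_per_second = tflops * 1e12  # 1 TFLOPS = 10^12 floating-point operations per second
    return total_operations / ops_per_second


# Hypothetical workload of 4 * 10^13 floating-point operations,
# run on a chip rated at the 8 TFLOPS quoted for the Nvidia Drive processor.
workload = 4e13
print(f"{seconds_to_run(workload, tflops=8):.1f} seconds at 8 TFLOPS")
```

Doubling the TFLOPS rating halves the ideal running time, which is why the unit is a convenient yardstick for comparing chips.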
Booming demand for Nvidia’s products has supercharged the company’s growth.* From January 2016 to August 2021, the stock soared from $7 to $220.* Most of the money that Nvidia makes today comes from the gaming industry, but even though auto applications are a new field for the company, they already account for $576M annually, or 2% of its revenue. And the self-driving industry is just beginning.*
Video games were the flywheel, or the killer app as it’s called in Silicon Valley, for the company. They have an incredibly high potential sales volume and at the same time represent one of the most computationally challenging problems. Video games helped Nvidia enter the market of GPUs, funding R&D for making more powerful processors.
The amount of computation that GPUs, like CPUs, can handle has followed an exponential curve over the years. Moore’s Law is the observation that the number of transistors on a chip doubles roughly every two years.* The transistor is the basic element of a CPU.
Gordon Moore, co-founder of Intel, one of the most important companies developing microprocessors, formulated this observation. As transistor counts grew, the computational power of CPUs increased exponentially. The number of operations per second that GPUs can process, measured in TFLOPS, has followed a similar exponential curve.
But even with the growing capacity of GPUs, there was a need for more specialized hardware developed specifically for deep learning. As deep learning became more and more widely used, the demand for processing units tailored for the technique outgrew what GPUs could provide. So, large corporations started developing equipment specifically designed for deep learning, and Google was one of them. When Google concluded that it needed twice as many CPUs as it had in its data centers to support its deep learning models for speech recognition, it created an internal group to develop hardware intended to process neural networks more efficiently. To deploy the company’s models, it needed to develop a specialized processor.
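To get a feel for how quickly a quantity grows when it doubles every two years, the arithmetic can be sketched in a few lines. The function name and the chosen time spans are illustrative only, and the two-year period is the rough figure quoted above rather than an exact constant.

```python
def growth_factor(years: float, doubling_period_years: float = 2.0) -> float:
    """How many times larger a quantity becomes if it doubles every period."""
    return 2 ** (years / doubling_period_years)


# Growth after 2, 10, and 20 years of doubling every two years.
for years in (2, 10, 20):
    print(f"after {years} years: {growth_factor(years):,.0f}x")
```

Ten doublings, which take twenty years at this pace, already amount to a factor of about a thousand.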
Tensor Processing Unit
In its quest to make a more efficient processor for neural networks, Google developed what is called a tensor processing unit (TPU). The name comes from the fact that the software uses the TensorFlow framework, which we discussed previously. The calculations that TPUs handle, such as the multiplications of linear algebra, do not need as much mathematical precision as the video processing that GPUs do, which means that TPUs need fewer resources and can do many more calculations per second.
Google released its first TPU in 2016. This version of its deep learning processor was targeted solely at inference, meaning it only focused on running networks that had already been trained. For inference, an already trained model can run on a single chip. But to train a model, you need multiple chips to get a fast turnaround, which keeps your programmers from waiting a long time to see if it works.
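The trade-off between precision and cost can be illustrated with a small sketch. One common form of reduced precision is to round numbers to 8-bit integers, do the multiplications on those integers, and scale the result back afterward. The code below is only an illustration of that general idea, not a description of how a TPU is built; the function names and sample values are invented for the example.

```python
def quantize(values, scale):
    """Map floats to 8-bit integers in [-127, 127] using a shared scale."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def quantized_dot(a, b):
    """Approximate a dot product using only integer multiply-accumulate."""
    # Assumes each list has at least one nonzero entry (otherwise the scale is 0).
    scale_a = max(abs(v) for v in a) / 127
    scale_b = max(abs(v) for v in b) / 127
    qa, qb = quantize(a, scale_a), quantize(b, scale_b)
    integer_sum = sum(x * y for x, y in zip(qa, qb))  # integer arithmetic only
    return integer_sum * scale_a * scale_b            # rescale back to floats


# Hypothetical weights and inputs of a single artificial neuron.
weights = [0.12, -0.53, 0.91, 0.07]
inputs = [1.5, 0.25, -0.75, 2.0]

exact = sum(w * x for w, x in zip(weights, inputs))
approx = quantized_dot(weights, inputs)
print(f"full precision: {exact:.4f}, 8-bit approximation: {approx:.4f}")
```

The two results differ only slightly, and neural networks usually tolerate errors of this size, while integer circuits are smaller and use less energy than floating-point ones.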
That is a much harder problem to solve because you need to interconnect the chips and ensure that they are in sync and communicating the appropriate messages. So, Google decided to release a second version of TPUs a year later with the added feature that developers could train their models on these chips. And a year later, Google released its third generation of TPUs that could process eight times more than the previous version and had liquid cooling to address their intense use of power.*
To get an idea of how powerful these processing chips are, a single second-generation TPU can run around 120 TFLOPS, or 200 times the calculations of a single iPhone.* Companies are battling to produce the hardware that processes neural networks the fastest. After Google announced its second-generation TPUs, Nvidia announced its newest GPU, called the Nvidia Volta, which delivers around 100 TFLOPS.*
But still, TPUs are around 15 to 30 times faster than GPUs, allowing developers to train their models much faster than with the old processors. Not only that, but TPUs are much more energy-efficient compared to GPUs, allowing Google to save a lot of money on electricity. Google is investing heavily in deep learning and related compilers, which are the programs that translate human-readable code into machine-readable code. That means it needs improvements in both the physical (hardware) and digital (software) space. This research and development field is so big that Google has entire divisions dedicated to making improvements in different parts of its development pipeline.
Google is not the only giant working on its own specialized hardware for deep learning. The iPhone 12’s processor, the A14 Bionic chip, also includes a specialized unit for running neural networks.* This little electronic unit can process up to 0.6 TFLOPS, or 600 billion floating-point operations per second. Some of that processing power is used for facial recognition when unlocking the phone, powering FaceID. Tesla has also developed its own processing chips to run its neural networks, improving its self-driving car software.* The latest chip Tesla developed and released can process up to 36 TFLOPS.*
The size of neural networks has been growing, and thus the processing power required to create and run the models has also increased. OpenAI released a study that showed that the amount of compute used in the largest AI training runs has been increasing exponentially, doubling every 3.5 months. And, they expect that the same growth will continue over the next five years. From 2012 to 2018, the amount of compute used to train these models increased 300,000x.*
Figure: The amount of compute, in petaflop/s-days, used to train the largest neural networks. A petaflop/s is a computing speed of 10^15 floating-point operations per second, and a petaflop/s-day represents that rate sustained over a day, or about 10^20 operations.
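The numbers in this section can be checked with a short calculation. The sketch below works out how many operations one petaflop/s-day contains and how long a 300,000x increase would take at one doubling every 3.5 months. The constant and function names are chosen for this example only.

```python
import math

PETAFLOPS = 1e15                 # floating-point operations per second
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds

# Total operations performed at one petaflop/s sustained for a full day.
ops_per_petaflop_s_day = PETAFLOPS * SECONDS_PER_DAY


def months_for_growth(growth: float, doubling_months: float = 3.5) -> float:
    """Months needed to grow by the given factor at a fixed doubling time."""
    return math.log2(growth) * doubling_months


print(f"one petaflop/s-day = {ops_per_petaflop_s_day:.2e} operations")
print(f"300,000x growth takes about {months_for_growth(300_000) / 12:.1f} years")
```

A 300,000x increase corresponds to a little over 18 doublings, which at 3.5 months each comes to between five and six years, in line with the 2012 to 2018 period quoted above.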
This growth has parallels in the biological world wherein there is a clear correlation between the cognitive capacity of animals and the number of pallial or cortical neurons. It should follow that the number of neurons of an artificial neural network simulating animals’ brains should affect the performance of these models.
As time passes and the amount of compute used for training deep learning models increases, more and more companies will develop specialized chips to handle the processing, and an increasing number of applications will use deep learning to achieve all types of tasks.