It is the obvious which is so difficult to see most of the time. People say ‘It’s as plain as the nose on your face.’ But how much of the nose on your face can you see, unless someone holds a mirror up to you?
Isaac Asimov, I, Robot*
In a blue room with three judges, the best Go player of the last decade, Lee Sedol, plays against an amateur, Aja Huang, who is assisted by an artificial intelligence (AI) system. Via a computer screen on Huang’s left, AlphaGo instructs him on where to place each piece on the board. The match is a landmark in the history of artificial intelligence. If Huang wins, it will be the first time an AI system has beaten the highest-ranked Go player.
Many photographers and videographers stand in the room to stream the match to the millions watching, both live and on replay. Lee Sedol chooses the black stones, giving him the chance to start and his opponent seven and a half points as compensation.
The match between Sedol and AlphaGo started intensely. AlphaGo used strategies that only top professional players use, and the commentators were surprised at how human it looked. But AlphaGo was far from human. It calculated all the best options and could predict and place each piece in the best spot on the board. As the match went on, Sedol began to feel more nervous. After a surprising move by the AI, Sedol looked at Huang’s face to try to understand what his opponent was feeling, a technique used by Go players. But this time, he could not read his opponent because AlphaGo had no expression.
Then, Huang placed a white stone in a location that seemed to be an error. Commentators did not understand why AlphaGo would make such a rookie mistake. But in fact, AlphaGo had made all the calculations, and it was about to win the game. Almost four hours after the match started, Sedol was unable to beat this superhuman being. He resigned, defeated. It was the first time that a computer had beaten the world champion of Go, representing an extraordinary achievement in the development of artificial intelligence. By the end of the March 2016 tournament, AlphaGo had beaten Sedol in four of the five games.
While the exact origins are unknown, Go dates from around 3,000 years ago. The rules are simple, but the board is a 19-by-19 grid with 361 points to place pieces, meaning the game has more possible positions than atoms in the universe. Therefore, the game is extremely hard to master. Go players tend to look down on chess players because of the exponential difference in complexity.
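A rough back-of-the-envelope check makes that comparison concrete. The sketch below assumes that each of the 361 points can be empty, black, or white, which gives a loose upper bound of 3^361 board configurations (many of them are not legal positions), and it uses the commonly cited order of magnitude of about 10^80 atoms in the observable universe. Both figures are illustrative assumptions, not numbers taken from this book.

```python
# Rough upper bound on Go board configurations versus a common
# estimate of the number of atoms in the observable universe.
BOARD_SIZE = 19
POINTS = BOARD_SIZE * BOARD_SIZE  # 361 intersections

# Assumption: each point is empty, black, or white (legality is ignored).
upper_bound_positions = 3 ** POINTS

# Assumption: roughly 10^80 atoms, a commonly cited order of magnitude.
ATOMS_ESTIMATE = 10 ** 80

digits = len(str(upper_bound_positions))
print(f"Points on the board: {POINTS}")
print(f"Upper bound has {digits} digits")
print(f"Exceeds atom estimate: {upper_bound_positions > ATOMS_ESTIMATE}")
```

Even this crude bound is a number with 173 digits, while the atom estimate has only 81.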
Chess is a game where Grandmasters already know the openings and strategies and, in a way, play not to make mistakes. Go, however, has many more options and requires thinking about the correct strategy and making the correct moves early on. Throughout Go’s history, three momentous shifts have taken place regarding how to play the game. Each of these eras represented a total change in the strategies used by Go’s best players.
Warlord Tokugawa led the first revolution in the 1600s, increasing the popularity of Go as well as raising the needed skill level.* The second transformation occurred in the 1930s. Go Seigen, one of the greatest Go players of the 20th century, and Kitani Minoru departed from the traditional opening moves and introduced Shinfuseki, making a profound impact on the game.*
The latest revolution happened in front of a global audience watching Sedol play AlphaGo. Unlike the first transformations, the third shift was brought about not by a human but rather by a computer program. AlphaGo not only beat Sedol, but it played in ways that humans had never seen or played before. It used strategies that would shape the way Go was played from then on.
It was not a coincidence that a computer program beat the best human player: it was due to the development of AI and, specifically, Go engines over the preceding 60 years. It was bound to happen.
Figure: Elo ratings of the most important Go AI programs.
This figure shows the Elo ratings (a way to measure the skill level of players in head-to-head competitions) of the different Go software systems. The colored ovals indicate the type of software used. Each technical advancement for Go engines produced a jump in the performance of the best of them. But even with the same Go engine, the best AI players performed better over time, showing the probable effect that better hardware had on how Go engines executed: the faster the computer, the better it played.
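The text does not spell out how Elo ratings work, so here is a minimal sketch of the standard Elo expected-score formula and update rule. The ratings and the step size `k=32` are hypothetical values chosen only to illustrate the arithmetic; they are not taken from the figure.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B (standard Elo formula)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_rating(rating: float, expected: float, actual: float, k: float = 32.0) -> float:
    """Move a rating toward the observed result; k is an assumed step size."""
    return rating + k * (actual - expected)


# Hypothetical ratings, chosen only to illustrate the arithmetic.
engine, human = 3000.0, 2800.0
e = expected_score(engine, human)
print(round(e, 2))  # a 200-point gap gives the stronger side about 0.76
```

A win counts as 1, a loss as 0, and a draw as 0.5, so a player who beats a much stronger opponent gains more points than one who beats an equal.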
Go engines are only one example of the development of artificial intelligence systems. Some argue that the reason why AI works well in games is that game engines are their own simulations of the world. That means that AI systems can practice and learn in these virtual environments.
People may often misunderstand the term artificial intelligence for several reasons. Some associate AI with how it is presented on TV shows and movies, like The Jetsons or 2001: A Space Odyssey. Others link it with puppets that look like humans but do not present any intelligence. Yet, these are inaccurate representations of AI. In actuality, artificial intelligence encompasses several pieces of software we interact with daily, from the Spotify playlist generator to voice assistants like Alexa and Siri. People especially associate AI with robots, often those that walk and talk like humans. AI is like a brain, and a robot is just one possible container for it. And the vast majority of robots today don’t look or act like humans. Think of a mechanical arm in an assembly line. An AI may control the arm to, among other things, recognize parts and orient them correctly.
Computer science defines artificial intelligence (or AI) as the capability of a machine to exhibit behaviors commonly associated with human intelligence. It is called that to contrast with the natural intelligence found in humans and other animals. Computer scientists also use the term to refer to the study of how to build systems with artificial intelligence.
For example, an AI system can trade stocks or answer your requests and can run on a computer or phone. When people think of artificial intelligence systems, they typically compare them to human intelligence. Because computer science defines AI as an approach to solving particular tasks, one of the ways to compare it to human intelligence is to measure its ability to achieve specific functions in comparison to the best humans.
Software and hardware improvements are making AI systems perform much better in specific tasks. Below, you see some of the milestones at which artificial intelligence has outperformed humans and how these accomplishments are becoming more frequent over time.
Figure: Artificial intelligence systems continue to do better and better at different tasks, eventually outperforming humans.
Artificial general intelligence (AGI) is an AI system that outperforms humans across the board, that is, a computer system that can execute any given intellectual task better than a human.
For some tasks like Go, AI systems now perform better than humans. And the trend shows that AI systems are outperforming humans at harder and harder assignments and doing so more often. That is, the trend suggests that artificial general intelligence is within reach.
In the first section of Making Things Think, I talk about the history and evolution of AI, explaining in layperson’s terms how these systems work. I cover the critical technical developments that caused big jumps in performance but not some of the topics that are either too technical or not as relevant, such as k-NN regression, identification trees, and boosting. Following that, I talk about deep learning, the most active area of AI research today, and cover the development trends of those methods and the players involved. But more importantly, I explain why Big Data is vital to the field of AI. Without this data, artificial intelligence cannot succeed.
The next section describes how the human brain works and compares it to the latest AI systems. Understanding how biological brains work and how animals and humans learn with them can shed light on possible paths for AI research.
We then turn to robotics, with examples of how industry is using AI to push automation further into supply chains, households, and self-driving cars.
The next section contains examples of artificial intelligence systems in industries such as space, real estate, and even our judicial system. I describe the use of AI in specific real-world situations, linking it back to the information presented earlier in the book. The final section covers the risks and impacts of AI systems. It starts with how these systems can be used for surveillance, includes the economic impact of AI, and ends with a discussion of the possibility of AGI.
If current trends continue, some AI researchers believe society might develop artificial general intelligence by the end of the century. Some people say this is not possible because such a system will never have consciousness or the same creativity as humans. Others argue that AI systems do not present the same type of capabilities as human brains. While I personally believe that AGI will be achieved, I will not debate in this book whether such a system is possible or not; I show the trends, and the task of figuring out if it will happen is left to you, the reader.*
I was born and raised in São Paulo, Brazil. I was lucky enough to be one of the two Brazilians admitted to the undergraduate program at MIT. Coming to the US to study was a dream come true. I was really into mathematics and published some articles in the field,* but I ended up loving computer science, specifically artificial intelligence.
I’ve since spent almost a decade in the field of artificial intelligence, from my master’s in machine learning to my time working at a company that personalizes emails and ads for the largest e-commerce brands in the world. Over these years, I’ve realized how much these systems were affecting people’s everyday lives, from self-driving car software to video recommendations. One credible prediction is that artificial intelligence could scale from about $2.5 trillion to $87 trillion in enterprise value by 2030; for comparison, the internet generated around $12 trillion of enterprise value from 1997 to 2021.*
Even though these systems are everywhere and will become more important over time as they become more capable, few people have a concrete idea of how they work. On top of that, we see both fanfare and fear in the news about the capability of these systems. But the headlines often dramatize certain problems, focus on unrealistic scenarios, and neglect important facts about how AI and recent developments in machine learning work—and are likely to affect our lives.
As an engineer, I believe you should start with the facts. This book aims to explain how these systems work so you can have an informed opinion, and assess for yourself what is reality and what is not.
But to do that, you also need context. Looking to the past (including inaccurate predictions from the past) informs our view of the future. So the book covers the history of artificial intelligence, going over its evolution and how the systems have been developed. I hope this work can give an intelligent reader a practical and realistic understanding of the past, present, and possible future of artificial intelligence.
Some people say that “it takes a village” to raise a child. I believe that everything that we build “takes a village.” This book is no exception.
I want to thank my late grandmother Leticia for giving me the aid that I needed. Without her help, I wouldn’t have been able to graduate from MIT.
My mom made me believe in dreams, and my dad taught me the value of a strong work ethic. Thank you for raising me.
I want to thank my brother and his family for always supporting me when I need it, and my whole family for always helping me out. Thanks also go to my friends, especially two close friends, Aldo Pacchiano and Beneah Kombe, for being my extended family.
I would also like to thank Paul English for being an amazing mentor and an incredible person overall, and Homero Esmeraldo, Petr Kaplunovich, Faraz Khan, James Wang, Adam Cheyer, and Samantha Mason for revising this book.
Finally, I want to thank my wife, Juliana Santos, for being with me on the ups and downs, and especially on the downs. Thank you for being with me on this journey!
The advancement of artificial intelligence (AI) has not been a straight path—there have been periods of booms and busts. This first section discusses each of these eras in detail, starting with Alan Turing and the initial development of artificial intelligence at Bletchley Park in England, and continuing to the rise of deep learning.
The 1930s to the early 1950s saw the development of the Turing machine and the Turing test, which were fundamental in the early history of AI. The official birth of artificial intelligence was in the mid-1950s with the onset of the field of computer science and the creation of machine learning. The year 1956 ushered in the golden years of AI with Marvin Minsky’s Micro-Worlds.
For eight years, AI experienced a boom in funding and growth in university labs. Unfortunately, the government, as well as the public, became disenchanted with the lack of progress. While producing solid work, those in the field had overpromised and underdelivered. From 1974 to 1980, funding almost completely dried up, especially from the government. There was much criticism during this period, and some of the negative press came from AI researchers themselves.
In the 1980s, computer hardware was transitioning from mainframes to personal computers, and with this change, companies around the world adopted expert systems. Money flooded back into AI. The downside to expert systems was that they required a lot of data, and in the 1980s, storage was expensive. As a result, most corporations could not afford the cost of AI systems, so the field experienced its second bust. The rise of probabilistic reasoning ends the first section of Making Things Think at around the year 2001.
I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted.
Alan Turing*
This chapter covers Alan Turing, the initial developments of artificial intelligence at Bletchley Park in England, and how they helped break Germany’s codes and win World War II. I also describe the development of the Turing machine and the Turing test, which served for decades as the gold standard for evaluating artificial intelligence systems. We’ll also meet Arthur Samuel and Donald Michie, who worked on early artificial intelligence systems and created engines that let machines play games.
During the Second World War, the British and the Allies had the help of thousands of codebreakers located at Bletchley Park in the UK. In 1939, one of these sleuths, Alan Turing, a young mathematician and computer scientist, was responsible for the design of the electromechanical machine named the Bombe. The British used this device to break the German Enigma Cipher.* At the same location in 1943, Tommy Flowers, with contributions from Turing, designed the Colossus, a set of computers built with vacuum tubes, to help the Allies crack the Lorenz Cipher.* These two devices helped break the German codes and predict Germany’s strategy. According to US General Eisenhower, cracking the enemy codes was decisive for the Allies winning the war.
These events marked the initial development of artificial intelligence. In their free time during the war, Turing and Donald Michie, a cryptographer recruited to Bletchley Park, had a weekly chess game. While playing, they talked about how to write a computer program that would play against human opponents and beat them. They sketched their designs with pen and paper. Unfortunately, they never went ahead and coded their program. At the time, the state-of-the-art computer was the Atanasoff-Berry computer, designed to only solve linear equations. It would have been very hard for the pair to code a program that could beat humans using such computers. However, these meetings contributed to the early beginnings of the artificial intelligence field. Because of his work during and after the war, Turing became known as the father of theoretical computer science and artificial intelligence.
But when the war ended, the group that once worked together at Bletchley Park parted ways. Turing, however, did not stop his research; he continued in the computer field. He had already made a name for himself before the war with his seminal 1936 paper* on computing, explaining how machines like computers worked. This mathematical structure became the basis for modeling computers and was later named the Turing machine. Between 1945 and 1947, Turing designed the Automatic Computing Engine (ACE), an early electronic stored-program computer, at the National Physical Laboratory. He continued pursuing the idea of writing a chess program and worked on the theoretical framework for doing so. By 1948, he, working with David Champernowne, a former undergraduate colleague, began coding the program even though no computer at the time could run it. By 1950, he had finished Turochamp.
In 1952, he tried to implement Turochamp on a Ferranti Mark 1, the first commercially available general-purpose electronic computer. But the machine lacked enough computing power to execute Turochamp. Instead, Turing ran the computer program by hand, flipping through the pages of the algorithm. This exercise marked the first demonstration of a working artificial intelligence system. It would take 45 more years for a computer program to win against a chess world champion. The humble beginnings of AI started with Turing’s work.
With the rapid development in computing and AI, Turing wrote about the future of the field in his seminal 1950 paper “Computing Machinery and Intelligence.”* He predicted that by the 2000s, society’s opinion regarding artificial intelligence would shift completely due to technological advances. His prediction, in some ways, turned out to be correct.
Neural networks are computer systems that are modeled (more or less loosely) on how neurons in the human brain function and interact.
By 1945, Turing was already thinking about how to simulate the human brain with a computer. His Automatic Computing Engine created models of how the brain worked. In a letter to a coworker, he wrote, “I am more interested in the possibility of producing models of the action of the brain than in the practical applications to computing … although the brain may in fact operate by changing its neuron circuits by the growth of axons and dendrites, we could nevertheless make a model, within the ACE, in which this possibility was allowed for, but in which the actual construction of the ACE did not alter, but only the remembered data …”*
In 1948, Turing defined two types of unorganized machines, which would be the first computer models of brains and become the basis of neural networks. He based one on how transistors work and the other on how neural networks would eventually be modeled. Around the same time, he also defined genetic search to configure his unorganized machines by searching for the best model of a neural network for a given task.
Figure: Alan Turing, who founded the fields of theoretical computer science and artificial intelligence.
The Turing test is a game in which evaluators try to guess which of two participants is a computer. The evaluators know only that one of the two participants is a computer. The conversation is text-only, for example typed on a keyboard and read on a screen. If the evaluators cannot reliably tell the machine from the human, then the computer passes the test and can be said to exhibit human-level intelligence.
Alan Turing defined the Imitation Game in 1950; it later became more commonly known as the Turing test and became the gold standard for figuring out if a computer exhibits the same intelligence as a human. In the party game that inspired the Imitation Game, a man and a woman occupy different rooms, and the onlookers try to guess who is in which room by reading their typewritten responses to questions. The contestants answer in a way that tries to convince the judges that they are the other person. In the Turing test, instead of a man and a woman, the interaction happens between a human and a computer.
Figure: The Turing test. During the Turing test, the human questioner asks a series of questions to both respondents. After the specified time, the questioner tries to decide which terminal is operated by the human respondent and which terminal is operated by the computer.
After the war, Michie, Turing’s friend and fellow codebreaker, became a senior lecturer in surgical science at the University of Edinburgh. Even though his day job was not related to AI, he continued working on the development of artificial intelligence systems, especially games.
Michie did not have access to a digital computer because they were too costly at the time. While many hurdles existed, he developed a program to play a perfect tic-tac-toe game with 304 matchboxes, each representing a unique board state.* Michie’s machine not only played tic-tac-toe but was also able to improve on its own over time—learning how to better play the game.
In 1949, a coder named Arthur Samuel, an expert on vacuum tubes and transistors at IBM, brought IBM’s first commercial general-purpose digital computers to the market. On the side, he worked to implement a program that, by 1952, could play checkers against a human opponent. It was the first artificial intelligence program to be written and run in the United States. He worked tirelessly, and in 1956, he demonstrated it to the public. Samuel improved the underlying software by hand, and when he had access to a computer, he made changes there.
As time passed, he started wondering if the machine could make the same improvements by itself, instead of him having to write the rules for the program by hand. He pondered whether the device could do all the fine-tuning itself. With this idea in mind, he published a paper titled “Some Studies in Machine Learning Using the Game of Checkers.”*
Machine learning is the process in which a machine learns the variables of a problem and fine-tunes them on its own instead of humans hard-coding the rules for reaching the solution.
Samuel’s publication marked the birth of machine learning. One of the two learning techniques that Samuel described in his paper was called rote learning. Today, this technique is known as memoization, a computer science strategy used to speed up computer programs. The other method involved measuring how good or bad a specific board position was for the computer or its human opponent. By improving the measurement of a board state, the program could become better at playing the game. In 1961, Samuel’s program beat the Connecticut state checkers champion. It was the first time that a machine trumped a player in a state competition, a pattern that would repeat in the years to come.
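Rote learning in this sense, remembering positions that have already been evaluated, can be illustrated with a few lines of modern code. The following is a minimal sketch that uses a toy take-away game rather than checkers; the game and the function name are illustrative assumptions, and the caching comes from Python's standard `functools.lru_cache`.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def current_player_wins(stones: int) -> bool:
    """Toy game: players alternate removing 1-3 stones; taking the last stone wins.

    The cache plays the role of rote learning: each position is
    evaluated once and then simply remembered.
    """
    if stones == 0:
        return False  # no stones left: the previous player took the last one
    # A position is winning if some move leaves the opponent in a losing one.
    return any(not current_player_wins(stones - take)
               for take in (1, 2, 3) if take <= stones)


print(current_player_wins(21))  # True: the player to move can force a win
print(current_player_wins(20))  # False: multiples of 4 are losing positions
```

Without the cache, the same positions would be re-evaluated over and over; with it, each position is computed only once, which is the speedup that memoization provides.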
If after I die, people want to write my biography, there is nothing simpler. They only need two dates: the date of my birth and the date of my death. Between one and another, every day is mine.
Fernando Pessoa*
The birth of artificial intelligence was seen with the initial development of neural networks including Frank Rosenblatt’s creation of the perceptron model and the first demonstration of supervised learning. That led to the Georgetown-IBM experiment, an early language translation system. Finally, the end of the beginning was marked by the Dartmouth Conference, at which artificial intelligence was officially launched as a field in computer science, leading to the first government funding of AI.
In 1943, Warren S. McCulloch, a neurophysiologist, and Walter Pitts, a mathematical prodigy, created the concept of artificial neural networks. They designed their system based on how our brains work and patterned it after the biological model of how neurons—brain cells—work with each other. Neurons interact with their extremities, firing signals via their axon across a synapse to neighboring neurons’ dendrites. Depending on the voltage of this electrical charge, the receiving neuron proceeds to either fire a new charge of electrical pulse to the next set of neurons, or not.
Figure: Artificial neural networks are based on the simple principle of electrical charges and how they are passed in the brain.
The hard part of modeling the correct artificial neural network, that is, one that achieves the task that you are trying to solve, is that you need to figure out what voltage one neuron should pass to another as well as what it takes for a neuron to fire.
Both the voltages and the firing criteria become variables that need to be determined for the model. In an artificial neural network, the voltage that is passed from neuron to neuron is called a weight. These weights need to be trained so that the artificial neural network performs the task at hand. One of the earliest ways to do this is called Hebbian learning, which we’ll talk about next.
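To make the two variables concrete, here is a minimal sketch of a single threshold neuron in the spirit of the McCulloch-Pitts model. The specific weights and thresholds are illustrative assumptions, picked by hand so that the neuron behaves like a logical AND or OR.

```python
def neuron_fires(inputs, weights, threshold):
    """Return 1 if the weighted sum of inputs reaches the threshold, else 0."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0


# Illustrative weights and thresholds (assumptions, not from the text).
def logical_and(a, b):
    return neuron_fires([a, b], weights=[1, 1], threshold=2)


def logical_or(a, b):
    return neuron_fires([a, b], weights=[1, 1], threshold=1)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, logical_and(a, b), logical_or(a, b))
```

Here the weights and thresholds were set by hand. Training a network means finding such values automatically.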
In 1947, around the same time that Arthur Samuel was working on the first computer that would beat a state checkers champion, Donald Hebb, a Canadian psychologist with a PhD from Harvard University, became a Professor of Psychology at McGill University. Hebb would later propose one of the first theories of how networks of neurons learn.
In 1949, Hebb developed a theory known as Hebbian learning, which proposes an explanation for how our neurons fire and change when we learn something new. It states that when one neuron fires to another, the connection between them develops or enlarges. That means that whenever two neurons are active together, because of some sensory input or other reason, these neurons tend to become associated.
Therefore, the connections among neurons become stronger or grow when the neurons fire together, making the link between the two neurons harder to break. Hebb explained how that is the way humans learn. Hebbian learning, the process of making connections stronger between neurons that fire together, was the way to create artificial neural networks early on, but later, other techniques became more predominant.
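Hebb's idea is often summarized as a simple update rule: the change in a connection's weight is proportional to the product of the activity of the two neurons it links. The sketch below assumes that common textbook formulation; the learning rate and the activity log are hypothetical values for illustration.

```python
def hebbian_update(weight, pre, post, learning_rate=0.1):
    """Strengthen the connection when both neurons are active together."""
    return weight + learning_rate * pre * post


# Hypothetical activity log: (pre-synaptic, post-synaptic) firing, 1 = active.
activity = [(1, 1), (1, 0), (0, 1), (1, 1), (1, 1)]

weight = 0.0
for pre, post in activity:
    weight = hebbian_update(weight, pre, post)

print(round(weight, 2))  # 0.3: only the three joint firings strengthened the link
```

The weight changes only on the steps where both neurons fire, which is the "fire together, become associated" behavior described above.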
A network of neurons that becomes associated with a memory or pattern, so that all of its neurons fire together, became known as an engram. Gordon Allport defines engrams as, “If the inputs to a system cause the same pattern of activity to occur repeatedly, the set of active elements constituting that pattern will become increasingly strongly inter-associated. That is, each element will tend to turn on every other element and (with negative weights) to turn off the elements that do not form part of the pattern. To put it another way, the pattern as a whole will become ‘auto-associated.’ We may call a learned (auto-associated) pattern an engram.”*
With these models in mind, in the summer of 1951, Marvin Minsky, together with two other scientists, developed the Stochastic Neural Analog Reinforcement Calculator (SNARC)—a machine with a randomly connected neural network of approximately 40 artificial neurons.* The SNARC was built to try and find the exit from a maze in which the machine played the part of the rat.
Minsky, with the help of an American psychologist from Harvard, George Miller, developed the neural network out of vacuum tubes and motors. The machine first proceeded randomly, then the correct choices were reinforced by making it easier for the machine to make those choices again, thus increasing their probability compared to other paths. The device worked and made the imaginary rat find a path to the exit. It turned out that, by an electronic accident, they could simulate two or three rats in the maze at the same time. And they all found the exit.
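The idea of starting randomly and reinforcing successful choices can be sketched in software. The following is a loose illustration of that principle, not a reconstruction of the SNARC hardware: the maze layout, the reward size, and the number of trials are all assumptions made for the example.

```python
import random


def run_trial(correct_path, preferences, rng, reward=1.0):
    """Walk the maze once; return True if the exit was reached."""
    for junction, correct_turn in enumerate(correct_path):
        options = list(preferences[junction])            # e.g. ["L", "R"]
        weights = [preferences[junction][o] for o in options]
        turn = rng.choices(options, weights=weights)[0]  # weighted random choice
        if turn != correct_turn:
            return False                                 # dead end, trial over
        preferences[junction][turn] += reward            # reinforce the success
    return True


rng = random.Random(0)                                   # fixed seed for repeatability
correct_path = ["L", "R", "R", "L"]                      # hypothetical maze
preferences = [{"L": 1.0, "R": 1.0} for _ in correct_path]

results = [run_trial(correct_path, preferences, rng) for _ in range(500)]
early, late = sum(results[:50]), sum(results[-50:])
print(early, late)  # exits found in the first 50 trials versus the last 50
```

At first every turn is a coin flip. As correct turns accumulate weight, they are chosen more often, so the simulated rat reaches the exit far more reliably in later trials.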
Minsky thought that if he “could build a big enough network, with enough memory loops, it might get lucky and acquire the ability to envision things in its head.”* In 1954, Minsky published his PhD thesis, presenting a mathematical model of neural networks and its application to the brain-model problem.*
This work inspired young students to pursue a similar idea. They sent him letters asking why he did not build a nervous system based on neurons to simulate human intelligence. Minsky figured that this was either a bad idea or would take thousands or millions of neurons to make work.* And at the time, he could not afford to attempt building a machine like that.
In 1956, Frank Rosenblatt implemented an early demonstration of a neural network that could learn how to sort simple images into categories, like triangles and squares.*
Figure: Frank Rosenblatt* and an image with 20x20 pixels.
He built a computer with eight simulated neurons, made from motors and dials, connected to 400 light detectors. Each of the neurons received a set of signals from the light detectors and spat out either a 0 or 1 depending on what those signals added up to.
Rosenblatt used a method called supervised learning, which is a way of saying that the data that the software looks at also has information identifying what type of data it is. For example, if you want to classify images of apples, the software would be shown photos of apples together with the tag “apple.” This approach is much like how toddlers learn basic images.
Figure: The Mark I Perceptron.
The perceptron is a supervised learning algorithm for binary classifiers. Binary classifiers are functions that determine if an input, which can be a vector of numbers, is part of a class.
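A minimal sketch of the classic perceptron learning rule shows how such a classifier can be trained from labeled examples. The toy data set (the logical OR function), the learning rate, and the number of passes are illustrative assumptions standing in for Rosenblatt's labeled images.

```python
def predict(weights, bias, x):
    """Binary classifier: 1 if the weighted sum plus bias is positive, else 0."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, labels, epochs=20, learning_rate=1.0):
    """Classic perceptron rule: nudge weights only on misclassified samples."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            error = target - predict(weights, bias, x)  # -1, 0, or +1
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


# Toy, linearly separable data (logical OR), standing in for labeled images.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 1]
weights, bias = train_perceptron(samples, labels)
print([predict(weights, bias, x) for x in samples])  # [0, 1, 1, 1]
```

The labels are what make this supervised learning: each example comes with the correct answer, and the weights change only when the prediction disagrees with it.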
The perceptron algorithm was first implemented on the Mark I Perceptron. It was connected to a camera that used a 20x20 grid of cadmium sulfide* photocells* producing a 400-pixel image. Different combinations of input features could be experimented with using a patchboard. The array of potentiometers on the right* implemented the adaptive weights.*
Rosenblatt’s perceptrons classified images into different categories: triangles, squares, or circles. The New York Times featured his work with the headline “Electronic ‘Brain’ Teaches Itself.”* His work established the principles of neural networks. Rosenblatt predicted that perceptrons would soon be capable of feats like greeting people by name. The problem, however, was that his algorithm did not work with multiple layers of neurons due to the exponential nature of the learning algorithm: it required too much time for perceptrons to converge to what engineers wanted them to learn. This was eventually solved, years later, by a new algorithm called backpropagation, which we’ll cover in the section on deep learning.
A multilayer neural network consists of three or more layers of artificial neurons—an input layer, an output layer, and at least one hidden layer—arranged so that the output of one layer becomes the input of the next layer.
Figure: A multilayer neural network.
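The layered arrangement can be sketched as a simple forward pass in which each layer's output is fed to the next layer. The network shape (two inputs, three hidden neurons, one output), the sigmoid activation, and the weights below are arbitrary, untrained values assumed only for illustration.

```python
import math


def sigmoid(z):
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """Compute one layer: each row of weights feeds one neuron."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def network_forward(inputs, layers):
    """Feed the output of each layer into the next one."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical 2-3-1 network with arbitrary, untrained weights.
hidden = ([[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]], [0.0, 0.1, -0.1])
output = ([[1.0, -1.0, 0.5]], [0.2])
print(network_forward([1.0, 0.0], [hidden, output]))
```

This sketch only computes an output from fixed weights. It does not learn; finding good weights for the hidden layer is the training problem mentioned above.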
The Georgetown-IBM experiment translated English sentences into Russian and back into English. This demonstration of machine translation happened in 1954 to attract not only public interest but also funding.* This system specialized in organic chemistry and was quite limited, with only six grammar rules. An IBM 701 mainframe computer, designed by Nathaniel Rochester and launched in April 1953, ran the experiment.*
A feature article in the New York Times read, “A public demonstration of what is believed to be the first successful use of a machine to translate meaningful texts from one language to another took place here yesterday afternoon. This may be the cumulation of centuries of search by scholars for a mechanical translator.”
Figure: The Georgetown-IBM experiment translated 250 sentences from English to Russian.
The demo worked in some cases, but it failed for most of the sentences. One way of verifying whether a machine translated a phrase correctly was to translate it into the other language and then back into the original. If the resulting sentence had the same or a similar meaning as the original, then the translation worked. But in early machine translation, many sentences came back different from the original and with an entirely new meaning. In an often-repeated example, the original sentence “The spirit is willing, but the flesh is weak” came back as “The whiskey is strong, but the meat is rotten.”
The system simply could not understand the meaning, or semantics, of the sentence, making mistakes in translation as a result. The errors mounted, completely losing the original message.
AI was defined as a field of research in computer science at a conference at Dartmouth College in the summer of 1956. Marvin Minsky, John McCarthy, Claude Shannon, and Nathaniel Rochester organized the conference. They would become known as the “founding fathers” of artificial intelligence.
Ahead of the conference, these researchers wrote a proposal to the Rockefeller Foundation for funding. They divided the field into six subfields of interest: computers, natural language processing, neural networks, theory of computation, abstraction, and creativity.
Figure: From left to right: Trenchard More, John McCarthy, Marvin Minsky, Oliver Selfridge, and Ray Solomonoff.
At the conference, many predicted that a machine as intelligent as a human being would exist in no more than a generation, about 25 years. As you know, that was an overestimation of how quickly the development of artificial intelligence would proceed. The workshop lasted six weeks and started a boom in AI funding, which continued for 18 years until what would be called the First AI Winter.
The Defense Advanced Research Projects Agency (DARPA) supplied most of the money that went into the field during the period known as the Golden Years of artificial intelligence.
During this “golden” period, the early AI pioneers set out to teach computers to do the same complicated mental tasks that humans do, breaking them into five subfields: reasoning, knowledge representation, planning, natural language processing (NLP), and perception.
These general-sounding terms do have specific technical meanings, still in use today:
Reasoning. When humans are presented with a problem, we can work through a solution using reasoning. This area covers all the tasks that make up that process. Examples include playing chess, solving algebra problems, proving geometry theorems, and diagnosing diseases.
Knowledge representation. In order to solve problems, hold conversations, and understand people, computers must have knowledge about the real world, and that knowledge must be represented in the computer somehow. What are objects, what are people? What is speech? Specific computer languages were invented for the purpose of programming these things into the computer, with Lisp being the most famous. The engineers building Siri had to solve this problem for it to respond to requests.
Planning. Robots must be able to navigate in the world we live in, and that takes planning. Computers must figure out, for example, how to move from point A to point B, how to understand what a door is, and where it is safe to go. This problem is critical for self-driving cars so they can drive around roads.
Natural language processing. Speaking and understanding a language, and forming and understanding sentences are skills needed for machines to communicate with humans. The Georgetown-IBM experiment was an early demonstration of work in this area.
Perception. To interact with the world, computers must be able to perceive it, that is, they need to be able to see, hear, and feel things. Sight was one of the first tasks that computer scientists tackled. The Rosenblatt perceptron was the first system to address such a problem.
The question of whether a computer can think is no more interesting than the question of whether a submarine can swim.
Edsger Dijkstra
The Golden Years of AI started with the development of Micro-Worlds by Marvin Minsky as well as John McCarthy’s development of Lisp, the first programming language optimized for artificial intelligence. This era was marked by the creation of the first chatbot, ELIZA, and Shakey, the first robot to move around on its own.
The years after the Dartmouth Conference were an era of discovery. The programs developed during this time were, to most people, simply astonishing. The next 18 years, from 1956 to 1974, were known as the Golden Years.* Most of the work developed in this era was done inside laboratories in universities across the United States. These years marked the development of the important AI labs at the Massachusetts Institute of Technology (MIT), Stanford, Carnegie Mellon University, and Yale. DARPA funded most of this research.*
MIT housed not a laboratory per se but what would be called Project MAC.* MAC was an acronym for Mathematics and Computation. The choice of creating a project instead of a lab stemmed from internal politics. Started by Robert Fano in July 1963, Project MAC would eventually turn into the Computer Science and Artificial Intelligence Lab (CSAIL) inside MIT. This project was responsible for research in the areas of artificial intelligence, operating systems, and theory of computation. DARPA provided a $2M grant for MIT’s Project MAC.
Marvin Minsky directed the AI Group inside Project MAC. John McCarthy had worked alongside Minsky at MIT, where he created the high-level language Lisp in 1958; it became the dominant AI programming language for the next 30 years. At the time, credentialed computer scientists did not exist because universities did not yet have computer science programs. So, everyone involved in the project was a mathematician, a physicist, an electrical engineer, or a dropout.
Figure: John McCarthy, Lisp language inventor.*
Project MAC was responsible for many inventions,* including the creation of the first computer-controlled robotic arm by Marvin Minsky and the first chess-playing* program. The program, developed by McCarthy’s students, beat beginner chess players and used the same main techniques as Deep Blue, the computer-chess program that would beat Grandmaster Garry Kasparov years later.
The world is composed of many environments, each with different rules and knowledge. Russian grammar rules differ from those of English, which are entirely different from the standards for geometry. In 1970, Minsky and Seymour Papert suggested constraining their research into isolated areas; that is, they would focus on Micro-Worlds.* They concentrated on specific domains to see if programs could understand language in an artificially limited context. Most of the computer programs developed during the Golden Years focused on these Micro-Worlds.
One such program was SHRDLU, which was written by Terry Winograd at the MIT AI Lab to understand natural language.* In this experiment, the computer worked with colored blocks using a robotic arm and a video camera. SHRDLU responded to commands typed in English, such as “Grasp the pyramid.” The goal of this process was to build one or more vertical stacks of blocks. Some blocks could not be placed on top of others, making the problem more complex.
But the tasks involved more than merely following commands. SHRDLU performed actions in order to answer questions correctly. For example, when the person typed, “Can a pyramid be supported by a pyramid?”, SHRDLU tried to stack two pyramids and failed. It then responded, “I can’t.” While many thought the SHRDLU program was a breakthrough, and it was considered a wildly successful demonstration of AI, Winograd realized that expanding outside the Micro-World for broader applications was impossible.
Figure: Marvin Minsky and his SHRDLU-controlled robotic arm.
After McCarthy left MIT in 1962,* he became a professor at Stanford, where he started a lab called the Artificial Intelligence Center.* The laboratory focused most of its energy on speech recognition, and some of their work became the foundation for Siri, Apple’s virtual assistant.* The laboratory also worked on robotics and created one of the first robots, Shakey. Developed from 1966 to 1972, it was the first robot to break down large tasks into smaller ones and execute them without a human directing the smaller jobs.
Shakey’s actions included traveling from one location to another, opening and closing doors, turning light switches on and off, and pushing movable objects.* The robot occupied a custom-built Micro-World consisting of walls, doors, and a few simple wooden blocks. The team painted the baseboards on each wall so that Shakey could “see” where the walls met the floor.
Lisp was the language used for the planning system, and STRIPS, the computer program responsible for planning Shakey’s actions, would become the basis for most automated planners. The robot included a radio antenna, television camera, processors, and collision-detection sensors. The robot’s tall structure and its tendency to shake resulted in its name. Shakey worked in an extremely limited environment, something critics pointed out, but even with these simplifications, Shakey still operated disturbingly slowly.
Figure: Shakey, the first self-driving robot.
Another prominent laboratory working on artificial intelligence was at Carnegie Mellon University (CMU). There, Bruce T. Lowerre developed Harpy, a speech recognition system.* This work started around 1971, and DARPA funded five years of the research. Harpy was a breakthrough at the time because it recognized complete sentences. One difficulty in speech is knowing when one word ends and another begins. For example, “euthanasia” could be misheard as “youth in Asia.” By 1976, Harpy could recognize speech drawn from a 1,011-word vocabulary, spoken by different speakers, and transcribe it into text with a 90% accuracy rate.*
The Automatic Language Processing Advisory Committee (ALPAC) was created in 1964 by the US government “to evaluate the progress in computational linguistics in machine translation.”* By 1966, the committee reported it was “very skeptical of research done in machine translation so far, and emphasiz[ed] the need for basic research in computational linguistics” instead of AI systems. Because of this negative view, the government greatly reduced its funding.
At Yale, Roger Schank and his team used Micro-Worlds to explore language processing. In 1975, the group began a program called SAM, an acronym for Script Applier Mechanism, that was developed to answer questions about simple stories concerning stereotypical matters such as dining in a restaurant and traveling on the subway.
The program could infer information that was implicit in the story. For example, when asked, “What did John order?” SAM replied, “John ordered lasagna,” even though the story stated only that John went to a restaurant and ate lasagna.* Schank’s team worked on a few different projects, and in 1977, their work also included another computer program called FRUMP, which summarized wire-service news reports in three different languages.
At IBM, Nathaniel Rochester and his colleagues produced some of the first AI programs. In 1959, Herbert Gelernter constructed the Geometry Theorem Prover, a program capable of proving theorems that many students of mathematics found quite tricky. His program “exploited two important ideas. One was the explicit use of subgoals (sometimes called ‘reasoning backward’ or ‘divide and conquer’), and the other was the use of a diagram to close off futile search paths.”* Gelernter’s program created a list of goals, subgoals, sub-subgoals, and so on, expanding more broadly and deeply until the goals were solvable. The program then traversed this chain to prove the theorem true or false.
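The "reasoning backward" idea, expanding a goal into subgoals until each one is already known, can be sketched as follows. This is not Gelernter's program: the rule table, the fact names, and the `prove` function are hypothetical, and the sketch leaves out his use of a diagram to prune futile search paths.

```python
# Hypothetical rule table: each goal maps to alternative lists of subgoals.
RULES = {
    "triangles_congruent": [["side_ab_eq_de", "angle_b_eq_e", "side_bc_eq_ef"]],
    "angle_b_eq_e": [["both_right_angles"]],
}

# Hypothetical givens of the problem.
FACTS = {"side_ab_eq_de", "side_bc_eq_ef", "both_right_angles"}


def prove(goal, rules, facts):
    """Return True if the goal is a known fact or all subgoals of some rule hold.

    Assumes the rule table has no cycles; a real prover would need a guard.
    """
    if goal in facts:
        return True
    for subgoals in rules.get(goal, []):
        if all(prove(subgoal, rules, facts) for subgoal in subgoals):
            return True
    return False


print(prove("triangles_congruent", RULES, FACTS))
```

Removing a given such as `both_right_angles` leaves the subgoal `angle_b_eq_e` unprovable, so the top-level goal fails as well.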
A heuristic is a rule that helps to find a solution for a problem by making guesses about the best strategy to use given the state.*
In 1961, James Slagle wrote the program SAINT, Symbolic Automatic Integrator, which solved symbolic integration problems of the kind found in a first-year college calculus course. The SAINT system performed integration through a “heuristic” processing system.
SAINT divided the problem into subproblems, searched those for possible solutions, and then tested them. As soon as these subproblems were solved, SAINT could resolve the main one as well.
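The divide-into-subproblems strategy can be illustrated with a tiny symbolic integrator. This is only a sketch under strong assumptions: it handles sums, constant multiples, and powers of x, and it uses none of SAINT's actual heuristics. The tuple representation and function names are invented for this example.

```python
from fractions import Fraction

# Expressions are nested tuples (an assumed, illustrative representation):
#   ("add", left, right)   sum of two expressions
#   ("scale", c, expr)     constant c times an expression
#   ("pow", n)             x raised to the integer power n


def integrate(expr):
    """Integrate by splitting the problem into smaller integration problems."""
    kind = expr[0]
    if kind == "add":    # the integral of a sum is the sum of the integrals
        return ("add", integrate(expr[1]), integrate(expr[2]))
    if kind == "scale":  # constants move outside the integral
        return ("scale", expr[1], integrate(expr[2]))
    if kind == "pow":    # power rule, valid for n != -1
        exponent = expr[1]
        if exponent == -1:
            raise ValueError("1/x is outside this sketch")
        return ("scale", Fraction(1, exponent + 1), ("pow", exponent + 1))
    raise ValueError(f"unknown expression: {expr!r}")


def evaluate(expr, x):
    """Evaluate an expression at x so a result can be checked numerically."""
    kind = expr[0]
    if kind == "add":
        return evaluate(expr[1], x) + evaluate(expr[2], x)
    if kind == "scale":
        return expr[1] * evaluate(expr[2], x)
    if kind == "pow":
        return Fraction(x) ** expr[1]
    raise ValueError(f"unknown expression: {expr!r}")


# 3x^2 + 2x, whose antiderivative is x^3 + x^2 (constant omitted).
integrand = ("add", ("scale", 3, ("pow", 2)), ("scale", 2, ("pow", 1)))
antiderivative = integrate(integrand)
print(evaluate(antiderivative, 2))
```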
SAINT was an early forerunner of symbolic mathematics systems such as Wolfram Mathematica, which is a valuable tool widely used today in the scientific, engineering, and computational fields. SAINT, however, was not the only program that addressed school problems. Others, such as Daniel Bobrow’s program STUDENT, solved algebra word problems described in simple sentences like, “The consumption of my car is 15 miles per gallon.”*
Figure: ELIZA software running on a computer.
Created by Joseph Weizenbaum in 1964, ELIZA was the first chatbot.* It did not pass the Turing test, but it was an early natural language processing program that demonstrated where AI could head in the future. It talked to anyone who typed sentences into a computer terminal on which it was installed.
ELIZA simply followed a few rules to try to identify the most important keywords in a sentence. With that information, the program attempted to reply based on that content. ELIZA disassembled the input and then reassembled it, creating a response using data entered by the user. For example, if the user entered, “You are very helpful,” ELIZA would take the input and first create the fragment, “What makes you think I am,” then it would add the rest from the deconstructed initial input, leading to the final sentence, “What makes you think I am very helpful?” If the program could not find such keywords, ELIZA responded with a remark that lacked content, like “Please go on.” or “I see.” ELIZA and today’s Alexa would not be too different from each other.
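The disassemble-and-reassemble step can be sketched with a single pattern rule. This is a minimal illustration rather than Weizenbaum's original script: the one rule and the fallback remark are taken from the example above, while the regular expression and the function name are assumptions.

```python
import re

# Each rule pairs a keyword pattern with a template for the reply.
RULES = [
    (re.compile(r"^you are (.+?)[.!?]*$", re.IGNORECASE),
     "What makes you think I am {0}?"),
]

# Content-free remark used when no keyword pattern matches.
FALLBACK = "Please go on."


def respond(sentence):
    """Reassemble the user's own words into a reply, or fall back."""
    text = sentence.strip()
    for pattern, template in RULES:
        match = pattern.match(text)
        if match:
            return template.format(match.group(1))
    return FALLBACK


print(respond("You are very helpful."))
print(respond("The weather is nice."))
```

The program never models what "helpful" means; it only moves the user's words into a canned frame.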
It’s difficult to be rigorous about whether a machine really ‘knows’, ‘thinks’, etc., because we’re hard put to define these things. We understand human mental processes only slightly better than a fish understands swimming.
John McCarthy*
The First AI Winter started with funds drying up after many of the early promises did not pan out as expected. The most famous idea coming out of this era was the Chinese room argument, one that I personally disagree with, that states that artificial intelligence systems can never achieve human-level intelligence.
From 1974 to 1980, AI funding declined drastically, and this period became known as the First AI Winter. The term AI winter explicitly referenced nuclear winter, the prolonged and severe global cooling hypothesized to follow a nuclear war. In the same way, AI research was in such chaos that it would not receive funding for many years.
Critiques and financial setbacks, a consequence of the many unfulfilled promises during the early boom in AI, caused this era. From the beginning, AI researchers were not shy about making predictions of their future successes. The following statement by Herbert Simon in 1957 is often quoted, “It is not my aim to surprise or shock you … but the simplest way I can summarize is to say that there are now in the world machines that think, that can learn and that can create. Moreover, their ability to do these things is going to increase rapidly until—in a visible future—the range of problems they can handle will be coextensive with the range to which the human mind has been applied.”*
Terms such as “visible future” can be interpreted in various ways, but Simon also made more concrete predictions. He said that within 10 years a computer would be a chess champion and a machine would prove a significant mathematical theorem. With Deep Blue’s victory over Kasparov in 1997 and the formal proof of the Four Color Theorem in 2005 using general-purpose theorem-proving software, these predictions took 40 years or more to come true, at least 30 years longer than predicted. Simon’s overconfidence was due to the promising performance of early AI systems on simple examples. However, in almost every case, these early systems turned out to fail miserably when applied to broader or more difficult problems.
The first type of complication arose because most early programs knew nothing of their subject matter but rather succeeded using simple syntactic manipulations. A typical story occurred in early machine translation efforts, which were generously funded by the US National Research Council in an attempt to speed up the translation of Russian scientific papers in the wake of the Sputnik launch in 1957. It was thought initially that simple syntactic transformations, based on the grammar rules of Russian and English and word replacements from an electronic dictionary, would suffice to preserve the exact meanings of sentences. In fact, accurate translation requires background knowledge to resolve ambiguity and establish the content of the sentence. A report by ALPAC criticizing machine translation efforts caused another setback. After spending $20M, the National Academy of Sciences, Engineering, and Medicine ended support for AI research based on this report.
Much criticism also came from AI researchers themselves. In 1969, Minsky and Papert published a book-length critique of perceptrons, the basis of early neural networks.* They claimed that a neural network with more than one layer would not be powerful enough to be useful to replicate intelligence. Ironically, multilayer neural networks, also known as deep neural networks (DNNs), would eventually cause an enormous revolution in multiple tasks, including language translation and image recognition, and become the go-to machine learning technique for researchers.
In 1973, following the same pattern of criticism of AI research, a report known as the Lighthill Report, written by James Lighthill for the British Science Research Council, gave a very pessimistic forecast of the field.* It stated, “In no part of the field have discoveries made so far produced the major impact that was then promised.” Following this report and others, DARPA withdrew its funding from the Speech Understanding Research at CMU, canceling $3M of annual grants. Another significant setback for AI funding was the Mansfield Amendment, passed by Congress, which limited military funding for research that lacked a direct or apparent relationship to a specific military function.* As a result, DARPA funding dried up for many AI projects.
Criticism came from everywhere, including philosophers. Hubert Dreyfus, an MIT philosophy professor, criticized what he called the two assumptions of AI: the biological and psychological assumptions.*
The biological assumption refers to the brain being analogous to computer hardware and the mind being equivalent to computer software. The psychological assumption is that the mind performs discrete computations on discrete representations or symbols. Unfortunately, these concerns were not taken seriously by AI researchers. Dreyfus was given the cold shoulder and later claimed that AI researchers “dared not be seen having lunch with me.”
One of the strongest and most well-known arguments against machines ever having real intelligence marked the end of the AI Winter. In 1980, John Searle, a philosophy professor at the University of California, Berkeley, introduced the Chinese room argument as a response to the Turing test. This argument proposed that a computer program cannot give a computer a mind, understanding, or consciousness.
Searle compared a machine’s understanding of the Chinese language to the understanding of someone who does not know Chinese but can read the dictionary and translate every word from English to Chinese and vice versa. In the same way, a machine would not have real intelligence. His argument stated that even if the device passed the Turing test, it did not mean that the computer literally had intelligence. A computer that translates English to Chinese does not necessarily understand Chinese. It could indicate that it is merely simulating the intelligence needed to understand Chinese.
Searle used the term Strong AI to refer to a machine with real intelligence, the equivalent of understanding Chinese instead of only translating word by word. But in the case when computers do not have real intelligence, such as simply translating Chinese words instead of actually understanding the meanings of the words, the machine has Weak AI.
Figure: Representation of the Chinese room argument.
In the Chinese room argument, a person inside a room who has a dictionary, takes in English sentences, and spits out Chinese sentences does not really understand Chinese.
Searle said that there would not be a theoretical difference between him using a device that translates directly from a dictionary versus using one that understands Chinese. Each simply follows a program, step by step, which demonstrates a behavior described as intelligence. But in the end, the one who translates using a dictionary would not understand Chinese, in the same way that a computer can have the appearance of intelligence without actually having it. Some argue that Strong AI and Weak AI are the same with no real theoretical difference between the two.
The problem with this argument is that there is not a clear boundary between Weak AI and Strong AI. How can you determine whether someone really understands Chinese or they are mimicking the behavior? If the machine can translate every possible sentence, is that understanding or mimicking?
It’s fair to say that we have advanced further in duplicating human thought than human movement.
Garry Kasparov*
This era was marked by expert systems and increased funding in the 80s. The development of Cog, iRobot, and Roomba by Rodney Brooks and the creation of Gammonoid, the first software to win backgammon against a world champion, both took place during this period. This era ended with Deep Blue, the first computer software to win against a world-champion chess player.
After the First AI Winter, research picked up with new techniques that showed great results, heating up investment in research and development in the area and sparking the creation of new AI applications in enterprises. Simply put, the 1980s saw the rebirth of AI.
The research up until 1980 focused on general-purpose search mechanisms trying to string together elementary reasoning steps to find complete solutions. Such approaches were called weak methods because, although they applied to general problems, they did not scale up to larger or more difficult situations.
The alternative to these weak methods was to use more powerful domain-specific knowledge in expert systems that allowed more reasoning steps and could better handle typically occurring cases in narrow areas of expertise.* One might say that to solve a hard problem, you almost have to know the answer already. This philosophy was the leading principle for the AI boom from around 1980 to 1987.
The basic components of an expert system are the knowledge base and the inference engine. The knowledge base consists of all the important data for the domain-specific task. For example, in chess, this knowledge includes all the game’s rules and the points that each piece represents when playing. The information in the knowledge base is typically obtained by surveying experts in the area in question.
The inference engine enables the expert system to draw conclusions through simple rules like: If something happens, then something else happens. Using the knowledge base with these simple rules, the inference engine figures out what the system should do based on what it observes. In a chess game, the system can decide which piece to move by analyzing which moves are possible and which are the best ones based on the pieces remaining on the board. The inference engine makes a decision based on this knowledge.
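The split between a knowledge base and an inference engine can be sketched as a small forward-chaining loop. The rules and fact names below are hypothetical and far simpler than any commercial expert system; the point is only that the engine is generic while the domain knowledge lives in data.

```python
# Knowledge base: if-then rules, each a set of conditions and one conclusion.
# These chess-flavored rules are assumptions made up for illustration.
RULES = [
    (frozenset({"king_in_check", "no_legal_moves"}), "checkmate"),
    (frozenset({"checkmate"}), "game_over"),
]


def infer(initial_facts, rules):
    """Inference engine: fire rules until no new conclusion can be added."""
    facts = set(initial_facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            # A rule fires when all of its conditions are known facts.
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts


print(sorted(infer({"king_in_check", "no_legal_moves"}, RULES)))
```

Swapping in a different rule list changes the domain without touching `infer`, which is the design that let one engine serve many narrow areas of expertise.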
To understand the upsurge in AI, we must look at the state of computer hardware. Personal computers made huge strides from the early to mid-80s. While the Tandem/16 was available, the Apple II, TRS-80 Model I, and Commodore PET were marketed to single users at a much lower price and with better sound and graphics, although with less memory. These computers were aimed at word processing, video games, and schoolwork. As the 1980s progressed, so did computers, with the release of Lotus 1-2-3 in 1983 and the introduction of the Apple Macintosh and the modern graphical user interface (GUI) in 1984. Personal computing hardware advanced by leaps and bounds during this time, and the same was true for AI.
Domain-specific expert systems also proliferated. Large corporations around the world began adopting these systems because they leveraged desktop computers rather than expensive mainframes. For example, expert systems helped Wall Street firms automate and simplify decision making in their electronic trading systems. Suddenly, some people started assuming that computers could be intelligent again.
AI research and development projects received over $1B, and universities around the world with AI departments cheered. Companies developed not only new software but also specialized hardware to meet the needs of these new AI applications. Computer languages such as Prolog and Lisp were adopted and extended to address the needs of these applications. Prolog, for example, was built around a rule-based system: its basic building blocks are facts and if-then rules from which the system derives conclusions.
Many AI companies were created, including hardware-specific companies like Symbolics and Lisp Machines and software-specific companies like Intellicorp and Aion. Venture capital firms, which invest in early-stage companies, emerged to fund these new tech startups with visions of billion-dollar exits. For the first time, technology firms received a dedicated pool of money, which sped up the development of AI.
In part, the explosion of expert systems was due to the success of XCON (for eXpert CONfigurer), which was written by John P. McDermott at CMU in 1978.* An example of a rule that XCON had in its repertoire was:
If: the current context is assigning devices to unibus modules and there is an unassigned dual port disk drive and the type of controller it requires is known and there are two such controllers neither of which has any devices assigned to it and the number of devices that these controllers can support is known
Then: assign the disk drive to each of the controllers and note that the two controllers have been associated and that each supports one device*
Digital Equipment Corporation (DEC) used the system to automatically select computer components that met user requirements, even though internal experts sometimes disagreed regarding the best configuration. Before XCON, individual components, such as cables and connections, had to be manually selected, resulting in mismatched components and missing or extra parts. By 1986, XCON had achieved 95% to 98% accuracy and saved $25M annually, arguably saving DEC from bankruptcy.
Figure: Rodney Brooks, with his two robots, Sawyer and Baxter.*
Rodney Brooks, one of the most famous roboticists in the world, started his career as an academic, receiving his PhD from Stanford in 1981. Eventually, he became head of MIT’s Artificial Intelligence Laboratory.
At MIT, Rodney Brooks* defined the subsumption architecture, a reactive robotic architecture.* Robots that followed this architecture guided themselves and moved around based on reactive behaviors. That meant that robots had rules on how to act based on the state of the world and would react to the world as it was at that moment. This was different from the standard way of programming robots, in which they created a model of the world and made decisions based on that potentially stale information.
For Brooks, robots had a set of sub-behaviors that were organized in a hierarchy. Robots interacted with the world and reacted to it based on those behaviors. This theory and the demonstration of his work made him famous worldwide and, consequently, slowed down robotics research in Japan. At the time, Japanese research focused on a different software architecture for robotics, and with the success of this new theory, investment in other types of robotics research dwindled, especially in Japan.
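A hierarchy of reactive behaviors can be sketched as a priority list in which the first behavior that has something to say decides the action. This is a deliberate simplification of the subsumption architecture, not Brooks's implementation; the behaviors, sensor names, and thresholds are hypothetical.

```python
# Each behavior looks only at the current sensor readings (no world model)
# and returns an action, or None if it has nothing to contribute.

def avoid_obstacle(sensors):
    if sensors["obstacle_distance_m"] < 0.3:
        return "turn_away"
    return None


def seek_light(sensors):
    if sensors["light_level"] > 0.5:
        return "move_toward_light"
    return None


def wander(sensors):
    return "move_forward"  # default behavior, always applicable


# Ordered from highest to lowest priority (an assumed ordering).
BEHAVIORS = [avoid_obstacle, seek_light, wander]


def choose_action(sensors, behaviors=BEHAVIORS):
    """Let the highest-priority active behavior override the ones below it."""
    for behavior in behaviors:
        action = behavior(sensors)
        if action is not None:
            return action
    return "idle"


print(choose_action({"obstacle_distance_m": 0.1, "light_level": 0.9}))
print(choose_action({"obstacle_distance_m": 2.0, "light_level": 0.9}))
print(choose_action({"obstacle_distance_m": 2.0, "light_level": 0.1}))
```

Because every call reads fresh sensor values, the robot never acts on a stale internal map.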
Based on the subsumption architecture, Brooks developed a robot that could grab a coke bottle, something that was unimaginable before. He believed that the key to achieving intelligence was to build a machine that experienced the world in the same way a human does. Brooks was considered a maverick in his field. He used to say, “Most of my colleagues here in the lab do very different things and have only contempt for my work,”* saying that most of the other researchers were pursuing GOFAI, or Good Old-Fashioned Artificial Intelligence, or “brain in the box.” Instead, he stated that his robots had full knowledge about the real world and interacted with it. He used to say that GOFAI is like intelligence that only has access to the Korean dictionary: it can have self-consistent definitions, but they are unconnected to the world in any way.
In his 1990 paper, “Elephants Don’t Play Chess,”* he took direct aim at the physical symbol system hypothesis,* which states that a system based on rules for managing symbols has the necessary and sufficient means for general intelligent actions. It implies that human thinking can be reduced to symbol processing and that machines can achieve artificial general intelligence with a symbolic system.
For example, a chess game can be reduced to a symbolic mathematical system where pieces become symbols that are manipulated in each turn. The physical symbol system hypothesis stated that all types of thinking boiled down to mathematical formulas that can be manipulated. Brooks disagreed with the theory, arguing that symbols are not always necessary since “the world is its own best model. It is always exactly up to date. It always has every detail there is to be known. The trick is to sense it appropriately and often enough.” Therefore, you could create intelligent machines without having to create a model of the world.
Brooks started his research working on robotic insects that could move around their environment. At the time, researchers could not build robotic insects that moved quickly, even though that seems trivial. Brooks argued that the reason why robots from other researchers were slow is that they were GOFAI robots that relied on a central computer with a three-dimensional map of the terrain. He argued that such a system was not necessary and was even cumbersome for achieving the task of making robots move around their environment.
Brooks’s robots were different. Instead of having a central processor, each of his insects’ legs contained a circuit with a few simple rules, such as telling the leg to swing if it was in the air or move backward if it was on the ground. With these rules tied together, plus the interaction of the body with the circuits, the robots would walk similarly to how actual insects walk. The computation that the robots performed was always coupled physically with their body.
Figure: Cog with Rodney Brooks.*
Brooks’s initial plan was to move up the biological ladder of the animal kingdom. First, he would work on robots that looked like insects, then an iguana, a simple mammal, a cat, a monkey, and eventually a human. But he realized that his plan would take too long, so he jumped directly from insects to building a humanoid robot named Cog.*
Cog was composed of black and chrome metal beams, motors, wires, and a six-foot-tall rack of boards containing many processing chips, each as powerful as a Mac II. It had two eyes, each consisting of a pair of cameras: one fisheye lens to give a broad view of what was going on, and one normal lens that gave this humanoid a higher-resolution representation of what was directly in front of it.
Cog had only the basics from the software standpoint: some primitive vision, a little comprehensive hearing, some sound generation, and rough motor control. Instead of having the behavior programmed into it, Cog developed its behavior on its own by reacting to the environment. But it could do little besides wiggle its body and wave its arm. Ultimately, it ended up in a history museum in Boston.
Figure: Roomba robot cleaner.*
In 1990, after his stint developing Cog, Professor Brooks started iRobot. The name is a tribute to Isaac Asimov’s science fiction book I, Robot. In the years ahead, the company built robots for the military to perform tasks like disarming bombs. Eventually, iRobot became well-known for creating home robots. One of its most famous robots is a vacuum cleaner named Roomba.
The way ants and bees search for food inspired the software behind Roomba.* When the robot starts, it moves in a spiral pattern. It spans out over a larger and larger area until it hits an object. When it encounters something, it follows the edge of that object for a period of time. Then, it crisscrosses, trying to determine the largest distance it can go without hitting something else. This process helps Roomba work out how large the space is. But if it goes too long without hitting a wall, the vacuum cleaner starts spiraling again because it figures it is in an open space. It constantly calculates how wide the area is.
Figure: A visualization of how the Roomba algorithm works.
Roomba combines this strategy with another one based on its underneath dirt sensors. When it detects dirt, it changes its behavior to cover the immediate area. It then searches for another dirty area on a straight path. According to iRobot, these different combined patterns create the most effective way to traverse a room. At least 10 million Roombas have been sold worldwide.
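One way to picture the behavior switching described above is as a small state machine. The sketch below is a simplification based only on this description, not iRobot's software: the mode names, the event labels, and the choice to resume crisscrossing after covering a dirty area are all assumptions.

```python
from enum import Enum, auto


class Mode(Enum):
    SPIRAL = auto()
    FOLLOW_EDGE = auto()
    CRISSCROSS = auto()
    COVER_DIRTY_AREA = auto()


# (current mode, event) -> next mode. Event names are hypothetical labels.
TRANSITIONS = {
    (Mode.SPIRAL, "bump"): Mode.FOLLOW_EDGE,
    (Mode.FOLLOW_EDGE, "edge_time_up"): Mode.CRISSCROSS,
    (Mode.CRISSCROSS, "long_run_without_bump"): Mode.SPIRAL,
    (Mode.COVER_DIRTY_AREA, "area_covered"): Mode.CRISSCROSS,  # assumption
}


def next_mode(mode: Mode, event: str) -> Mode:
    # Detecting dirt overrides whatever the robot was doing.
    if event == "dirt":
        return Mode.COVER_DIRTY_AREA
    # Events with no matching rule leave the mode unchanged.
    return TRANSITIONS.get((mode, event), mode)


def run(events, start=Mode.SPIRAL):
    modes = [start]
    for event in events:
        modes.append(next_mode(modes[-1], event))
    return modes


history = run(["bump", "edge_time_up", "dirt", "area_covered", "long_run_without_bump"])
```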
Brooks left iRobot to start a new company, Rethink Robotics, which specializes in making robots for manufacturing. In 2012, they developed their first robot, named Baxter, to help small manufacturers pack objects. Rethink Robotics introduced Sawyer in 2015 to perform more detailed tasks.
Professor Hans Berliner from CMU created a program called BKG 9.9 that in July 1979 played the world backgammon champion in Monte Carlo.* The program controlled a robot called Gammonoid, whose software was one of the largest examples of an expert system. Gammonoid’s software was running on a giant computer at CMU in Pittsburgh, Pennsylvania, 4,000 miles away, and gave the robot instructions via satellite communication. The winner of the human-versus-robot games would take home $5K. Not much was expected of the programmed robot because the players knew that the existing microprocessors could not play backgammon well. Why would a robot be any different?
The opening ceremony reinforced the view that the robot would not win. When appearing on the stage in front of everyone, the robot entangled itself in the curtains, delaying its appearance. Despite this, Gammonoid became the first computer program to win the world championship of a board or card game.* It won seven games to one. Luigi Villa, the human opponent, could hardly believe it and thought the program was lucky to have won two of the games, the third and the final one. But Professor Berliner and some of his researchers had been working on this backgammon machine for years.
The former world champion, Paul Magriel, commented on the game, “Look at this play. I didn’t even consider this play … This is certainly not a human play. This machine is relentless. Oh, it’s aggressive. It’s a really courageous machine.” Artificial intelligence systems were starting to show their power.
With this new power came the birth of systems like HiTech and Deep Thought, which would eventually defeat chess masters. These were the precursors of Deep Blue, the system that would become the world-champion chess software, and they were all developed in laboratories at Carnegie Mellon University.
Deep Thought, initially developed by researcher Feng-hsiung Hsu in 1985, was sponsored and eventually bought by IBM. Deep Thought went on to win against all other computer programs in the 1988 American World Computer Chess Championship. The AI then competed against Bent Larsen, a Danish chess Grandmaster, and became the first computer program to win a chess match against a Grandmaster. The results impressed IBM, so they decided to bring development in-house, acquiring the program in 1995 and renaming it Deep Blue.
Figure: Garry Kasparov playing against IBM’s Deep Blue.
In the following year, 1996, it competed against the world chess champion, Garry Kasparov. Deep Blue became the first machine to win a chess game against a reigning world champion under regular time controls, although it ultimately lost the match. They played six games; Deep Blue won one, drew two, and lost three.
The next year in New York City, Deep Blue—now unofficially called Deeper Blue—improved and won the match series against Garry Kasparov, winning two, drawing three, and losing one. It was the first computer program to win the chess world championship.
The software behind Deep Blue, which won against Kasparov, was running an algorithm called min-max search with a few additional tricks. It searched through millions or billions of possibilities to find the best move. Deep Blue was able to look at an average of 100 million chess positions per second while computing a move. It analyzed the current state and figured out the potential next moves based on the game rules. The program also calculated the possible “value” of the next play based on each player’s best moves after that move and what those would mean to the game. This process used an inference system on a knowledge base, that is, an expert system.
Deep Blue’s algorithm minimized the possible loss for a worst-case scenario while maximizing the minimum gain for a potential win. It maximized the points that the software would get in the next moves and minimized the points that the opponent could get given a certain play. Hence the name min-max search. In chess, one simple way to estimate the value of a board position is to count the pieces that remain on the board for each player. Each piece has a value attached to it. For example, a knight is worth three points, a rook five, and a queen ten. So, in a state where a player loses a rook and a knight, that position is worth eight points less for that player.
Figure: An example of the min-max algorithm. We can calculate the “value” of each board by looking at the value at the bottom positions in the tree. For example, when there is one white knight and one black knight and a black rook, the total value of the board for the white player is 3-(3+5)=-5. So, we can calculate the value of each board at the bottom level. The value of the board at the next level up is the minimum value of all the ones below it, giving -8 and -5 as the board values for the second row. And at the top row, we use the maximum value, which is -5.
In another example, pawns could be worth one point, bishops and knights three, and a queen nine. The figure above depicts a simplified version of a game and demonstrates how the min-max algorithm works. On the first play, the white bishop can take a black bishop or knight, and either move would mean that the opponent would lose three points. So, both moves would be the same if they were the only factors involved in the search for the best move. But the software also analyzes the moves that the opponent could make in response and determines what the best move is for the opponent. If the bishop takes the opponent’s knight on the first move (the board on the left), the best countermove for the opponent is to take the bishop with its rook. That is not so good for the white player, the computer, because black would take its bishop, removing the three points it gained. The other possible first move for the white player is to take the bishop (the right board). If it does that, the other player cannot take any white piece, which is very good for the white player, because it would be up three points. Therefore, looking at only the next two moves, taking the bishop is the better of the two options.
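The min-max idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Deep Blue's implementation: the piece values follow the second example above, the trees are nested lists invented here, and the leaf values were chosen only so that the results match the numbers quoted in the text.

```python
PIECE_VALUES = {"pawn": 1, "knight": 3, "bishop": 3, "rook": 5, "queen": 9}


def material(white_pieces, black_pieces):
    """Board value from White's point of view: White's material minus Black's."""
    white = sum(PIECE_VALUES[p] for p in white_pieces)
    black = sum(PIECE_VALUES[p] for p in black_pieces)
    return white - black


def minimax(node, maximizing):
    """Leaves are numbers (board values); inner nodes are lists of child nodes."""
    if isinstance(node, (int, float)):
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)


# One white knight against a black knight and a black rook: 3 - (3 + 5).
caption_board = material(["knight"], ["knight", "rook"])

# Hypothetical two-ply tree: White picks a branch, then Black picks the leaf.
caption_tree = [[-8, -5], [-5, -3]]
caption_best = minimax(caption_tree, maximizing=True)

# The bishop example: taking the knight can be answered by a recapture (net 0),
# while taking the bishop leaves White up three points.
bishop_tree = [[0, 3], [3]]
bishop_best = minimax(bishop_tree, maximizing=True)
```

Here `caption_best` is the maximum of the two branch minimums, and `bishop_best` corresponds to taking the bishop. A real chess engine searches far deeper and evaluates positions with much more than piece counts.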
Deep Blue, however, examined more than two moves ahead. It looked at all the possible future moves that it could, based on its time limit and the information it had, and chose the best move. This is why it needed to analyze so many moves per second. Deep Blue was the first such system to analyze that many possible future scenarios. This system and many of the future developments in artificial intelligence required a lot of computing power. That is no surprise because the human brain also has a very powerful computing capability. When playing chess, human players also examine future moves, but humans rely on their memory of past games and the best moves in a certain scenario. The development of more complex games like Go used memory like this, but for chess, it was not necessary since Deep Blue could analyze most of the possible future scenarios while playing the game. For more complex games, software simply cannot look at all possible scenarios in a timely manner.
I am inclined to doubt that anything very resembling formal logic could be a good model for human reasoning.
Marvin Minsky*
Beginning in 1987, funds once again dried up for several years. The era was marked by the qualification problem, which many AI systems encountered at the time.
An expert system requires a lot of data to create the knowledge base used by its inference engine, and unfortunately, storage was expensive in the 1980s. Even though personal computers grew in use during the decade, they had at most 44MB of storage in 1986. For comparison, a 3-minute song stored as uncompressed CD-quality audio is around 30MB. So, you couldn’t store much on these PCs.
Not only that, but the cost to develop these systems for each company was difficult to justify. Many corporations simply could not afford the costs of AI systems. Added to that were problems with limited computing power. Some AI startups, such as Lisp Machines and Symbolics, developed specialized computing hardware that could process specialized AI languages like Lisp, but the cost of the AI-specific equipment outweighed the promised business returns. Companies realized that they could use far cheaper hardware with less-intelligent systems but still obtain similar business outcomes.
A warning sign for the new wave of interest in AI was that expert systems were unable to solve specific, computationally hard logic problems, like predicting customer demand or determining the impact of resources from multiple, highly variable inputs. Newly introduced enterprise resource planning (ERP) applications started replacing expert systems. ERP systems dealt with problems like customer relationship management and supplier relationship management, and they proved very valuable to large enterprises.
The qualification problem states that there is no way to predict all the possible outcomes and circumstances that could prevent the successful performance of an action, but the system still must recover from these unexpected failures. Reasoning agents in real-world environments rely on a solution to the qualification problem to make useful predictions.
For example, imagine that a program needs to drive a car with only if and then rules. A multitude of unexpected cases makes it impossible to handwrite all the rules for the application. Identifying cars and pedestrians is already extremely hard. A self-driving car not only needs to identify objects but also needs to drive around things (and people) based on its detection of such objects. Most people do not think of all the possible cases before they start writing the program. For example, if the program detects a human, is that human a pedestrian, a reflection, or someone riding in the bed of a pickup truck? The program also needs to be able to tell when vehicles tow other vehicles. These examples only scratch the surface of possible exceptions to the rules.
Expert systems fell prey to the qualification problem, and that caused a collapse in AI funding because the systems could not achieve much of what they promised. The Second AI Winter began with the sudden collapse of the market for specialized AI hardware in 1987.* Desktop computers from IBM and Apple were steadily gaining market share. But 1987 became the turning point for these AI manufacturers when Apple’s and IBM’s computers became more powerful and cheaper than the specialized Lisp machines. Not only that, but Reagan’s Star Wars missile defense program experienced a huge slowdown because DARPA had invested heavily in AI solutions. This event, in turn, severely damaged Symbolics, one of the main Lisp machine makers, creating a cascading effect.
Figure: One of the computers developed by the Fifth Generation Computer Systems program.
In addition to that, Fifth Generation Computer Systems, which was an initiative by the Japanese government to create a computer using massively parallel computing, was shut down during this period. The name of the project came from the fact that up until this time, there had been four generations of computing hardware:
Vacuum tubes,
Transistors and diodes,
Integrated circuits, and
Microprocessors.
The Japanese initiative represented a new generation of computers. Previously, computers focused on increasing the number of logic components in a single central processing unit (CPU), but Japan’s project, and others of its time, focused on boosting the number of CPUs for better performance. This enormous computer was intended to be a platform for future development in artificial intelligence. Its goal was to respond to natural language input and be capable of learning. But general-purpose Intel x86 machines and Sun workstations had begun surpassing specialized computer hardware. Because of that and the high cost of the project—around $500M in total—the Japanese cut the initiative after a decade. The project’s end marked a failure of the massively parallel processing approach to AI.
In the United States, most of the projects of this era were also not working as expected. Eventually, the first successful expert system, XCON, proved too expensive to maintain. The system was complicated to update, could not learn, and suffered from the qualification problem.
The Strategic Computing Initiative (SCI),* another large program developed by the US government from 1983 to 1993, was inspired by Japan’s Fifth Generation Computer Systems project. It focused on chip design, manufacturing, and computer architecture for AI systems. The integrated program included projects at companies and universities that were designed to eventually come together. Funded by DARPA, the effort “was supposed to develop a machine that would run ten billion instructions per second to see, hear, speak, and think like a human.”* By the late 1980s, however, it was apparent that the initiative would not succeed in its AI goals, leading DARPA to cut funding “deeply and brutally.” This event, in addition to the numerous companies that had gone out of business, led to the Second AI Winter. The beginning of probabilistic reasoning marked the end of this winter and provided an altogether new approach to AI.
Probability is orderly opinion and inference from data is nothing other than the revision of such opinion in the light of relevant new information.*
Probabilistic reasoning was a fundamental shift from the way that problems were addressed previously. Instead of adding facts, researchers started using probabilities for the occurrence of facts and events, building networks of how the probability of each event occurring affects the probability of others. Each event has a probability associated with it, as does each sequence of events. These probabilities plus observations of the world are used to determine, for example, what is the state of the world and what actions are appropriate to take.
Probabilistic reasoning involves techniques that leverage the probability that events will occur. Judea Pearl’s influential work, in particular with Bayesian networks, gave new life to AI research and was central to this period. Maximum likelihood estimation was another important technique used in probabilistic reasoning. IBM Watson, one of the last successful systems to use probabilistic reasoning, built on these foundations to beat the best humans at Jeopardy!
The work pioneered by Judea Pearl marked the end of the Second AI Winter.* His efforts ushered in a new era, arguably creating a fundamental shift in how AI was applied to everyday situations. One could even go so far as to say that his work laid much of the groundwork for artificial intelligence systems up to the end of the 1990s and the rise of deep learning. In 1985, Pearl, a professor at the University of California, Los Angeles, introduced the concept of Bayesian networks.* His new approach made it possible for computers to calculate probable outcomes based on the information they had. He had not only a conceptual insight but also a technical framework to make it practical. Pearl’s 1988 book, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, became the bible of AI at the time.
The techniques developed in this area were much more useful than just logic alone because probabilities represent more information than a conclusion. For example, stating “there is a 75% chance of rain in the next hour” conveys more information than “it is going to rain in the next hour,” especially because events in the future are not certain.
Named after the 18th-century mathematician Thomas Bayes, Dr. Pearl’s work on Bayesian networks provided a basic calculus for reasoning with uncertain information, which is everywhere in the real world. In particular, Pearl’s technique was the foundational framework for reasoning with imperfect data that changed how people approached real-world problem solving. Pearl’s research was instrumental in moving machine-based reasoning from the rules-bound expert systems of the 1980s to a calculus that incorporated uncertainty and probabilistic models. In other words, he figured out methods for trying to draw the best conclusion even when there is a degree of unpredictability.
Bayesian networks were applied when trying to answer questions from a vast amount of unstructured information or when trying to figure out what someone said in languages like Chinese that have many similar-sounding words. His work applied to an extensive range of applications, from medicine and gene work to information retrieval and spam filtering, in which only partial information is available.*
Bayesian networks provided a compact way of representing probability distributions. The Bayesian network formalism was invented to allow efficient representation and rigorous reasoning with uncertain knowledge. This approach largely overcame many problems of the systems of the 1960s and 1970s, and, simply put, dominated AI research on uncertain reasoning. In the 1960s, to overcome the problems of time and space complexity, simplifying assumptions had to be made, and the systems were small in scale and computationally expensive.* The 1970s shifted to using probability theory, but unfortunately, this theory could not be applied straightforwardly. Even with modifications, it could not solve the problems of uncertainty.*
Bayesian networks are a way of representing the dependencies between events and how the occurrences or probabilities of events affect the probabilities of other events. They are based on Bayes’s theorem that states that the probability of an event happening depends on whether other relevant events have happened or the probability that they will happen.
For example, the likelihood of a person having cancer increases as the age of that person goes up. Therefore, a person’s age can be used to more accurately assess that they have cancer. Not only that, but Bayes’s rule also applies to the other side of the equation. For example, if you find out that someone has cancer, then the probability that the person is older is higher.
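Written out for the cancer example, Bayes’s theorem takes the following form. The symbols are labels introduced here for illustration: C stands for “the person has cancer” and A for “the person is older.”

```latex
P(C \mid A) = \frac{P(A \mid C)\, P(C)}{P(A)}
\qquad\text{and}\qquad
P(A \mid C) = \frac{P(C \mid A)\, P(A)}{P(C)}
```

The two expressions are the same relationship read in opposite directions, which is why learning someone’s age changes the estimate for cancer, and learning about a cancer diagnosis changes the estimate for age.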
Figure: A Bayesian Network.
This figure shows an example of a simple Bayesian network. Each node has a table with the associated probability depending on leading nodes. You can calculate the probability that the grass is wet given that it is raining and the sprinkler is not on. Or you could, for example, determine the chances that a sprinkler is running if it is not raining and the grass is wet. The diagram simply shows the probability of one outcome based on the previous ones. Bayesian networks help determine the probability of something happening given the observation of other states.
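A rain, sprinkler, and wet-grass network of this kind can be queried with a short Python sketch. The probability tables below are made-up illustrative numbers, not the values from the figure, and the function names are hypothetical. The method is plain enumeration over all possible worlds, which only works for very small networks.

```python
from itertools import product

# Hypothetical probability tables (assumptions for illustration only).
P_RAIN = 0.2
P_SPRINKLER_GIVEN_RAIN = {True: 0.01, False: 0.4}
# Keyed by (sprinkler, rain): probability that the grass is wet.
P_WET_GIVEN = {
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.05,
}


def joint(rain, sprinkler, wet):
    """Probability of one complete assignment, following the network's arrows."""
    p_r = P_RAIN if rain else 1 - P_RAIN
    p_s_on = P_SPRINKLER_GIVEN_RAIN[rain]
    p_s = p_s_on if sprinkler else 1 - p_s_on
    p_w_yes = P_WET_GIVEN[(sprinkler, rain)]
    p_w = p_w_yes if wet else 1 - p_w_yes
    return p_r * p_s * p_w


def probability(query, evidence):
    """P(query | evidence), summing the joint over every consistent world."""
    names = ("rain", "sprinkler", "wet")
    numerator = denominator = 0.0
    for values in product([True, False], repeat=3):
        world = dict(zip(names, values))
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(**world)
            denominator += p
            if all(world[k] == v for k, v in query.items()):
                numerator += p
    return numerator / denominator


# Grass wet, given that it is raining and the sprinkler is off.
p_wet = probability({"wet": True}, {"rain": True, "sprinkler": False})
# Sprinkler running, given that it is not raining and the grass is wet.
p_sprinkler = probability({"sprinkler": True}, {"rain": False, "wet": True})
```

With these invented numbers, `p_wet` simply reads back the table entry of 0.8, while `p_sprinkler` comes out at roughly 0.92 because a running sprinkler is by far the likeliest explanation for wet grass on a dry day.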
In this case, the rules are simple to access, but when you have a lot of dependent events, Bayesian networks are a way of representing them and their dependencies. For example, stock prices may depend on many factors, including public sentiment, the central bank interest rates and bond prices, and the trading volume at the moment. A Bayesian network represents all these dependencies.
Bayesian networks addressed these problems by adding a framework that researchers could use when dealing with them. Even though Bayesian networks were useful in the 1990s, probabilistic reasoning does not address all possible cases due to the qualification problem described in the previous chapter. That is why, I believe, probabilistic reasoning fell out of favor in the early 2000s and deep learning has taken over the field since then. A few probabilities cannot describe how complex the world is.
Another technique used frequently during these years was maximum likelihood estimation (MLE). Based on a model of how the world should work and the observation of what is in the world, maximum likelihood estimation tries to determine the value of certain variables that would maximize the probability of such observation happening.
The idea behind it is that if you observe enough events occurring in the world, you would have enough samples to estimate the real probability distribution of these events. MLE finds the parameters of such a model that would best fit the observed data.
Figure: A normal distribution of heights of a given population.
For example, let’s say that you know that a normal distribution best describes the height of individuals for a certain country, like in the figure above. The y-axis represents the number of people with a certain height, and the x-axis represents the heights of the individuals. So in the center of this curve, we know that there are many people that have average height, and then as we move farther from the center on either side, there are fewer taller or shorter people.
With this technique, you can poll a lot of people to find out their height, and based on the data, you can determine the real distribution across the entire population. As shown in the figure below, after receiving the responses, you can then assume that the distribution of the heights of the entire population will be the one that maximizes the likelihood of those responses, the curved line from the first figure. The information on the distribution of heights of people inside a country can be useful for many applications. And with MLE, you can determine the most likely scenario for the heights of the population by surveying only a portion of the population.
Figure: This image shows the corresponding responses from a set of people. The curves on top of it are the assumed models based on the data given by the answers and the assumed normal distribution, using MLE. The blue line represents the height of men in the study’s population and the pink, the height of women.
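For a normal distribution, the maximum likelihood estimate has a simple closed form, which the sketch below computes. The height values are invented for illustration and the function names are hypothetical.

```python
import math


def fit_normal_mle(samples):
    """Return the mean and standard deviation that maximize the likelihood."""
    n = len(samples)
    mean = sum(samples) / n
    # The MLE of the variance divides by n (not n - 1).
    variance = sum((x - mean) ** 2 for x in samples) / n
    return mean, math.sqrt(variance)


def log_likelihood(samples, mean, std):
    """Log-probability of the samples under a normal(mean, std) model."""
    return sum(
        -0.5 * math.log(2 * math.pi * std**2) - (x - mean) ** 2 / (2 * std**2)
        for x in samples
    )


# Made-up survey responses, in centimeters.
heights_cm = [158.0, 163.5, 170.0, 171.5, 175.0, 180.0, 182.0]
mu, sigma = fit_normal_mle(heights_cm)

best = log_likelihood(heights_cm, mu, sigma)
shifted = log_likelihood(heights_cm, mu + 1.0, sigma)  # any other choice scores lower
```

Comparing `best` with `shifted` shows the defining property of MLE: moving the parameters away from the fitted values makes the observed responses less likely under the model.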
In many ways, the use of probability for inference marked this period. It preceded the revolution that multilayer neural networks, also known as deep learning, would cause in the field. Probabilistic reasoning was successful in many applications and reached its peak with the development of IBM Watson. While Watson did not use Bayesian networks or maximum likelihood estimation for its calculations, it used probabilistic reasoning to determine the most likely answer.
Watson was a project developed from 2004 to 2011 by IBM to beat the best humans at the television game show Jeopardy! The project was one of the last successful systems to use probabilistic reasoning before deep learning became the go-to solution for most machine learning problems.
Since Deep Blue’s victory over Garry Kasparov in 1997, IBM had been searching for a new challenge. In 2004, Charles Lickel, an IBM Research Manager at the time, identified the project after a dinner with co-workers. Lickel noticed that most people in the restaurant were staring at the bar’s television. Jeopardy! was airing. As it turned out, Ken Jennings was playing his 74th match, the last game he won.
Figure: The computer that IBM used for IBM Watson’s Jeopardy! competition.
Intrigued by the show as a possible challenge for IBM, Lickel proposed the idea of IBM competing against the best Jeopardy! players. The first time he presented the idea, he was immediately shut down, but that would change. The next year, Paul Horn, an IBM executive, backed Lickel’s idea. In the beginning, Horn found it challenging to find someone in the department to lead the project, but eventually, David Ferrucci, one of IBM’s senior researchers, took the lead. They named the project Watson after the father and son team who led IBM from 1914 to 1971, Thomas J. Watson Sr. and Jr.
In the Deep Blue project, the chess rules were entirely logical and could be easily reduced to math. The rules for Jeopardy!, however, involved complex behaviors, such as language, and were much harder to solve. When the project started, the best question-answering (QA) systems could only answer questions in very simple language, like, “What is the capital of Brazil?” Jeopardy! is a quiz competition where contestants are presented with a clue in the form of an answer, and they must phrase their response as a question. For example, a clue could be: “Terms used in this craft include batting, binding, and block of the month.” The correct response would be “What is quilting?”
IBM had already been working on a QA system called Practical Intelligent Question Answering Technology (Piquant)* for six years before Ferrucci started the Watson project. In a US government competition, Piquant correctly answered only 35% of the questions and took minutes to do so. This performance was not even close to what was necessary to win Jeopardy!, and attempts to adapt Piquant failed. So, a new approach to QA was required. Watson was the next attempt.
In 2006, Ferrucci ran initial tests of Watson and compared the results against the current competition. Watson was far below what was needed for live competition. Not only did it respond correctly just 15% of the time, compared to 95% for other programs, but it was also slower. Watson had to be much better than the best software system at the time to have even the slightest chance to win against the best humans. The next year, IBM staffed a team of 15 and gave a timeframe of three to five years. Ferrucci and his team had much work to do.* And they succeeded. In 2010, Watson was winning against Jeopardy! contestants.
Figure: Comparison of precision and percentage of questions answered by the best system before IBM Watson and the top human Jeopardy! players.
What made the game so hard for Watson was that language was a very difficult problem for computers at the time. Language is full of intended and implied meaning. An example of such a sentence is “The name of this hat is elementary, my dear contestant.” People can easily detect the wordplay that evokes “elementary, my dear Watson,” a catchphrase used by Sherlock Holmes, and then remember that the Hollywood version of Sherlock Holmes wears a deerstalker hat. Programming a computer to infer this for a wide range of questions is hard.
To provide a physical presence in the televised games, Watson was represented by a “glowing blue globe criss-crossed by threads of ‘thought,’—42 threads, to be precise,”* referencing the significance of the number 42 in the book The Hitchhiker’s Guide to the Galaxy. Let’s go over how Watson worked.
Watson’s main difference from other systems was its speed and memory. Stored in its memory were millions of documents including books, dictionaries, encyclopedias, and news articles. The data was collected either online from sources like Wikipedia or offline. The algorithm employed different techniques that together allowed Watson to win the competition. The following are a few of these techniques.
First, Watson “read” vast amounts of text. It looked at the text semantically and syntactically, meaning that it tried to tear sentences apart to understand them. For example, it identified the location of sentences’ subjects, verbs, and objects and produced a graph of the sentences, known as syntactic frames. Again, AI used learning techniques much like humans. In this case, Watson learned the basics of grammar similar to how an elementary student does.
Figure: How learning from reading works.
This figure shows the process of learning from reading. First, the text is parsed and turned into syntactic frames. Then, through generalization and statistical aggregation, they are turned into semantic frames.
Most of the algorithms in Watson were not novel techniques. For example, for the clue “He was presidentially pardoned on September 8, 1974,” the algorithm found that this sentence was looking for the subject. It then searched for possible subjects in semantic frames with similar words in them. Based on the syntactical breakdown done in the first step, it generated a set of possible answers. If one of the possible answers it found was “Nixon,” that would be considered a candidate answer. Next, Watson played a clever trick replacing the word “He” with “Nixon,” forming the new sentence “Nixon was presidentially pardoned on September 8, 1974.”
Then, it ran a new search on the generated semantic frame, checking to see if it was the correct answer. The search found a very similar semantic frame “Ford pardoned Nixon on September 8, 1974” with a high confidence score, so the candidate answer was also given a high score. But searching and getting a confidence score was not the only technique applied by Watson.
Evaluating hypotheses was another clever technique that Watson employed to help evaluate its answers. With the clue: “In cell division, mitosis splits the nucleus and cytokinesis splits this liquid cushioning the nucleus,” Watson searched for possible answers in the knowledge base that it acquired through reading. In this case, it found many candidate answers:
Cytoplasm
Organelle
Vacuole
Plasm
Mitochondria
Systematically, it tested the possible answers by creating an intermediate hypothesis, checking if the solutions fit the criterion of being liquid. It calculated the confidence of each one of the solutions being liquid using its semantic frames and the same search mechanism described above. The results were the following confidence scores:
is (“Cytoplasm”, “liquid”) = 0.2
is (“Organelle”, “liquid”) = 0.1
is (“Vacuole”, “liquid”) = 0.1
is (“Plasm”, “liquid”) = 0.1
is (“Mitochondria”, “liquid”) = 0.1
To generate these confidence scores, it searched through its knowledge base and, for example, found the semantic frame:
Cytoplasm is a fluid surrounding the nucleus.
It then checked to see if fluid was a type of liquid. To answer that, it looked at different resources, including WordNet, a lexical database of semantic relations between words, but did not find evidence showing that fluid is a liquid. Through its knowledge base, it learned that sometimes people consider fluid a liquid. With all that information, it created a possible answer set, with each answer having its own probability—a confidence score—assigned to it.
Another technique Watson employed was to cross-check whether candidate answers made sense historically or geographically, checking to see which answers could be eliminated or changing the probability of a response being correct.
For example, for the clue “In 1594, he took the job as a tax collector in Andalusia,” the two top answers generated by the first pass of the algorithm were “Thoreau” and “Cervantes.” When Watson analyzed “Thoreau” as a possible answer, it discovered that Thoreau was born in 1817, so it ruled that answer out because he was not alive in 1594.
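A minimal sketch of this kind of historical cross-check follows. The `LIFESPANS` table and the `alive_in` and `filter_by_year` helpers are hypothetical names chosen for illustration; Watson drew such facts from its own sources rather than from a hand-written dictionary.

```python
# Hypothetical lookup table: candidate -> (birth year, death year).
LIFESPANS = {
    "Thoreau": (1817, 1862),
    "Cervantes": (1547, 1616),
}

def alive_in(candidate, year, lifespans=LIFESPANS):
    """Return True if the candidate was alive during the given year."""
    born, died = lifespans[candidate]
    return born <= year <= died

def filter_by_year(candidates, year, lifespans=LIFESPANS):
    """Keep only the candidates who could have acted in that year."""
    return [c for c in candidates if alive_in(c, year, lifespans)]

remaining = filter_by_year(["Thoreau", "Cervantes"], 1594)
```

Here `remaining` contains only “Cervantes,” since Thoreau was born more than two centuries after 1594. A softer variant could lower a candidate's confidence instead of removing it outright.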
Jeopardy!’s questions are organized into categories, limiting the scope of knowledge needed for each answer. Watson used that information to adjust its answer confidence. For example, in the category “Celebrations of the Month,” the first clue was “National Philanthropy Day and All Souls’ Day.” Based on its algorithm, Watson’s answer would be “Day of the Dead” because it classified this category as being of the type “Day,” but the correct response was “November.” Because of that, Watson updated the category type to be a mix of “Day” and “Month,” which boosted answers of type “Month.” Over time, Watson could update the type of response expected for a given category.
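One simple way to picture this update is as a set of weights over answer types that shifts toward the type of each revealed correct answer. The sketch below is an assumption-laden illustration: the `update_type_weights` function and the `rate` of 0.5 are hypothetical, and the source does not describe the formula Watson actually used.

```python
def update_type_weights(weights, correct_type, rate=0.5):
    """Shift belief toward the answer type of the revealed correct response.

    weights: dict mapping an answer type to its current weight (sums to 1).
    rate: hypothetical learning rate controlling how fast beliefs move.
    """
    updated = {t: w * (1 - rate) for t, w in weights.items()}
    updated[correct_type] = updated.get(correct_type, 0.0) + rate
    return updated

# Watson starts out believing the category asks for a "Day".
weights = {"Day": 1.0}
after_first_clue = update_type_weights(weights, "Month")
after_second_clue = update_type_weights(after_first_clue, "Month")
```

After the first correction the weights are an even mix of “Day” and “Month,” and a second “Month” answer tilts them further, so candidate answers of type “Month” receive a larger boost.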
Figure: IBM Watson updates the category type when its responses do not reflect the type of response for the correct answer. Then, it updates the possible category type based on the correct answers.
Figure: The evolution of IBM Watson through its different versions and upgrades.
These techniques were all employed together to make Watson perform at the highest level. In the beginning of 2011, IBM scientists decided that Watson was good enough to play against the best human opponent. They played a practice match before the press on January 13, 2011, and Watson won against Ken Jennings and Brad Rutter, two of the best Jeopardy! players. Watson ended the game with a score of $4,400, Ken Jennings with $3,400, and Brad Rutter with $1,200. Watson and Jennings were tied until the final question, worth $1,000—Watson won the game on that question. After the practice match, Watson was ready to play against the best humans in front of a huge audience on national television.
The first broadcast match happened a month later on February 14, 2011, and the second match the next day. Watson won the first match but made a huge mistake. In the final round, Watson’s response in the US Cities category to the prompt “Its largest airport is named for a World War II hero; its second largest, for a World War II battle” was “What is Toronto??????” Alex Trebek, the host of Jeopardy! and a Canadian native, made fun of Watson, jokingly saying that he had learned that Toronto was an American city.
David Ferrucci, the lead scientist, explained that Watson did not deal with structured databases, so it treated “US Cities” only as a clue to what the possible answer could include, and that several American cities are named Toronto. Also, the Canadian baseball team, the Toronto Blue Jays, plays in baseball’s American League. That could be the reason why Watson considered Toronto to be one of the possible answers. Ferrucci also said that very often answers in Jeopardy! are not the types of things that are named in the category. Watson knew that, and so possibly considered the category “US Cities” to be only a hint to the answer. Watson used other elements to contribute to its response as well. The engineers also stated that its confidence was very low, which was indicated by the number of question marks after Watson’s answer. Watson had 14% confidence in “What is Toronto??????”. The correct answer, “What is Chicago?”, was a close second with 11% confidence. At the end of the first match, however, Watson had more than triple the money of the second-best competitor. Watson finished with $35,734, Rutter with $10,400, and Jennings with $4,800.
Figure: David Ferrucci, the man behind Watson.
To support Watson on the second day of the competition, one of the engineers wore a Toronto Blue Jays jacket. The game started, and Jennings chose the Daily Double clue. Watson responded incorrectly to the Daily Double clue for the first time in the two days of play. After the first round, Watson placed second for the first time in the competition. But in the end, Watson won the second match with $77,147; Jennings finished in second place with $24,000. IBM Watson made history as the first machine to win Jeopardy! against the best humans.
The fundamental shift in solving problems that probabilistic reasoning brought to AI from 1993 to 2011 was a big step forward, but probability and statistics only took developers so far. Geoffrey Hinton created a breakthrough technique called backpropagation to usher in the next era of artificial intelligence: deep learning. His work with multilayer neural networks is the basis of modern-day AI development.
Deep learning is a class of machine learning methods that uses multilayer neural networks that are trained through techniques such as supervised, unsupervised, and reinforcement learning.
In 2012, Geoffrey Hinton and the students at his lab showed that deep neural networks, trained using backpropagation, beat the best algorithms in image recognition by a wide margin.
Right after that, deep learning took off, unlocking a ton of potential. Its first large-scale use was with Google Brain, led by Andrew Ng. The team fed 10 million YouTube videos to 1,000 computers, and the resulting system was able to recognize cats and detect faces without hard-coded rules.
Then, deep learning was used by DeepMind to create the first machine learning system to defeat the best Go players in the world. DeepMind built on techniques from earlier Go programs, combining Monte Carlo tree search with two deep neural networks: one computed the probability of each possible next move, and the other estimated the chances of winning from the resulting positions. This was a breakthrough because of how hard Go is compared to games such as chess: the total number of possible states is many orders of magnitude larger than in chess.
The same team used deep neural networks to determine the 3D structure of proteins based on their amino acid sequence, creating what is known as AlphaFold. This was a solution to a 50-year-old grand challenge in biology.*
New techniques emerged, including the creation of generative adversarial networks, which have two neural networks playing a cat and mouse game wherein one creates fake images that look like the real ones fed into it and the other decides whether they are real. This new technique has been used to create images that look real.
While deep learning required the creation of a new software system, TensorFlow, and new hardware, graphics processing units (GPUs) and tensor processing units (TPUs), what was most needed was a way to train these deep neural networks. Deep learning is only as successful as its training data. Fei-Fei Li’s work created an instrumental dataset, ImageNet, which was used not only to train the algorithms but also as a benchmark in the field. Without that data, deep learning would not be where it is today.
Unfortunately, with data collection at such a massive scale, privacy problems become a concern. While we have come to expect data breaches, it does not have to be the case. Apple and Google continue to fine-tune approaches that do not require the collection of personal data. Differential privacy and federated learning are examples of today’s technology. They allow models to update and learn without leaking individual information about the data that is being used for training.
I learned very early the difference between knowing the name of something and knowing something. Richard Feynman*
Machine learning algorithms usually learn by analyzing data and inferring what kind of model or parameters a model should have or by interacting with the environment and getting feedback from it. Humans can annotate this data or not, and the environment can be simulated or the real world.
The three main categories that machine learning algorithms can use to learn are supervised learning, unsupervised learning, and reinforcement learning. Other techniques can be used, such as evolution strategies or semi-supervised learning, but they are not as widely used or as successful as the three above techniques.
Supervised learning has been widely used in training computers to tag objects in images and to translate speech to text. Let’s say you own a real estate business and one of the most important aspects of being successful is to figure out the price for a house when it enters the market. Determining that price is extremely important for completing a sale, making both the buyer and seller happy. You, as an experienced realtor, can figure out the pricing for a house based on your previous knowledge.
But as your business grows, you need help, so you hire new realtors. To be successful, they also need to determine the price of a house in the market. In the interest of helping these inexperienced people, you write down the value of the houses that the company already bought and sold, based on size, neighborhood, and various details, including the number of bathrooms and bedrooms and the final sale price.
| Bedrooms | Sq. Feet | Neighborhood | Sale Price |
Table: Sample data for a supervised learning algorithm.
This information is called the training data; that is, it is example data that contains the factors or features that may influence the price of a house in addition to the final sale price. New hires look at all this data to start learning which factors influence the final price of a house. For example, the number of bedrooms might be a great indicator of price, but the size of the house may not necessarily be as important. If inexperienced realtors have to determine the price of a new house that enters the market, they simply check to find a house that is most similar and use that information to determine the price.
| Bedrooms | Sq. Feet | Neighborhood | Sale Price |
Table: Missing information that the algorithm will determine.
That is precisely how algorithms learn from training data with a method called supervised learning. The algorithm knows the price of some of the houses in the market, and it needs to figure out how to predict the new price of a house that is entering the market. In supervised learning, the computer, instead of the realtors, figures out the relationship between the data points. The value the computer needs to predict is called the label. In the training data, the labels are provided. When there is a new data point whose value, the label, is not defined, the computer estimates the missing value by comparing it to the ones it has already seen.
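The realtor example can be sketched as a nearest-neighbor predictor, which mirrors the idea of finding the most similar house and reusing its price. Everything specific here is an assumption made for illustration: the four listings in `TRAINING_DATA`, their prices, and the hand-picked `distance` function (including the choice to treat 500 square feet as roughly equal to one bedroom) are hypothetical.

```python
# Hypothetical labeled examples: (bedrooms, square feet, neighborhood) -> sale price.
TRAINING_DATA = [
    ((2, 1000, "Downtown"), 250_000),
    ((3, 1500, "Downtown"), 340_000),
    ((3, 1600, "Suburb"), 300_000),
    ((4, 2200, "Suburb"), 410_000),
]

def distance(a, b):
    """Hand-picked dissimilarity between two houses (an assumption, not a standard)."""
    bedrooms_gap = abs(a[0] - b[0])
    size_gap = abs(a[1] - b[1]) / 500  # treat 500 sq ft like one bedroom
    neighborhood_gap = 0 if a[2] == b[2] else 1
    return bedrooms_gap + size_gap + neighborhood_gap

def predict_price(house, training_data=TRAINING_DATA):
    """Return the label (sale price) of the most similar training example."""
    _, price = min(training_data, key=lambda example: distance(house, example[0]))
    return price

estimate = predict_price((3, 1550, "Suburb"))
```

For the new three-bedroom, 1,550-square-foot house in the suburb, the closest training example is the 1,600-square-foot suburban house, so the estimate is its price of 300,000. Practical systems usually learn how much each feature matters from the data instead of fixing the distance by hand.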
Unsupervised learning is a machine learning technique that learns patterns with unlabeled data.
In our example, unsupervised learning is similar to supervised learning, but the price of each house is not part of the information included in the training data. The data is unlabeled.
Table: Sample training data for an unsupervised learning algorithm.
Even without the price of the houses, you can discover patterns from the data. For example, the data can tell that there is an abundance of houses with two bedrooms and that the average size of a house in the market is around 1,200 square feet. Other information that might be extracted is that very few houses in the market in a certain neighborhood have four bedrooms, or that five major styles of houses exist. And with that information, if a new house enters the market, you can figure out the most similar houses by looking at the features or identifying that the house is an outlier. This is what unsupervised learning algorithms do.
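A small sketch shows how such patterns can be pulled from unlabeled data. The `HOUSES` list is hypothetical, and the approach (summary statistics plus a z-score check on size) is a deliberately simple stand-in for real unsupervised algorithms such as clustering. The `threshold` of 3 standard deviations is an assumed convention.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical unlabeled listings: (bedrooms, square feet). No prices included.
HOUSES = [(2, 1100), (2, 1200), (2, 1150), (3, 1300), (2, 1250), (3, 1200)]

def summarize(houses):
    """Discover simple patterns: the typical bedroom count and house size."""
    bedroom_counts = Counter(bedrooms for bedrooms, _ in houses)
    sizes = [sq_feet for _, sq_feet in houses]
    return {
        "most_common_bedrooms": bedroom_counts.most_common(1)[0][0],
        "mean_sq_feet": mean(sizes),
        "stdev_sq_feet": pstdev(sizes),
    }

def is_outlier(house, houses, threshold=3.0):
    """Flag a house whose size is far from what the data considers normal."""
    stats = summarize(houses)
    z_score = abs(house[1] - stats["mean_sq_feet"]) / stats["stdev_sq_feet"]
    return z_score > threshold

summary = summarize(HOUSES)
```

On this toy data the most common bedroom count is two and the average size is 1,200 square feet. A 4,000-square-foot house would be flagged as an outlier, while a 1,220-square-foot one would not.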
The previous two ways of learning are based solely on data given to the algorithm. The process of reinforcement learning is different: the algorithm learns by interacting with the environment. It receives feedback from the environment either by rewarding good behavior or punishing bad. Let’s look at an example of reinforcement learning.
Say you have a dog, Spot, whom you want to train to sit on command. Where do you start? One way is to show Spot what “sit” means by putting her bottom on the floor. The other way is to reward Spot with a treat whenever she puts her bottom on the floor. Over time, Spot learns that whenever she sits on command she receives a treat and that this is a rewarded behavior.
Reinforcement learning works in the same way. It is a framework built on the insight that you can teach intelligent agents, such as a dog or a deep neural network, to achieve a certain task by rewarding them when they correctly perform the task. Whenever the agent achieves the desired outcome, its chance of repeating the action increases due to the reward. Agents are algorithms that process input and choose the action to output. Spot is the agent in the example.
Reinforcement learning is a machine learning training method based on rewarding desired behaviors and/or punishing undesired ones.
Reinforcement learning as a learning framework is interesting, but the associated algorithms are the most important aspect. They work by defining the reward the agent receives once it achieves a state, like sitting. The goal of a reinforcement algorithm is to find a policy, that is, a mapping from states to the actions to be taken, that maximizes the expected reward, so that the agent learns the behavior that earns the most reward (the treat).
In the reinforcement learning formulation, the environment gives the reward: the agent does not figure out the reward itself but only receives it by interacting with the environment and hitting on the expected behavior. One problem with this is that the agent sometimes takes a long time to receive a reward. For example, if Spot never sits, then she never receives a treat and does not learn to sit. Or, let’s say you want an agent to learn how to navigate a maze and the reward is only given when the agent exits the maze. If the agent takes too long before leaving, then it is hard to say which actions the agent took that helped it get out of the maze. Another problem is that the agent only learns from its own successes and failures. That is not necessarily the case with humans in the real world. No one needs to drive off a cliff thousands of times to learn how to drive. People can figure out rewards from observation.
The following two steps define reinforcement algorithms:
Add randomness to the agent’s actions so it tries something different, and
If the result was better than expected, do more of the same in the future.
Adding randomness to the actions ensures that the agent searches for the correct actions to take. And if the result is the one expected, then the agent tries to do more of the same in the future. The agent does not necessarily repeat the exact same actions, however, because it still tries to improve by exploring potentially better actions. Even though reinforcement algorithms can be explained easily, they do not necessarily work for all problems. For reinforcement learning to work, the situation must have a reward, and it is not always easy to define what should or should not be rewarded.
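The two steps can be sketched with a very small agent that picks among a few actions. This is one possible instance of the idea, not the only formulation: it uses softmax action preferences with a running-average baseline, and the `train_policy` function, the learning rate `lr`, and the reward values for Spot's actions are all assumptions chosen for illustration.

```python
import math
import random

def softmax(prefs):
    """Turn action preferences into probabilities."""
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def train_policy(rewards, steps=2000, lr=0.1, seed=0):
    """Learn which action to prefer from rewards alone.

    rewards: hypothetical fixed reward the environment gives for each action.
    """
    rng = random.Random(seed)
    prefs = [0.0] * len(rewards)
    expected = 0.0  # running average of rewards seen so far
    for step in range(1, steps + 1):
        probs = softmax(prefs)
        # Step 1: add randomness by sampling the action instead of always
        # taking the current favorite.
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        # Step 2: if the result beat expectations, make that action more likely.
        surprise = reward - expected
        for a in range(len(prefs)):
            chosen = 1.0 if a == action else 0.0
            prefs[a] += lr * surprise * (chosen - probs[a])
        expected += (reward - expected) / step
    return softmax(prefs)

# Actions for Spot: 0 = stand, 1 = sit, 2 = lie down. Only sitting earns a full treat.
policy = train_policy([0.0, 1.0, 0.2])
```

After training, the probability of action 1 (sitting) dominates the policy, while the other actions keep a small chance of being tried, which is the continued exploration described above.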
Reinforcement algorithms can also backfire. Let’s say that an agent is rewarded by the number of paper clips it makes. If the agent learns how to transform anything into paper clips, it may end up turning everything into paper clips.* If the reward does not punish the agent when it creates too many paper clips, the agent can misbehave. Reinforcement learning algorithms are also mostly inefficient because they spend a lot of time searching for the correct solution and adding randomized actions to find the right behavior. Even with these limitations, they can accomplish an overwhelming variety of tasks, such as playing Go at a superhuman level and making robotic arms grasp objects.
Another way of learning that is particularly useful with games is having multiple agents play against each other. Two classic examples are chess and Go, where two agents compete with each other. Agents learn what actions to take by being rewarded when they win the game. This technique is called self-play, and it can be used not only with a reinforcement learning algorithm, but also to generate data. In Go, for example, it can be used to figure out which plays are more likely to make a player win. Self-play generates data from computing power, that is, from the computer playing itself.
The three learning categories are each useful in different situations. Use supervised learning when there is a lot of available data that is labeled by people, such as when others tag people on Facebook. Unsupervised learning is used primarily when there is not much information about the data points that the system needs to figure out, such as in cyber attacks. One can infer that they are being attacked by looking at the data and seeing odd behaviors that were not there before the attack. The last, reinforcement learning, is mainly used when there is not much data about the task that the agent needs to achieve, but there are clear goals, like winning a chess game. Machine learning algorithms, more specifically deep learning algorithms, are trained with these three modes of learning.
I have always been convinced that the only way to get artificial intelligence to work is to do the computation in a way similar to the human brain. That is the goal I have been pursuing. We are making progress, though we still have lots to learn about how the brain actually works. Geoffrey Hinton*
Deep learning is a type of machine learning algorithm that uses multilayer neural networks and backpropagation as a technique to train the neural networks. The field was created by Geoffrey Hinton, the great-great-grandson of George Boole, whose Boolean algebra is a keystone of digital computing.*
The evolution of deep learning was a long process, so we must go back in time to understand it. The backpropagation technique first arose in the field of control theory in the 1950s. One of its first applications involved optimizing the thrust of the Apollo spacecraft as they headed to the moon.
The earliest neural networks, called perceptrons, were the first step toward human-like intelligence. However, a 1969 book, Perceptrons: An Introduction to Computational Geometry by Marvin Minsky and Seymour Papert, demonstrated the extreme limitations of the technology by showing that a shallow network, with only a few layers, could perform only the most basic computational functions. At the time, their book was a huge setback to the field of neural networks and AI.
Getting past the limitations pointed out in Minsky and Papert’s book requires multilayer neural networks. To create a multilayer neural network to perform a certain task, researchers first determine how the neural network will look by determining which neurons connect to which others. But to finish creating such a neural network, the researchers need to find the weights between each of the neurons—how much one neuron’s output affects the next neuron. The training step in deep learning usually does that. In that step, the neural network is presented with examples of data and the training software figures out the correct weights for each connection in the neural network so that it produces the intended results; for example, if the neural network is trained to classify images, then when presented with images that contain cats, it says there is a cat there.
Backpropagation is an algorithm that adjusts the weights in such a way that each change brings the neural network closer to the right output, and it does so far faster than was previously possible. It works by first measuring the error at the output and using it to calculate the adjustments for the neurons closest to the output. The error is then passed backward to the prior layer to calculate the adjustments there, and this process continues until the first layer of neurons is reached.
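To make the idea concrete, here is a minimal sketch of backpropagation for a network with one hidden layer, written in plain Python. It is an illustration under stated assumptions rather than the exact procedure from any particular paper: the sigmoid activation, the squared-error loss, the learning rate, and the function names (`init_params`, `forward`, `backprop`, `train`) are all choices made for this example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_params(n_inputs, n_hidden, seed=0):
    """Start from random weights, as training usually does."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    return (w_hidden, [0.0] * n_hidden, w_out, 0.0)

def forward(params, x):
    """Return the hidden activations and the network output for input x."""
    w_hidden, b_hidden, w_out, b_out = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
    return hidden, output

def loss(params, data):
    """Mean squared error over a dataset of (input, target) pairs."""
    return sum((forward(params, x)[1] - y) ** 2 for x, y in data) / len(data)

def backprop(params, x, y):
    """Gradients of the squared error for one example, computed output layer first."""
    _, _, w_out, _ = params
    hidden, output = forward(params, x)
    # Error signal at the output neuron.
    delta_out = 2 * (output - y) * output * (1 - output)
    grad_w_out = [delta_out * h for h in hidden]
    # Pass the error backward through the output weights to the hidden layer.
    delta_hidden = [delta_out * w * h * (1 - h) for w, h in zip(w_out, hidden)]
    grad_w_hidden = [[d * xi for xi in x] for d in delta_hidden]
    return grad_w_hidden, delta_hidden, grad_w_out, delta_out

def train(params, data, epochs=2000, lr=0.5):
    """Repeatedly nudge every weight against its gradient."""
    for _ in range(epochs):
        for x, y in data:
            g_wh, g_bh, g_wo, g_bo = backprop(params, x, y)
            w_hidden, b_hidden, w_out, b_out = params
            params = (
                [[w - lr * g for w, g in zip(row, g_row)]
                 for row, g_row in zip(w_hidden, g_wh)],
                [b - lr * g for b, g in zip(b_hidden, g_bh)],
                [w - lr * g for w, g in zip(w_out, g_wo)],
                b_out - lr * g_bo,
            )
    return params
```

A typical use is to call `init_params(2, 3)`, pass the result to `train` along with a small list of `(input, target)` pairs, and compare `loss` before and after. The key point is in `backprop`: the output error is computed once and reused for the hidden layer, instead of testing each weight separately.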
In 1986, Hinton published the seminal paper on deep neural networks (DNNs), “Learning representations by back-propagating errors,” with his colleagues David Rumelhart and Ronald Williams.* The article popularized backpropagation, a simple mathematical technique that led to huge advances in deep learning.
The backpropagation technique developed by Hinton finds the weights for each neuron in a multilayer neural network more efficiently. Before this technique, it took an exponential amount of time to find the weights (also known as coefficients) for a multilayer neural network, which made it extremely hard to find the correct coefficients for each neuron. Training a neural network to produce the correct outputs for its inputs used to take months or years, but this new technique took significantly less time.
Hinton’s breakthrough also showed that backpropagation made it easy to train a neural network with more than two or three layers, breaking through the limitation imposed by shallow neural networks. Backpropagation made it possible to find the weights that lead a multilayer neural network to the desired output. This development allowed scientists to train more powerful neural networks, making them much more relevant. For comparison, one of the most performant neural networks in vision, called Inception, has approximately 22 layers of neurons.
The figure below shows an example of both a simple neural network (SNN) and a deep learning neural network (DLNN). On the left of each network is the input layer, represented by the red dots. These receive the input data. In the SNN, the hidden layer neurons are then used to make the adjustments needed to reach the output (blue dots) on the right side. In contrast, the use of more than one layer characterizes the DLNN, allowing for far more complex behavior that can handle more involved input.
Figure: A simple neural network and a multilayer neural network.
The way researchers usually develop a neural network is first by defining its architecture: the number of neurons and how they are arranged. But the parameters of the neurons inside the neural network need determining. To do that, researchers initialize the neural network weights with random numbers. After that, they feed it the input data and determine if the output is similar to the one they want. If it is not, then they update the weights of the neurons until the output is the closest to what the training data shows.
For example, let’s say you want to classify some images as containing a hot dog and others as not containing a hot dog. To do that, you feed the neural network images containing hot dogs and others that do not, which is the training data. Following the initial training, the neural network is then fed new images and needs to determine if they contain a hot dog or not.
These input images are composed of a matrix of numbers, representing each pixel. The neural network goes through the image, and each neuron applies matrix multiplication, using the internal weights, to the numbers in the image, generating a new image. The outputs of the neurons are a stack of lower resolution images, which are then multiplied by the neurons on the next layer. On the final layer, a number comes out representing the solution. In this case, if it is positive, it means that the image contains a hot dog, and if it is negative, it means that it does not contain a hot dog.
The problem is that the weights are not defined in the beginning. The process of finding the weights that produce a positive number for images that contain a hot dog and a negative number for those that do not, known as training the network, is non-trivial. Because there are many weights in a neural network, it takes a long time to find the correct ones for all the neurons in a way that all the images are classified correctly. Simply too many possibilities exist. Additionally, depending on the input set, the network can become overtrained on the specific dataset, a problem known as overfitting: it focuses too narrowly on that dataset and cannot generalize to recognize images outside of it.
The complete process of training the network relies on passing the input data through the network multiple times. Each pass uses the output from the previous one to make adjustments in future passes. Each pass’s output is used to provide feedback to improve the network through backpropagation.
One of the reasons why backpropagation took so long to be developed was that it required computers to perform a great deal of multiplication, which they were pretty bad at in the 1960s and 1970s. At the end of the 1970s, one of the most powerful processors, the Intel 8086, could compute fewer than one million instructions per second.* For comparison,* the processor running on the iPhone 12 is more than one million times more powerful than that.*
Figure: Geoffrey Hinton, who founded the field of deep learning.
Deep learning only really took off in 2012, when Hinton and two of his Toronto students showed that deep neural networks, trained using backpropagation, beat state-of-the-art systems in image recognition by almost halving the previous error rate. Because of his work and dedication to the field, Hinton’s name became almost synonymous with the field of deep learning. He now has more citations than the next top three deep learning researchers combined.
After this breakthrough, deep learning started being applied everywhere, with applications including image classification, language translation, and speech recognition of the kind used by Siri. Deep learning models can improve any task that can be addressed by heuristics, those techniques previously defined by human experience or thought, including games like Go, chess, and poker as well as activities like driving cars. Deep learning will be used more and more to improve the performance of computer systems, with tasks like figuring out the order that processes should run in or what data should remain in a cache. These tasks can all be much more efficient with deep learning models. Storage will be a big application of it, and in my opinion, the use of deep learning will continue to grow.
It is not a coincidence that deep learning took off and performed better than most of the state-of-the-art algorithms: multilayer neural networks have two very important qualities.*
First, they can express the kind of very complicated functions needed to solve the problems we need to address. For example, if you want to understand what is going on in images, you need a function that takes the pixels and applies a complicated transformation that translates them into text or another representation in human language. Second, deep learning can learn from just processing data, rather than needing a feedback response. These two qualities make it extremely powerful since many problems, like image classification, require a lot of data.
The reason why deep neural networks are as good as they are is that they are equivalent to circuits, and a neuron can easily implement a Boolean function. For that reason, a deep enough network can simulate a computer given a sufficient number of steps. Each part of a neural network simulates the simplest part of a processor. That means that deep neural networks are as powerful as computers and, when trained correctly, can simulate any computer program.
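The claim that a neuron can implement a Boolean function is easy to check with a threshold neuron. In this sketch the `neuron` helper and the specific weights and biases are hand-picked assumptions for illustration, not values learned by training.

```python
def neuron(weights, bias):
    """Build a threshold neuron: it fires (1) when the weighted sum is positive."""
    def fire(*inputs):
        total = sum(w * x for w, x in zip(weights, inputs)) + bias
        return 1 if total > 0 else 0
    return fire

# Hand-picked weights that make single neurons behave like logic gates.
AND = neuron([1, 1], -1.5)
OR = neuron([1, 1], -0.5)
NOT = neuron([-1], 0.5)
NAND = neuron([-1, -1], 1.5)

def XOR(a, b):
    """XOR needs two layers: OR and NAND feed into an AND neuron."""
    return AND(OR(a, b), NAND(a, b))

xor_table = [XOR(a, b) for a in (0, 1) for b in (0, 1)]
```

Single neurons cover AND, OR, NOT, and NAND, while XOR only works once neurons are stacked in two layers. Since any digital circuit can be built from NAND gates, stacking enough of these neurons can reproduce any such circuit.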
Currently, deep learning is a battleground between Google, Apple, Facebook, and other technology companies that aim to serve individuals’ needs in the consumer market. For example, Apple uses deep learning to improve its models for Siri, and Google for its recommendation engine on YouTube. Since 2013, Hinton has worked part-time at Google in Mountain View, California, and Toronto, Canada. And, as of this writing in 2021, he is the lead scientist on the Google Brain team, arguably one of the most important AI research organizations in the world.
There are many types of deep neural networks, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory networks (LSTMs), and each has different properties. For example, recurrent neural networks are deep neural networks in which neurons in higher layers connect back to the neurons in lower layers. Here, we’ll focus on convolutional neural networks, which are computationally more efficient and faster than most other architectures.* They are extremely relevant as they are used for state-of-the-art text translation, image recognition, and many other tasks.
Figure: A recurrent neural network, where one of the neurons feeds back into a previous layer.
The first time Yann LeCun revolutionized artificial intelligence was a false start.* By 1995, he had dedicated almost a decade to what was considered a bad idea according to many computer scientists: that mimicking some features of the brain would be the best way to make artificial intelligence algorithms better. But LeCun finally demonstrated that his approach could produce something strikingly smart and useful.
At Bell Labs, LeCun worked on software that simulated how the brain works, more specifically, how the visual cortex functions. Bell Labs, a research facility owned by the then gigantic AT&T, employed some of the most eminent computer scientists of the era. The Unix operating system, which became the basis for Linux, macOS, and Android, was developed there. Not only that, but the transistor, the base of all modern computer chips, as well as the laser and two of the most widely used programming languages to date, C and C++, were also developed there. It was a hub of innovation, so it was not a coincidence that one of the most important deep learning architectures was born in the same lab.
Figure: An image of the primary visual cortex.
LeCun based his work on research done by Kunihiko Fukushima, a Japanese computer researcher.* Fukushima created a model of artificial neural networks based on how vision works in the human brain. The architecture was based on two types of neurons in the human brain called simple cells and complex cells. They are found in the primary visual cortex, the part of the brain that processes visual information.
Simple cells are responsible for detecting local features, like edges. Complex cells pool the results that simple cells produce within an area. For example, a simple cell may detect an edge that may represent a chair. Complex cells aggregate that information by informing the next higher level what the simple cells detected in the layer below.
The architecture of a CNN is based on a cascading model of these two types of cells, and it is mainly used for pattern recognition tasks. LeCun produced the first piece of software that could read handwritten text by looking at many different examples using this CNN model. With this work, AT&T started selling the first machines capable of reading handwriting on checks. For LeCun, this marked the beginning of a new era where neural networks would be used in other fields of AI. Unfortunately, it was not to be.
Figure: Yann LeCun, head of Facebook AI Research.
The same day that LeCun celebrated the launch of bank machines that could read thousands of checks per hour, AT&T announced it was splitting into three different companies, the result of an antitrust lawsuit by the US government. At that point, LeCun became the head of research at a much smaller AT&T and was directed to work on other things. In 2002, he left and eventually became head of the Facebook AI Research group.
LeCun continued working in neural networks, especially in convolutional neural networks, and slowly the rest of the machine learning world came around to the technology. In 2012, some of his students published a paper that demonstrated using CNNs to classify real-world house numbers better than all previous algorithms had been able to do. Since then, deep neural networks have exploded in use, and now most of the research developed in machine learning focuses on deep learning. Convolutional neural networks spread widely and have been used to beat most of the other algorithms for many applications, including natural language processing and image recognition.
The efforts of the team paid off. In 2017, every photo uploaded to Facebook was processed by multiple CNNs. One of them identified which people were in the picture, and another determined if there were objects in the picture. At that time, around 800 million photos were uploaded per day, so the throughput of the CNNs was impressive.
A convolutional neural network (or CNN) is a multilayer neural network. It is so named because it contains hidden layers that perform convolutions. A convolution is a mathematical operation on two functions: the integral of their product after one of them is reversed and shifted. For images, it means that you are running filters over the whole image and producing new images with those filters.
Most notably, the inputs to CNNs are usually images.*
On the layers performing convolution, each neuron walks through the image, multiplying the number representing each pixel by the corresponding weight in the neuron, generating a new image as the output.
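To make the idea of a neuron "walking through" an image concrete, here is a minimal sketch in Python with NumPy. The function name `convolve2d`, the tiny 4x4 image, and the filter weights are all illustrative assumptions, not taken from any particular library or from LeCun's software. As in most deep learning code, the filter is not flipped before it is applied, so strictly speaking this is a cross-correlation.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the filtered image."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            patch = image[row:row + kh, col:col + kw]
            # Multiply each pixel by the matching weight and add everything up.
            output[row, col] = np.sum(patch * kernel)
    return output

# A 4x4 image: the left half is white (0), the right half is black (255).
image = np.array([
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
], dtype=float)

# Hypothetical filter weights that respond to a left-to-right change in darkness.
vertical_edge = np.array([
    [-1.0, 1.0],
    [-1.0, 1.0],
])

feature_map = convolve2d(image, vertical_edge)
print(feature_map)  # the middle column is 510, the others are 0
```

The output image is large only in the column where the picture changes from white to black, which is the sense in which a filter "detects" an edge.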
Let’s examine how a convolutional neural network classifies images. First, we need to turn an image into something that a neural network can work with. An image is just data. We represent each pixel of the image as a number; in a black-and-white image, this can indicate how black that pixel is. The figure below represents the number 8. In this representation, 0 is white, and 255 is completely black. The closer the number is to 255, the darker the pixel is.
Figure: An image of eight (left) and the representation of eight in numbers (right).
Figure: The image on the top represents a regular neural network, while the bottom image represents a CNN. Every layer of a CNN transforms the 3D input volume into a 3D output volume.
Think of each neuron as a filter that goes through the entire image. Each layer may have multiple neurons. The figure below shows two neurons walking through the entire image. The red neuron first walks through the image, and then the green neuron does the same, each producing a new resulting image.
The resulting images can go directly to the next layer of the neural network and are processed by those neurons. The processed images in one layer can also be processed by a method called pooling before going to the next layer. The function of pooling is to simplify the results from the previous layers. This may consist of getting the maximum number that the pixels represent in a certain region (or neighborhood) or summing up the numbers in a neighborhood. This is done in multiple layers. When a neuron runs through an image, the next layer produces a smaller image by truncating data. This process is repeated over and over through successive layers. In the end, the CNN produces a list of numbers or a single number, depending on the application.
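Pooling can be sketched in a few lines. The example below assumes non-overlapping 2x2 neighborhoods and keeps the maximum in each one. The function name `max_pool` and the sample numbers are illustrative assumptions.

```python
import numpy as np

def max_pool(image, size=2):
    """Keep the largest value in each non-overlapping size x size neighborhood."""
    rows = image.shape[0] // size
    cols = image.shape[1] // size
    # Drop any rows or columns that do not fill a complete neighborhood.
    trimmed = image[:rows * size, :cols * size]
    # Group the pixels so that axes 1 and 3 index positions inside a neighborhood.
    blocks = trimmed.reshape(rows, size, cols, size)
    return blocks.max(axis=(1, 3))

feature_map = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
])

pooled = max_pool(feature_map)
print(pooled)  # a 2x2 image: [[4, 2], [2, 8]]
```

Replacing `blocks.max` with `blocks.sum` gives the summing variant mentioned above. Either way, the 4x4 image shrinks to 2x2, which is how successive layers end up working with smaller and smaller images.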
Figure: The image on the left shows what pooling looks like. The image on the right represents how one of the neurons filters and generates new images based on the input, that is, the convolution operation.
The image can then be classified according to the result. For example, if the resulting number is positive, the image can be classified as containing a hot dog, and if the resulting number is negative, then the image is classified as not containing a hot dog. But this assumes that we know what each neuron looks like, that is, what the filter looks like for every layer. In the beginning, the neurons are completely random, and by using the backpropagation technique, the neurons are updated in such a way that they produce the desired result.
Figure: An image of a cat that goes through a multilayer neural network. In the last step of this neural network, a number comes out. If it is positive, then the neural network classifies the image as a cat. If it is negative, it classifies the image as a dog.
A CNN is trained by showing it many images tagged with their results, known as labels. This set is called the training data. The neural network updates its weights, based on whether it classifies the images properly or not, using the backpropagation algorithm. After the training stage, the resulting neural network is the one used to classify new images. Even though CNNs were created based on how the visual cortex works, they can also be applied to text, for example. To do that, the inputs are translated into a matrix to match the format of an image.
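The training loop itself can be illustrated with a deliberately tiny stand-in: a single neuron with two weights, trained by gradient descent on four made-up two-pixel "images." This is only a sketch under those assumptions. A real CNN has many layers, and backpropagation carries the same kind of correction backward through all of them. The data, learning rate, and number of steps are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy "images" of two pixels each (values scaled to 0..1), with labels:
# 1 = mostly dark image, 0 = mostly light image.
images = np.array([[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]])
labels = np.array([1.0, 1.0, 0.0, 0.0])

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.01, size=2)  # start from small random weights
bias = 0.0
learning_rate = 1.0

for step in range(2000):
    predictions = sigmoid(images @ weights + bias)
    error = predictions - labels  # how wrong each prediction is
    # Gradient of the cross-entropy loss, averaged over the training set.
    weights -= learning_rate * (images.T @ error) / len(labels)
    bias -= learning_rate * error.mean()

# A positive score means "dark" and a negative score means "light."
scores = images @ weights + bias
print(scores > 0)
```

After training, the sign of each score matches its label, mirroring the positive and negative outputs described for the hot dog and cat examples.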
There is a misconception that deep neural networks are a black box, that is, that there is no way of knowing what they are doing. It is true that there is no way of determining in advance, for every possible input, whether image, sound, or text, what the resulting output will be or whether the network will classify it correctly. But that does not mean that there is no way of determining what each layer does in a neural network.
Figure: How the filters look (gray) for a CNN classifying objects and the corresponding images that activate these filters. The filters in these images in Layer 1 detect edges and, in Layer 2, detect waves and other patterns. Visualizations of Layer 1 and 2. Each layer illustrates 2 pictures, one which shows the filters themselves and one that shows the parts of the image that are most strongly activated by the given filter. For example, in the space labeled Layer 2, we have representations of the 16 different filters (on the left).
In fact, for CNNs, you can see what the filters look like and what kind of images activate each layer. The weights in each neuron can be interpreted as pictures. The figure above shows the filters at different layers and also some examples of images that activate these layers. For example, in the first layer of a multilayer CNN, the filters, or weights, for the neurons look like edges. That means that the filters will activate when edges are found. The second layer of filters shows that the types of images that they activate are a little more complex, with eyes, curves, and other shapes. The third layer activates with images such as wheels, profiles of people, birds, and faces. That means that at each layer of the neural network, more complex images are filtered. The first layer filters and passes the information to the next layer that says if an area contains edges or not. Then, the next layer uses that information, and from the detected edges, it will try to find wheels and so forth. The last layer will identify the categories that humans want to know about: it will identify, for example, whether the image contains a cat, hot dog, or human.
The brain sure as hell doesn’t work by somebody programming in rules. Geoffrey Hinton*
Google Brain started as a research project between Google employees Jeff Dean and Greg Corrado and Stanford Professor Andrew Ng in 2011.* But Google Brain turned into much more than simply a project. By acquiring companies such as DeepMind and key AI personnel like Geoffrey Hinton, Google has become a formidable player in advancing this field.
One of the early key milestones of deep neural networks resulted from the initial research led by Ng when he decided to process YouTube videos and feed them to a deep neural network.* Over the course of three days, he fed 10 million YouTube videos* to 1,000 computers with 16 cores each, using the 16,000 computer processors as a neural network to learn the common features in these videos. After being presented with a list of 20,000 different objects, the system recognized pictures of cats and around 3,000 other objects. It started to recognize 16% of the objects without any input from humans.
The same software that recognized cats was able to detect faces with 81.7% accuracy and human body parts with 76.7% accuracy.* With only the data, the neural network learned to recognize images. It was the first time that such a massive amount of data was used to train a neural network. This would become the standard practice for years to come. The researchers made an interesting observation: “It is worth noting that our network is still tiny compared to the human visual cortex, which is 10⁶ times larger in terms of the number of neurons and synapses.”*
Demis Hassabis was a child prodigy in chess, reaching the Master standard at age 13 as the second highest-rated player in the World Under-14 category, and he also “cashed at the World Series of Poker six times including in the Main Event.”* In 1994 at 18, he began his computer games career co-designing and programming the classic game Theme Park, which sold millions of copies.* He then became the head of AI development for an iconic game called Black & White at Lionhead Studios. Hassabis earned his PhD in cognitive neuroscience from University College London in 2009.
Figure: Demis Hassabis, CEO of DeepMind.
In 2010, Hassabis co-founded DeepMind in London with the mission of “solving intelligence” and then using that intelligence to “solve everything else.” Early in its development, DeepMind focused on algorithms that mastered games, starting with games developed for Atari.* Google acquired DeepMind in 2014 for $525M.
Figure: Breakout game.
To help the program play the games, the team at DeepMind developed a new algorithm, Deep Q-Network (DQN), that learned from experience. It started playing games like the famous Breakout game, interpreting the video and producing a command on the joystick. If the command produced an action where the player scored, then the learning software reinforced that action. The next time it played the game, it would likely do the same action. It is reinforcement learning, but with a deep neural network to determine the quality of a state-action combination. The DNN helps determine which action to take given the state of the game, and the algorithm learns over time after playing a few games and determining the best actions to take at each point.
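The "quality of a state-action combination" can be illustrated with the classic table-based form of Q-learning. This is a simplified sketch, not DeepMind's code: DQN replaces the table below with a deep neural network that reads the screen, and the state names, actions, learning rate, and discount are all hypothetical.

```python
from collections import defaultdict

ACTIONS = ["left", "stay", "right"]  # hypothetical joystick commands

def q_update(q_table, state, action, reward, next_state,
             learning_rate=0.5, discount=0.9):
    """Nudge Q(state, action) toward reward + discounted best future value."""
    best_next = max(q_table[(next_state, a)] for a in ACTIONS)
    target = reward + discount * best_next
    q_table[(state, action)] += learning_rate * (target - q_table[(state, action)])

def best_action(q_table, state):
    """Pick the action with the highest learned quality in this state."""
    return max(ACTIONS, key=lambda a: q_table[(state, a)])

q_table = defaultdict(float)  # every unseen state-action pair starts at 0

# The paddle moved right from a hypothetical state "ball_right" and scored a point.
q_update(q_table, "ball_right", "right", reward=1.0, next_state="ball_center")

print(q_table[("ball_right", "right")])    # 0.5
print(best_action(q_table, "ball_right"))  # right
```

Because the rewarded action now has a higher value than the alternatives, the agent is more likely to repeat it the next time it sees the same situation, which is the reinforcement described above.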
Figure: Games that DeepMind’s software played on Atari.* The AI performed better than human level at the ones above the line.
For example, in the case of Breakout,* after playing a hundred games, the software was still pretty bad and missed the ball often. But it kept playing, and after a few hours—300 games—the software improved and played at human ability. It could return the ball and keep it alive for a long time. After they let it play for a few more hours—500 games—it became better than the average human, learning to do a trick called tunneling, which involves systematically sending the ball to the side walls so that it bounces around on top, requiring less work and earning more reward. The same learning algorithm worked not only on Breakout but also for most of the 57 games that DeepMind tried the technique on, achieving superhuman level for most of them.
Figure: Montezuma’s Revenge.
The learning algorithm, however, did not perform well for all games. Looking at the bottom of the list, the software got a score of zero on Montezuma’s Revenge. DeepMind’s DQN software does not succeed in this game because the player needs to understand high-level concepts that people learn throughout their lifetime. For example, if you look at the game, you know that you are controlling the character and that ladders are for climbing, ropes are for swinging, keys are probably good, and the skull is probably bad.
Figure: Montezuma’s Revenge (left) and the teacher and student neural networks (right)
DeepMind improved the system by breaking the problem into simpler tasks. If the software could solve things like “jump across the gap,” “get to the ladder,” and “get past the skull and pick up the key,” then it could solve the game and perform well at the task. To attack this problem, DeepMind created two neural networks: the teacher and the student. The teacher is responsible for learning and producing these subproblems. The teacher sends these subproblems to another neural network called the student. The student takes actions in the game and tries to maximize the score, but it also tries to do what the teacher tells it. Even though they were trained with the same data as the old algorithm, plus some additional information, the communication between the teacher and the student allowed strategy to emerge over time, helping the agent learn how to play the game.
In the Introduction, we discussed the Go competition between Lee Sedol and AlphaGo. DeepMind developed AlphaGo with the goal of playing Go against the Grandmasters. October 2015 was the first time that software beat a professional human player at Go, a game that has around 10¹⁷⁰ positions, more possible positions than the number of moves in chess or even the total number of atoms in the universe (around 10⁸⁰). In fact, if every atom in the universe were a universe itself, there would still be fewer atoms than the number of positions in a Go game.
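The universe-of-universes comparison is easy to check with a few lines of arithmetic. The sketch below assumes the commonly cited orders of magnitude, roughly 10 to the power of 170 Go positions and 10 to the power of 80 atoms. Both are rough estimates rather than exact counts.

```python
# Assumed orders of magnitude (rough, commonly cited estimates).
go_positions = 10 ** 170
atoms_in_universe = 10 ** 80

# Pretend every atom in the universe contains a whole universe of atoms.
atoms_if_each_atom_were_a_universe = atoms_in_universe * atoms_in_universe

print(atoms_if_each_atom_were_a_universe < go_positions)  # True
# Go positions per atom, even in that imagined universe of universes:
print(go_positions // atoms_if_each_atom_were_a_universe == 10 ** 10)  # True
```

Under these assumptions, there would still be about ten billion Go positions for every atom.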
In many countries such as South Korea and China, Go is considered a national game, like football and basketball are in the US, and these countries have many professional Go players, who train from the age of 6.* If these players show promise in the game, they switch from a normal school to a special Go school where they play and study Go for 12 hours a day, 7 days a week. They live with their Go Master and other prodigy children. So, it is a serious matter for a computer program to challenge these players.
There are around 2,000 professional Go players in the world, along with roughly 40 million casual players. In an interview at the Google Campus,* Hassabis shared, “We knew that Go was much harder than chess.” He describes how he initially thought of building AlphaGo the same way that Deep Blue was built, that is, by building a system that did a brute-force search with a handcrafted set of rules.
But he realized that this technique would never work since the game is very contextual, meaning there was no way to create a program that could determine how one part of the board would affect other parts because of the huge number of possible states. At the same time, he realized that if he created an algorithm to beat the Master players in Go, then he would probably have made a significant advance in AI, more meaningful than Deep Blue.
Go is hard not only because the game has an astronomical number of possibilities, but also because, for a program to be good at playing Go, it needs to determine the best next move. To figure that out, the software needs to determine if a position is good or not. It cannot play all possibilities until the end because there are too many. Conventional wisdom held that it was impossible to determine the value of a certain game state for Go. It’s much simpler to do this for chess. For example, you can codify things like pawn structure and piece mobility, which are techniques Grandmasters use to determine if a position is good or bad. In Go, on the other hand, all pieces are the same. Even a single stone can modify the outcome of the game, so each one has a profound impact on the game.
What makes Go even harder is that it is a constructive game as opposed to a destructive one. In chess, you start with all the pieces on the board and take them away as you play, making the game simpler. The more you play, the fewer the possibilities there are for the next moves. Go, however, begins with an empty board, and you add pieces, making it harder to analyze. In chess, if you analyze a complicated middle game, you can evaluate the current situation, and that tells everything. To analyze a middle game in Go, you have to project into the future to examine the current situation of the board, which makes it much harder to analyze. In reality, Go is more about intuition and instinct rather than calculation like chess.
When describing the algorithm that AlphaGo produced, Hassabis said that it does not merely regurgitate human ideas and copy them. It genuinely comes up with original ideas. According to him, Go is an objective art because anyone can come up with an original move, but you can measure if that move or idea was pivotal for winning the game. Even though Go has been played at a professional level for 3,000 years, AlphaGo created new techniques and directly influenced how people played because AlphaGo was strategic and seemed human.
Figure: The Policy Network. Probability Distribution over moves. The darker the green squares, the higher the probability.
DeepMind developed AlphaGo mainly with two neural networks.* The first, the Policy Network, calculates the probability that a professional player would make each possible next move given the state of the board. Instead of looking for the next 200 moves, the Policy Network only looks at the next five to ten. By doing that, it reduces the number of positions AlphaGo has to search in order to find the best next move.
Figure: The Value Network. How the position evaluator sees the board. Darker blue represents places where the next stone leads to a more likely win for the player.
Initially, AlphaGo was trained using around 100,000 Go games from the internet. But once it could narrow down the search for the next move, AlphaGo played against itself millions of times and improved through reinforcement learning. The program learned through its own mistakes. If it won, the moves it had made became more likely to be chosen in the next game. With that information, it created its own database of millions of games of the system playing against itself.
Then, DeepMind trained a second neural network, the Value Network, which calculated the probability of a player winning based on the state of the board. It outputs 1 for black winning, 0 for white, and 0.5 for a tie. Together, these two networks turned what seemed an intractable problem into a manageable one. They transformed the problem of playing Go into a similar problem of solving a chess game. The Policy Network provides the ten next probable moves, and the Value Network gives the score of a board state. Given these two networks, the algorithm to find the best moves, the Monte Carlo Tree Search—similar to the one used to play chess, i.e., the min-max search described earlier—uses the probabilities to explore the space of possibilities.
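The division of labor between the two networks can be sketched as follows. This is a one-move lookahead, far simpler than the Monte Carlo Tree Search AlphaGo actually uses, and every name here (`choose_move`, `toy_policy`, `toy_value`, `toy_play`) is a hypothetical stand-in rather than part of DeepMind's system. The policy shortlists a handful of candidate moves, and the value function scores the board that results from each one.

```python
def choose_move(board, legal_moves, policy, value, play, top_k=10):
    """Let the policy shortlist candidate moves, then let the value function pick one."""
    probabilities = policy(board, legal_moves)  # maps each move to a probability
    shortlist = sorted(legal_moves, key=lambda m: probabilities[m], reverse=True)[:top_k]
    return max(shortlist, key=lambda m: value(play(board, m)))

def toy_policy(board, legal_moves):
    # Hypothetical stand-in: prefers moves close to position 2.
    weights = {m: 1.0 / (1 + abs(m - 2)) for m in legal_moves}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}

def toy_value(board):
    # Hypothetical stand-in: estimated chance of winning for the player who just moved.
    return 0.9 if board[-1] == 3 else 0.4

def toy_play(board, move):
    # A "board" here is just the tuple of moves played so far.
    return board + (move,)

move = choose_move((), [0, 1, 2, 3, 4], toy_policy, toy_value, toy_play, top_k=3)
print(move)  # 3
```

With `top_k=3`, only moves 2, 1, and 3 are evaluated, and the value function prefers move 3. Moves 0 and 4 are never examined at all, which is the saving the Policy Network provides.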
AlphaGo played the best player in the world, Lee Sedol, a legend of the game who had won 18 world titles. At stake was a $1M prize.
Figure: AlphaGo’s first 99 moves in game 2. AlphaGo controls the black stones.
AlphaGo played unimaginable moves, changing the way people would play Go in the years ahead. The European champion, Fan Hui, told Hassabis that his mind was free from the shackles of tradition after seeing the innovative plays from AlphaGo; he now considered unthinkable thoughts and was not constrained by the received wisdom that had bound him for decades.
“‘I was deeply impressed,’ Ke Jie said through an interpreter after the game. Referring to a type of move that involves dividing an opponent’s stones, he added: ‘There was a cut that quite shocked me, because it was a move that would never happen in a human-to-human Go match.’”* Ke, the current number one Go player in the world, stated, “Humanity has played Go for thousands of years, and yet, as AI has shown us, we have not even scratched the surface.” He continued, “The union of human and computer players will usher in a new era. Together, man and AI can find the truth of Go.”*
The DeepMind team did not stop after winning against the best player in the world. They decided to improve the software, so they merged the two deep neural networks, the Policy and Value Networks, creating AlphaGo Zero. “Merging these functions into a single neural network made the algorithm both stronger and much more efficient,” said David Silver, the lead researcher on AlphaGo. AlphaGo Zero removed a lot of redundancies between the two neural networks. The probability of adding a piece to a position (Policy Network) contains the information of who might win the game (Value Network). If there is only one position for the player to place a piece, it might mean that the player is cornered and has a low chance of winning. Or, if the player can place a piece anywhere on the board, that probably means that the player is winning because it does not matter where it places the next piece on the board. And because merging the networks removed these redundancies, the resulting neural network was smaller, making it much less complex and, therefore, more efficient. The network had fewer weights and required less training. “It still required a huge amount of computing power: four of the specialized chips called tensor processing units (TPUs), which Hassabis estimated to be US$25M of hardware. But its predecessors used ten times that number. It also trained itself in days rather than months. The implication is that ‘algorithms matter much more than either computing or data available’, said Silver.”*
Critics point out that what AlphaGo does is not actually learning because if you were to make the agent play the game with the same rules but changed the color of the pieces, then the program would be confused and would perform terribly. Humans can play well if there are minor changes to the game. But that can be solved by training the software with different games as well as different colored pieces. DeepMind made its algorithms generalize for other games like Atari by using a technique called synaptic consolidation. Without this new method, when an agent first learns to play the game, the agent saturates the neural connections with the knowledge on how to play the first game. Then, when the agent starts learning how to play a variation of the game or a different game, all the connections are destroyed in order to learn how to play the second game, producing catastrophic forgetting.
If a simple neural network is trained to play Pong controlling the paddle on the left, then after it learns how to play the game, the agent will always win, obtaining a score of 20 to 0. If you change the color of the paddle from green to black, the agent is still controlling the same paddle, but it will miss the ball every time. It ends up losing 20 to 0, showing a catastrophic failure. DeepMind solved this using inspiration from how a mouse’s brain works, and presumably how the human brain works as well. In the brain, there is a process called synaptic consolidation. It is the process in which the brain protects only neural connections that form when a particular new skill is learned.
Figure: AlphaGo Zero Elo rating over time. At 0 days, AlphaGo Zero has no prior knowledge of the game and only the basic rules as an input. At 3 days, AlphaGo Zero surpasses the abilities of AlphaGo Lee, the version that beat world champion Lee Sedol in 4 out of 5 games in 2016. At 21 days, AlphaGo Zero reaches the level of AlphaGo Master, the version that defeated 60 top professionals online and world champion Ke Jie in 3 out of 3 games in 2017. At 40 days, AlphaGo Zero surpasses all other versions of AlphaGo and, arguably, becomes the best Go player in the world. It does this entirely from self-play, with no human intervention and using no historical data.
Taking inspiration from biology, DeepMind developed an algorithm that does the same. After it plays a game, it identifies and protects only the neural connections that are the most important for that game. That means that whenever the agent starts playing a new game, spare neurons can be used to learn a new game, and at the same time, the knowledge that was used to play the old game is preserved, eliminating catastrophic forgetting. With this technique, DeepMind showed that the agents could learn how to play ten Atari games without forgetting how to play the old ones at superhuman level. The same technique can be used to make AlphaGo learn to play different variations of the Go game.
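One way to picture "protecting" important connections is as a penalty added to the loss while the second game is being learned. The sketch below assumes the quadratic form used in the approach DeepMind published under the name elastic weight consolidation, where each weight is anchored to its old value in proportion to how important it was. The importance numbers here are hypothetical; in the published method they are estimated from the data of the first game.

```python
import numpy as np

def consolidation_penalty(weights, old_weights, importance, strength=1.0):
    """Quadratic cost for moving weights that mattered for the old game."""
    return 0.5 * strength * np.sum(importance * (weights - old_weights) ** 2)

old_weights = np.array([1.0, -2.0, 0.5])  # weights after learning the first game
importance = np.array([10.0, 0.0, 1.0])   # hypothetical importance of each weight

# Moving a spare (unimportant) weight costs nothing.
move_spare = consolidation_penalty(np.array([1.0, 3.0, 0.5]), old_weights, importance)
# Moving a vital weight by the smaller amount of 1.0 is penalized.
move_vital = consolidation_penalty(np.array([2.0, -2.0, 0.5]), old_weights, importance)

print(move_spare)  # 0.0
print(move_vital)  # 5.0
```

During training on the new game, this penalty would be added to the usual loss, so gradient descent is free to repurpose spare weights while vital ones stay close to where the first game left them.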
OpenAI, a research institute started by tech billionaires including Elon Musk, Sam Altman, Peter Thiel, and Reid Hoffman, had a new challenge. OpenAI was started to advance artificial intelligence and prevent the technology from turning dangerous. In 2016, led by CTO and co-founder, Greg Brockman, they started looking on Twitch, an internet gaming community, to find the most popular games that had an interface a software program could interact with and that ran on the Linux operating system. They selected Dota 2. The idea was to make a software program that could beat the best human player as well as the best teams in the world—a lofty goal. By a wide margin, Dota 2 would be the hardest game at which AI would beat humans.
At first glance, Dota 2 may look less cerebral than Go and chess because of its orcs and creatures. The game, however, is much more difficult than those strategy games because the board itself and the number of possible moves are much greater. Not only that, but there are around 110 heroes, and each one has at least four moves. The average number of possible moves per turn for chess is around 20 and for Go, about 200. Dota 2 has on average 1,000 possible moves for every eighth of a second, and the average match lasts around 45 minutes. Dota 2 is no joke.
Figure: OpenAI team gathered at its headquarters in San Francisco, California.
OpenAI’s first challenge was to beat the best humans in a one-on-one match. That meant that a machine only played against a single human controlling their own hero. That is a much easier feat for a computer than playing against a team of five humans (5v5) because a one-on-one match requires no coordination between teammates. With a team, one player could defend and another attack at the same time, for example.
To beat the best humans at this game, OpenAI developed a system called Rapid. In this system, AI agents played against themselves millions of times per day, using reinforcement learning to train a multilayer neural network, an LSTM specifically, so that it could learn the best moves over time. With all this training, by August of the following year, OpenAI agents defeated the best Dota 2 players in one-on-one matches and remained undefeated against them.
Figure: OpenAI Matchmaking rating showing the program’s skill level for Dota 2 over time.
OpenAI then focused on the harder version of the game: 5v5. The new agent was aptly named OpenAI Five. To train it, OpenAI again used its Rapid system, playing the equivalent of 180 years of Dota 2 per day. It ran its simulations using the equivalent of 128,000 computer processors and 256 GPUs. Training focused on updating an enormous long short-term memory network with the games and applying reinforcement learning. OpenAI Five takes an impressive amount of input: 20,000 numbers representing the board state and what the players are doing at any given moment.*
Figure: The Casper Match at which OpenAI beat the Casper team.
In January of 2018, OpenAI started testing its software against bots, and its AI agent was already winning against some of them. By August of the same year, OpenAI Five beat some of the best human teams in a 5v5 match, albeit with some limitations to the rules.
So, the OpenAI team decided to play against the best team in the world at the biggest Dota 2 competition, The International, in Vancouver, Canada. At the end of August, the AI agents played against the best team in the world in front of a huge audience in a stadium, and hundreds of thousands of people watched the game through streaming. In a very competitive match, it lost against the human team. But in less than a year, in April 2019, OpenAI Five beat the best team in the world in twin back-to-back games at the International.*
Computer languages of the future will be more concerned with goals and less with procedures specified by the programmer. Marvin Minsky*
The Software 2.0 paradigm started with the development of the first deep learning language, TensorFlow.
Based on the goal, the programmer writes the skeleton of the program by defining the neural network architecture(s). Then, the programmer uses the computer hardware to find the exact neural network that best performs the specified goal and feeds it data to train the neural network. With traditional software, Software 1.0, most programs are stored as programmer-written code that can span thousands to billions of lines of code. For example, Google’s entire codebase has around two billion lines of code.* But in the new paradigm, the program is stored in memory as the weights of the neural architecture, with few lines of code written by programmers. There are disadvantages to this new approach: software developers sometimes have to choose between software that they understand but that only works 90% of the time and a program that performs well in 99% of cases but is not as well understood.
Some languages were created only for writing Software 2.0, that is, programming languages to help build, train, and run these neural networks. The most well-known and widely used one is TensorFlow. Developed by Google and released internally in 2015, it now powers Google products like Smart Reply and Google Photos, but it was also made available for external developers to use. It is now more popular than the Linux operating system by some metrics. It became widely used by developers, startups, and other big companies for all types of machine learning tasks, including translation from English into Chinese and reading handwritten text. TensorFlow is used to create, train, and deploy a neural network to perform different tasks. But to train the resulting network, the developer must feed it data and define the goal the neural network optimizes for. This is as important as defining the neural network.
Because a large part of the program is the data fed to it, there is growing concern that the datasets used for these networks may not represent all possible scenarios that the program may run into. The data has become essential for the software to work as expected. One of the problems is that sometimes the data might not represent all use cases that a programmer wants to cover when developing the neural network. And the data might not represent the most important scenarios. So, the size and variety of the dataset have become more and more important in order to have neural networks that perform as expected.
For example, let’s say that you want a neural network that creates a bounding box around cars on the road. The data needs to cover all cases. If there is a reflection of a car on a bus, then the data should not have it labeled as a car on the road. For the neural network to learn that, the programmer needs to have enough data representing this use case. Or, let’s say that five cars are in a car carrier. Should the software create a bounding box for each of the automobiles or just for the car carrier? Either way, the programmer needs enough examples of these cases in the dataset.
Another example is when most of a car’s training data was gathered in certain lighting conditions or with a specific vehicle. Then, if those same algorithms encounter a vehicle with a different shape or in different lighting, they may behave differently. One example happened to Tesla when the self-driving software was engaged and did not notice the trailer in front of the car. The white side of the tractor trailer against a brightly lit sky was hard to detect. The crash resulted in the death of the driver.*
Labeling, that is, creating the dataset and annotating it with the correct information, is an important iterative process that takes time and experience to make work correctly. The data needs to be captured and cleaned. It is not something done once and then it is complete. Rather, it is something that evolves.
TensorFlow has been used not only by developers, startups, and large corporations, but also by individuals. One surprising story is of a Japanese cucumber farmer. An automotive engineer, Makoto Koike, helped his parents sort cucumbers by size, shape, color, and other attributes on their small family farm in the small city of Kosai. For years, they sorted their cucumbers manually. It happens that cucumbers in Japan have different prices depending on their characteristics. For example, more colorful cucumbers and ones with many prickles are more expensive than others. Farmers pull aside the cucumbers that are more expensive so that they are paid fairly for their crop.
The problem is that it is hard to find workers to sort them during harvest season, and there are no machines sold to small farmers to help with the cucumber sorting. They are either too expensive or do not provide the capabilities small farms need. Makoto’s parents separated the cucumbers by hand, which is as hard as growing them and takes months. Makoto’s mother used to spend eight hours per day sorting them. So, in 2015, after seeing how AlphaGo defeated the best Go players, Makoto had an idea. He decided to use TensorFlow, Google’s open-source machine learning library, to develop a cucumber-sorting machine.
To do that, he snapped 7,000 pictures of cucumbers harvested on his family’s farm. Then, he tagged the pictures with the properties that each cucumber had, adding information regarding their color, shape, size, and whether they contained prickles. He used a popular neural network architecture and trained it with the pictures that he took. At the time, Makoto did not train the neural network with the computer servers that Google offered because they charged by time used. Instead, he trained the network using his low-power desktop computer. Therefore, to train his tool in a timely manner, he converted the pictures to a smaller size of 80x80 pixels. The smaller the size of the images, the faster it is to train the neural network because the neural network is smaller as well. But even with the low resolution, it took him three days to train his neural network.
After all the work, “when I did a validation with the test images, the recognition accuracy exceeded 95%. But if you apply the system with real-use cases, the accuracy drops to about 70%. I suspect the neural network model has the issue of ‘overfitting,’” Makoto stated.*
Overfitting, also called overtraining, is the phenomenon in which a machine learning model fits its training data so closely that it performs well on that data but poorly on new, unseen data.
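The gap Makoto describes can be reproduced on a toy problem. The following is a minimal sketch, assuming a made-up one-dimensional dataset with noisy labels rather than cucumber images; the names `make_data`, `memorizer_predict`, and `threshold_predict` are hypothetical. A model that simply memorizes its training points scores perfectly on them but noticeably worse on fresh data, while a simpler rule holds up better.

```python
import random


def make_data(n, rng, noise=0.2):
    """Toy dataset: x in [0, 1), label is (x > 0.5), flipped 20% of the time."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label  # simulate a mislabeled example
        data.append((x, label))
    return data


def memorizer_predict(train, x):
    """Overfit model: copy the label of the single closest training point."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]


def threshold_predict(x):
    """Simple model: ignore the noise and use one threshold."""
    return x > 0.5


def accuracy(predict, data):
    return sum(predict(x) == label for x, label in data) / len(data)


rng = random.Random(0)
train = make_data(200, rng)
test = make_data(2000, rng)

memorizer_train_acc = accuracy(lambda x: memorizer_predict(train, x), train)
memorizer_test_acc = accuracy(lambda x: memorizer_predict(train, x), test)
simple_test_acc = accuracy(threshold_predict, test)

print(f"memorizer on training data: {memorizer_train_acc:.2f}")
print(f"memorizer on unseen data:   {memorizer_test_acc:.2f}")
print(f"simple rule on unseen data: {simple_test_acc:.2f}")
```

The memorizer has learned the noise in its training set along with the signal, which is the same pattern as a validation score that collapses in real use.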
Makoto created a machine that was able to help his parents sort the cucumbers by shape, color, length, and level of distortion. It was not able to figure out whether the cucumbers had many prickles because of the low-resolution images used for training. But the resulting machine turned out to be very helpful for his family and cut down the time that they spent manually sorting their produce.
The same technology built by one of the largest companies in the world and used to power its many products was also used by a small farmer on the other side of the globe. TensorFlow democratized access so many people could develop their own deep learning models. It will not be surprising to find many more “Makotos” out there.
In Software 1.0, problems (called bugs) happened mostly because a person wrote logic that did not account for edge cases or handle all the possible scenarios. In the Software 2.0 stack, bugs look very different: they arise when the data misleads the neural network.
One example of such a bug was when the autocorrect for iOS started using a weird character “# ?” to replace the word “I” when sending a message. The operating system mistakenly autocorrected the spelling of I because, at some point, the data it received taught it to do so. The model learned that “# ?” was the correct spelling according to the data. As soon as someone sent “I,” the model thought it was important to fix and replaced it everywhere it could. The bug spread like a virus, reaching millions of iPhones. Given how quickly these bugs can spread and how much damage they can do, it is extremely important that the data, as well as the programs, is well tested so that these edge cases do not make programs fail.
“What I cannot create, I do not understand.” Richard Feynman*
One of the past decade’s most important developments in deep learning is generative adversarial networks, developed by Ian Goodfellow. This new technology can also be used for ill intent, such as for generating fake images and videos.
Figure: Faces generated by a GAN. The above images look real, but more than that, they look familiar.* They resemble a famous actress or actor that you may have seen on television or in the movies. They are not real, however. A new type of neural network created them.
GAN, or generative adversarial network, is a class of machine learning framework where two neural networks play a cat and mouse game. One creates fake images that look like the real ones fed into it, and the other decides if they are real.
Generative adversarial networks (GANs), sometimes called generative networks, created these fake images. The Nvidia research team used this new technique by feeding thousands of photos of celebrities to a neural network. The neural network has in turn produced thousands of pictures, like the ones above, that resemble the famous faces. They look real, but machines created them. GANs allow researchers to build images that look like the real ones by sharing many features of the images the neural network was fed. It can be fed photographs of objects from tables to animals, and after being trained, it produces pictures that resemble the originals.
Figure: Of the two images above, can you tell the real from the fake?*
To generate these images, the Nvidia team set up two neural networks: one that produced the pictures and another that determined whether they were real or fake. Combining these two neural networks produced a GAN, or generative adversarial network. The two play a cat and mouse game, where one creates fake images that look like the real ones fed into it, and the other decides if they are real. Does this remind you of anything? The Turing test. Think of the networks as playing the guessing game of whether the images are real or fake.
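The cat and mouse game can be written down as a single objective. The formula below is the minimax objective from the original GAN paper, given here as background rather than as a description of Nvidia's specific training setup. The symbols are not used elsewhere in this text: G is the network that produces images from random noise z, D is the network that outputs the probability that an image is real, and x is a real training image.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The judging network D tries to make this value large by scoring real images near 1 and generated images near 0, while the generating network G tries to make it small by producing images that D scores as real.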
After the GAN has been trained, one of the neural networks creates fake images that look like the real ones used in training. The resulting pictures look exactly like pictures of real people. This technique can generate large amounts of fake data that can help researchers predict the future or even construct simulated worlds. That is why for Yann LeCun, Director of Facebook AI Research, “Generative Adversarial Networks is the most interesting idea in the last ten years in machine learning.”* GANs will be helpful for creating images and maybe creating software simulations of the real world, where developers can train and test other types of software. For example, companies writing self-driving software for cars can train and check their software in simulated worlds. I discuss this in detail later in this book.
These simulated worlds and situations are now handcrafted by developers, but some believe that these scenarios will all be created by GANs in the future. GANs generate new images and videos from very compressed data. Thus, you could use a GAN’s two neural networks to save data and then reinstate it. Instead of zipping your files, you could use one neural network to compress it and the other to generate the original videos or images. It is no coincidence that in the human brain some of the apparatus used for imagination is the same as the one used for memory recall. Demis Hassabis, the founder of DeepMind, published a paper* that “showed systematically for the first time that patients with damage to their hippocampus, known to cause amnesia, were also unable to imagine themselves in new experiences.”* The finding established a link between the constructive process of imagination* and the reconstructive process of episodic memory recall.* There are more details regarding this later in this book.
Figure: Increasingly realistic synthetic faces generated by variations on generative adversarial networks through the years.
Ian Goodfellow, the creator of GANs, came up with the idea at a bar in Montreal when he was with fellow researchers discussing what goes into creating photographs. The initial plan was to understand the statistics that determined what created photos, and then feed them to a machine so that it could produce the pictures. Goodfellow thought that the idea would never work because there are too many statistics needed. So, he thought about using a tool, a neural network. He could teach neural networks to figure out the underlying characteristics of the pictures fed to the machine and then generate new ones.
Figure: Ian Goodfellow, creator of generative adversarial networks.
Goodfellow then added two neural networks so that they could together build realistic photographs. One created fake images, and the other determined if they were real. The idea was that one of the adversary networks would teach the other how to produce images that could not be distinguished from the real ones.
On the same night that he came up with the idea, he went home, a little bit drunk, and stayed up coding the initial concept of a GAN on his laptop. It worked on the first try. A few months later, he and a few other researchers published the seminal paper on GANs at a conference.* The GAN was trained on handwritten digits from a well-known image dataset called MNIST.*
In the following years, hundreds of papers were published using the idea of GANs to produce not only images but also videos and other data. Now at Google Brain, Goodfellow leads a group that is making the training of these two neural networks very reliable. The result from this work is services that are far better at generating images and learning sounds, among other things. “The models learn to understand the structure of the world,” Goodfellow says. “And that can help systems learn without being explicitly told as much.”
Figure: Synthetically generated word images.*
GANs could eventually help neural networks learn with less data, generating more synthetic images that are then used to identify and create better neural networks. Recently, a group of researchers at Dropbox improved their mobile document scanner by using synthetically generated images. GANs produced new word images that, in turn, were used to train the neural network.
And, that is just the start. Researchers believe that the same technique can be applied to develop artificial data that can be shared openly on the internet while not revealing the primary source, making sure that the original data stays private. This would allow researchers to create and share healthcare information without sharing sensitive data about patients.
GANs also show promise for predicting the future. It may sound like science fiction now, but that might change over time. LeCun is working on writing software that can generate video of future situations based on current video. He believes that human intelligence lies in the fact that we can predict the future, and therefore, GANs will be a powerful force for artificial intelligence systems in the future.*
Even though GANs are generating new images and sounds, some people ask if GANs generate new information. Once a GAN is trained on a collection of data, can it produce data that contains information outside of its training data? Can it create images that are entirely different from the ones fed to it?
A way of analyzing that is by what is called the Birthday Paradox Test. This test derives its name from the implication that if you put 23—two soccer teams plus a referee—random people in a room, the chance that two of them have the same birthday is more than 50%.
This effect happens because with 365 days in a year, you need at least a number of people around the square root of that to see a duplicate birthday. The Birthday Paradox says that for a discrete distribution that has support N, then a random sample size of √N would likely contain a duplicate. What does that mean? Let me break it down.
If there are 365 days in a year, then you need about the square root of 365 (√365 ≈ 19) people to have a reasonable chance that two of them share a birthday. But this also works in the other direction. If you do not know the number of days in a year, then you can select a fixed number of people and ask them for their birthdays. If two people share a birthday, you can infer the approximate number of days in a year from the number of people. If you have 22 people in the room, then the estimated number of days in a year is the square of 22 (22² = 484), about 484 days per year, a rough approximation of the actual number of days in a year.
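The arithmetic behind the paradox is short enough to check directly. This is a small sketch, with `prob_shared_birthday` as an illustrative name, that multiplies out the chance that everyone in the room has a distinct birthday and subtracts it from one. It assumes birthdays are uniformly distributed and ignores leap years.

```python
import math


def prob_shared_birthday(people, days=365):
    """Probability that at least two of `people` share a birthday."""
    prob_all_distinct = 1.0
    for i in range(people):
        # The (i+1)-th person must avoid the i birthdays already taken.
        prob_all_distinct *= (days - i) / days
    return 1.0 - prob_all_distinct


square_root_estimate = round(math.sqrt(365))

for people in (square_root_estimate, 22, 23):
    print(people, "people:", round(prob_shared_birthday(people), 3))
```

With 19 people the chance of a shared birthday is roughly 38%, and it first passes 50% at 23 people, so the square root gives the right order of magnitude rather than the exact crossover point.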
The same test can check the size of the original distribution of a GAN’s generated images. If the result reveals that a set of K images contains duplicates with reasonable probability, then you can suspect that the number of original images is about K². So, if a test shows that it is very likely to find a duplicate in a set of 20 images, then the size of the original set of images is approximately 400. This test can be run by selecting subsets of images and checking how often we find duplicates in these subsets. If we find duplicates in more than 50% of the subsets of a certain length, we can use that size for our approximation.
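The procedure can be simulated on a stand-in for a generator. The sketch below assumes a hypothetical "generator" that can only ever produce 400 distinct outputs, drawn uniformly at random, and that duplicates can be detected by exact equality. For real images, spotting duplicates means finding near-identical pictures, which is much harder and is not shown here. The function names are illustrative.

```python
import random


def has_duplicate(samples):
    return len(set(samples)) < len(samples)


def duplicate_rate(draw, batch_size, trials, rng):
    """Fraction of batches of `batch_size` samples that contain a duplicate."""
    hits = 0
    for _ in range(trials):
        batch = [draw(rng) for _ in range(batch_size)]
        hits += has_duplicate(batch)
    return hits / trials


def estimate_support(draw, rng, trials=500, max_batch=1000):
    """Smallest batch size K with duplicates in over 50% of batches, and K squared."""
    for k in range(2, max_batch + 1):
        if duplicate_rate(draw, k, trials, rng) > 0.5:
            return k, k * k
    return None, None


# Hypothetical generator limited to 400 distinct outputs.
TRUE_SUPPORT = 400


def toy_generator(rng):
    return rng.randrange(TRUE_SUPPORT)


rng = random.Random(0)
batch_size, estimated_support = estimate_support(toy_generator, rng)
print("batch size with >50% duplicates:", batch_size)
print("estimated number of distinct outputs:", estimated_support)
```

The estimate is only good to within a small constant factor, since the 50% point falls slightly above the square root of the support size, but that is enough to tell a generator with hundreds of distinct outputs from one with millions.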
With that test in hand, researchers have shown that images generated by famous GANs do not generalize beyond what the training data provides. Now, what is left to prove is whether GANs can be improved to generalize beyond the training data or if there are ways of generalizing beyond the original images by using other methods to improve the training dataset.
There are concerns that people can use the GAN technique with ill intent.* With so much attention on fake media, we could face an even broader range of attacks with fake data. “The concern is that these methods will rise to the point where it becomes very difficult to discern truth from falsity,” said Tim Hwang, who previously oversaw AI policy at Google and is now director of the Ethics and Governance of Artificial Intelligence Fund, an organization supporting ethical AI research. “You might believe that accelerates problems we already have.”*
Even though this technique has so far been applied mostly to still images, researchers believe that the same technology could produce videos, games, and virtual reality. The work to start generating videos has already begun. Researchers are also using a wide range of other machine learning methods to generate faux data. In August of 2017, a group of researchers at the University of Washington was featured in headlines when they built a system that could put words in Barack Obama’s mouth in a video. An app with this technique is already on Apple’s App Store.* The results were not completely convincing, but the rapid progress in the area of GANs and other techniques points to a future where it becomes tough for people to differentiate between real videos and generated ones. Some researchers claim that GANs are just another tool like others that can be used for good or evil and that there will be more technology to figure out if the newly created videos and images are real.
Not only that, but researchers have uncovered ways of using GANs to generate audio that sounds like one thing to humans but something else to machines.* For example, you can develop audio that sounds to humans like “Hi, how are you?” and to machines, “Alexa, buy me a drink.” Or, audio that sounds like a Bach symphony to a human, but for the machine, it sounds like “Alexa, go to this website.”
The future has unlimited potential. Digital media may surpass analog media by the end of this decade. We are starting to see examples of this with companies like Synthesia.* Another example is Imma, an Instagram model that is completely computer generated and has around 350k followers.* It won’t be surprising to see more and more digital media in the world as the cost of creating such content goes down and creators can make it into anything they want.
“Data is the new oil. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value.” Clive Humby*
Data is key to deep learning, and one of the most important datasets, ImageNet, created by Fei-Fei Li, marked the beginning of the field. It is used for training neural networks as well as to benchmark them against others.
Deep learning is a revolutionary field, but for it to work as intended, it requires data.* The term for these large datasets and the work around them is Big Data, which refers to the abundance of digital data. Data is as important for deep learning algorithms as the architecture of the network itself, the software. Acquiring and cleaning the data is one of the most valuable aspects of the work. Without data, neural networks cannot learn.*
Most of the time, researchers can use the data given to them directly, but there are many instances where the data is not clean. That means it cannot be used directly to train the neural network because it contains data that is not representative of what the algorithm wants to classify. Perhaps it contains bad data, like black-and-white images when you want to create a neural network to locate cats in colored images. Another problem is when the data is not appropriate. For example, when you want to classify images of people as male or female. There might be pictures without the needed tag or pictures that have the information corrupted with misspelled words like “ale” instead of “male.” Even though these might seem like crazy scenarios, they happen all the time. Handling these problems and cleaning up the data is known as data wrangling.
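A small data wrangling step might look like the following sketch. It assumes each record is a dictionary with a hypothetical `label` field and that a person has reviewed the typos and written down the corrections in `CORRECTIONS`. Both the field name and the correction table are illustrative, not taken from any particular dataset.

```python
VALID_LABELS = {"male", "female"}

# Hypothetical, manually reviewed fixes for known typos in the labels.
CORRECTIONS = {"ale": "male"}


def clean_records(records):
    """Split records into usable ones (with normalized labels) and rejects."""
    cleaned, rejected = [], []
    for record in records:
        raw = record.get("label")
        # Normalize whitespace and case; treat missing labels as empty.
        label = raw.strip().lower() if isinstance(raw, str) else ""
        label = CORRECTIONS.get(label, label)
        if label in VALID_LABELS:
            cleaned.append({**record, "label": label})
        else:
            rejected.append(record)  # keep for manual review
    return cleaned, rejected


records = [
    {"file": "a.jpg", "label": "male"},
    {"file": "b.jpg", "label": " Female "},
    {"file": "c.jpg", "label": "ale"},   # misspelled tag
    {"file": "d.jpg", "label": None},    # tag present but empty
    {"file": "e.jpg"},                   # tag missing entirely
]

cleaned, rejected = clean_records(records)
print([r["label"] for r in cleaned])
print([r["file"] for r in rejected])
```

Setting rejected records aside, instead of silently dropping them, makes it possible to see how much data is being lost and why.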
Researchers also sometimes have to fix problems with how data is represented. In some places, the data might be expressed one way, and in others the same data can be described in a completely different way. For example, a disease like diabetes might be classified with a certain number (3) in one database and (5) in another. This is one reason for the considerable effort in industries to create standards for sharing data more easily. For example, Fast Healthcare Interoperability Resources (FHIR) was created by the international standards organization Health Level Seven International to standardize the exchange of electronic health records.
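Before a shared standard exists, this kind of mismatch is usually handled with an explicit translation table. The sketch below assumes two hypothetical sources, `database_a` and `database_b`, that use the codes 3 and 5 for diabetes as in the example above. It is an illustration of the idea and is not how FHIR itself represents conditions.

```python
# Hypothetical per-source code tables mapping local codes to a shared name.
CODE_MAPS = {
    "database_a": {3: "diabetes"},
    "database_b": {5: "diabetes"},
}


def standardize(source, code):
    """Translate a source-specific disease code into the shared vocabulary."""
    try:
        return CODE_MAPS[source][code]
    except KeyError:
        # Fail loudly: guessing a diagnosis would corrupt the dataset.
        raise ValueError(f"unknown code {code!r} for source {source!r}") from None


print(standardize("database_a", 3))
print(standardize("database_b", 5))
```

Raising an error for an unknown code is a deliberate choice here, since an unmapped value is safer to reject than to pass through under the wrong meaning.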
Standardizing data is essential, but selecting the correct input is also important because the algorithm is created based on the data.* And, choosing that data is not easy. One of the problems that can occur when selecting data is that it can be biased in some way, creating a problem known as selection bias. That means that the data used to train the algorithm does not necessarily represent the entire space of possibilities. The saying in the industry is, “Garbage in, garbage out.” That means that if the data entered into the system is not correct, then the model will not be accurate.
Fei-Fei Li, who was the director of the Stanford Artificial Intelligence Laboratory and also the Chief Scientist of AI/ML at Google Cloud, could see data was essential to the development of machine learning algorithms early on,* before many of her colleagues.
Figure: Professor Fei-Fei Li.
Li realized that to make better algorithms and more performant neural networks, more and better data was needed and that better algorithms would not come without that data. At the time, the best algorithms could perform well with the data that they were trained and tested with, which was very limited and did not represent the real world. She realized that for the algorithms to perform well, data needed to resemble actuality. “We decided we wanted to do something that was completely historically unprecedented,” Li said, referring to a small team initially working with her. “We’re going to map out the entire world of objects.”
To solve the problem, Li constructed one of the most extensive datasets for deep learning to date, ImageNet. The dataset was created, and the paper describing the work was published in 2009 at one of the key computer vision conferences, Computer Vision and Pattern Recognition (CVPR), in Miami, Florida. The dataset was very useful for researchers and because of that, it became more and more famous, providing the benchmark for one of the most important annual deep learning competitions, which tested and trained algorithms to identify objects with the lowest error rate. ImageNet became the most significant dataset in the computer vision field for a decade and also helped boost the accuracy of algorithms that classified objects in the real world. In only seven years, the winning algorithms’ accuracy in classifying objects in images increased from 72% to nearly 98%, overtaking the average human’s ability.
But ImageNet was not the overnight success many imagine. It required a lot of sweat from Li, beginning when she taught at the University of Illinois Urbana-Champaign. She was dealing with problems that many other researchers shared. Most of the algorithms were overtraining to the dataset given to them, making them unable to generalize beyond it. The problem was that most of the data presented to these algorithms did not contain many examples, so they did not have enough information about all the use cases for the models to work in the real world. She, however, figured out that if she generated a dataset that was as complex as reality, then the models should perform better.
It is easier to identify a dog if you see a thousand pictures of different dogs, at different camera angles and in different lighting conditions, than if you only see five dog pictures. In fact, it is a well-known rule of thumb that algorithms can extract the right features from images if there are around 1,000 images for a certain type of object.
Li started looking for other attempts to create a representation of the real world, and she came across a project, WordNet, created by Professor George Miller. WordNet was a dataset with a hierarchical structure of the English language. It resembled a dictionary, but instead of having an explanation for each word, it had a relation to other words. For example, the word “monkey” is underneath the word “primate,” which is in turn underneath the word “mammal.” In this way, the dataset captured how each word relates to the others.
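The hierarchy can be pictured as a table in which every word points to the broader word above it. The sketch below is a toy version with a handful of entries. Only the monkey, primate, and mammal chain comes from the text; the other entries and the function names are illustrative assumptions, and the real WordNet is far larger and richer.

```python
# Each word points to the broader category directly above it.
PARENT = {
    "monkey": "primate",
    "primate": "mammal",
    "cat": "carnivore",      # illustrative addition
    "carnivore": "mammal",   # illustrative addition
}


def ancestors(word):
    """Walk upward from a word and list every broader category."""
    chain = []
    while word in PARENT:
        word = PARENT[word]
        chain.append(word)
    return chain


def is_a(word, category):
    return category in ancestors(word)


print(ancestors("monkey"))
print(is_a("cat", "mammal"))
print(is_a("cat", "primate"))
```

Attaching many images to each entry in such a table is what turns a hierarchy of words into a hierarchy of labeled pictures.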
After studying and learning about WordNet, Li met with Professor Christiane Fellbaum, who worked with Miller on WordNet. She gave Li the idea to add an image and associate it to each word, creating a new hierarchical dataset based on images instead of words. Li expanded on the idea—instead of adding one image per word, she added many images per word.
As an assistant professor at Princeton, she built a team to tackle the ImageNet project. Li’s first idea was to hire students to find images and add them to ImageNet manually. But she realized that it would become too expensive and take too much time for them to finish the project. From her estimates, it would take a century to complete the work, so she changed strategies. Instead, she decided to get the images from the internet. She could write algorithms to find the pictures, and humans would choose the correct ones. After months working on this idea, she found that the problem with this strategy was that the images chosen were constrained to the algorithms that picked the images. Unexpectedly, the solution came when Li was talking to one of her graduate students, who mentioned a service that allows humans anywhere in the world to complete small online tasks very cheaply. With Amazon Mechanical Turk, she found a way to scale and have thousands of people find the right images for not too much money.
Amazon Mechanical Turk was the solution, but a problem still existed. Not all the workers spoke English as their first language, so there were issues with specific images and the words associated with them. Some words were harder for these remote workers to identify. Not only that, but there were words like “babuin” that confused workers—they did not exactly know which images represented the word. So, her team created a simple algorithm to figure out how many people had to look at each image for a given word. Words that were more complex like “babuin” required more people to check images, and simpler words like “cat” needed only a few people.
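The text does not describe the team's algorithm in detail, so the following is only one plausible sketch of the idea, with made-up thresholds and a hypothetical function name. It keeps asking workers whether an image matches a word until a clear majority agrees, so easy words stop early and confusing words collect more votes.

```python
def collect_until_consensus(votes, min_votes=3, min_agreement=0.75):
    """Consume yes/no votes until enough workers agree.

    Returns (decision, votes_used). The decision is None if the votes
    run out before the agreement threshold is reached.
    """
    yes = 0
    total = 0
    for vote in votes:
        total += 1
        if vote:
            yes += 1
        majority = max(yes, total - yes)
        if total >= min_votes and majority / total >= min_agreement:
            return yes > total - yes, total
    return None, total


# Hypothetical worker answers to "does this image show the word?"
cat_votes = [True, True, True]
babuin_votes = [True, False, True, True, True, True]
unclear_votes = [True, False, True, False]

print(collect_until_consensus(cat_votes))
print(collect_until_consensus(babuin_votes))
print(collect_until_consensus(unclear_votes))
```

In this toy run the image for "cat" is settled after three votes, while the image for "babuin" needs a fourth because one worker disagreed.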
With Mechanical Turk, creating ImageNet took less than three years, much less than the initial estimate with only undergraduates. The resulting dataset had around 3 million images separated into about 5,000 “words.” People were not impressed with her paper or dataset, however, because they did not believe that more and more refined data led to better algorithms. But most of these researchers’ opinions were about to change.
To prove her point, Li had to show that her dataset led to better algorithms. To achieve that, she had the idea of creating a challenge based on the dataset to show that the algorithms using it would perform better overall. That is, she had to make others train their algorithms with her dataset to show that they could indeed perform better than models that did not use her dataset.
The same year she published the paper in CVPR, she contacted a researcher named Alex Berg and suggested that they work together to publish papers to show that algorithms using the dataset could figure out whether images contained particular objects or animals and where they were located in the picture. In 2010 and 2011, they published five papers using ImageNet.* The first became the benchmark of how algorithms would perform on these images. To make it the benchmark for other algorithms, Li reached out to the team supporting one of the most well-known image recognition datasets and benchmark standards, PASCAL VOC. They agreed to work with Li and added ImageNet as a benchmark for their competition. The competition used a dataset called PASCAL that only had 20 classes of images. By comparison, ImageNet had around 5,000 classes.
As Li predicted, the algorithms that were trained using the ImageNet dataset performed better and better as the competition continued. Researchers learned that algorithms started performing better for other datasets when the models were first trained using ImageNet and then fine-tuned for another task. A detailed discussion on how this worked for skin cancer is in a later section.
A major breakthrough occurred in 2012. A pioneer of deep learning, Geoffrey Hinton, together with Ilya Sutskever and Alex Krizhevsky, submitted a deep convolutional neural network architecture called AlexNet, still used in research to this day, “which beat the field by a whopping 10.8 percentage point margin.”* That marked the beginning of deep learning’s boom, which would not have happened without ImageNet.
ImageNet became the go-to dataset for the deep learning revolution and, more specifically, that of the convolutional neural networks (CNNs) led by Hinton. ImageNet not only led the deep learning revolution but also set a precedent for other datasets. Since its creation, dozens of new datasets have been introduced with more abundant data and more precise classification. Now, they allow researchers to create better models. Not only that, but research labs have focused on releasing and maintaining new datasets for other fields like the translation of texts and medical data.
Figure: Inception Module included in GoogLeNet.
In 2014, Google introduced a new convolutional neural network called Inception, or GoogLeNet.* It used far fewer parameters than other top performing neural networks, but it performed better. Instead of applying one filter size per layer, Google added an Inception Module, which runs several filters of different sizes in parallel and combines their outputs. It showed once again that the architecture of neural networks is important.
Figure: ImageNet Top-5 accuracy over time. Top-5 accuracy asks whether the correct label is among the classifier’s five highest-ranked predictions.
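The metric is simple to compute once a classifier has scored every label. The sketch below assumes each prediction is a dictionary from label to score, which is a representation chosen for readability and not the format used by the ImageNet challenge tooling. It uses three labels and k set to 1 or 2 so the example stays small, and the same function works with k set to 5.

```python
def top_k_accuracy(scores, true_labels, k=5):
    """Fraction of examples whose true label is among the k highest scores."""
    hits = 0
    for class_scores, true_label in zip(scores, true_labels):
        # Sort labels from highest to lowest score and keep the first k.
        top_k = sorted(class_scores, key=class_scores.get, reverse=True)[:k]
        hits += true_label in top_k
    return hits / len(true_labels)


# Hypothetical classifier outputs for three images.
scores = [
    {"cat": 0.6, "dog": 0.3, "monkey": 0.1},
    {"cat": 0.5, "dog": 0.1, "monkey": 0.4},
    {"cat": 0.2, "dog": 0.7, "monkey": 0.1},
]
true_labels = ["cat", "monkey", "cat"]

top_1 = top_k_accuracy(scores, true_labels, k=1)
top_2 = top_k_accuracy(scores, true_labels, k=2)
print(top_1, top_2)
```

Only the first image is right on the first guess, but every image has the correct label within its two best guesses, which is why top-k scores are always at least as high as top-1 scores.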
ImageNet is considered solved, reaching an error rate lower than the average human and achieving superhuman performance for figuring out if an image contains an object and what kind of object that is. After nearly a decade, the competition to train and test models on ImageNet came to an end. Li tried to remove the dataset from the internet, but big companies like Facebook pushed back since they used it as their benchmark.
But since the ending of the ImageNet competition, many other datasets have been created, often by technology companies, based on the millions of images, voice clips, and text snippets entered and shared on their platforms every day. People sometimes take for granted that these datasets, which are intensive to collect, assemble, and vet, are free. Being open and free to use was an original tenet of ImageNet that will outlive the challenge and likely even the dataset. “One thing ImageNet changed in the field of AI is suddenly people realized the thankless work of making a dataset was at the core of AI research,” Li said. “People really recognize the importance the dataset is front and center in the research as much as algorithms.”
“Arguing that you don’t care about the right to privacy because you have nothing to hide is no different than saying you don’t care about free speech because you have nothing to say.” Edward Snowden*
In 2014, Tim received a request on his Facebook app to take a personality quiz called “This Is Your Digital Life.” He was offered a small amount of money and had to answer just a few questions about his personality. Tim was very excited to get money for this seemingly easy and harmless task, so he quickly accepted the invitation. Within five minutes of receiving the request on his phone, Tim logged in to the app, giving the company in charge of the quiz access to his public profile and all his friends’ public profiles. He completed the quiz within 10 minutes. A UK research facility collected the data, and Tim continued with his mundane day as a law clerk in one of the biggest law firms in Pennsylvania.
What Tim did not know was that he had just shared his and all of his friends’ data with Cambridge Analytica. This company used Tim’s data and data from 50 million other people to target political ads based on their psychographic profiles. Unlike demographic information such as age, income, and gender, psychographic profiles explain why people make purchases. The use of personal data on such a scale made this scheme, which Tim passively participated in, one of the biggest political scandals to date.
Data has become an essential part of deep learning algorithms.* Large corporations now store a lot of data from their users because that has become such a central part of building better models for their algorithms and, in turn, improving their products. For Google, it is essential to have users’ data in order to develop the best search algorithms. But as companies gather and keep all this data, it becomes a liability for them. If a person has pictures on their phone that they do not want anyone else to see, and if Apple or Google collects those pictures, their employees could have access to them and abuse the data. Even if these companies protect against their own employees having access to the data, a privacy breach could occur, allowing hackers access to people’s private data.
Hacks resulting in users’ data being released are very common. Every year, it seems the number of people affected by a given hack increases. One Yahoo hack compromised 3 billion people’s accounts.* So, all the data these companies have about their users becomes a burden. At other times, data is given to researchers, expecting the best of their intentions. But researchers are not always sensitive when handling data. That was the case with the Cambridge Analytica scandal. In that instance, Facebook provided researchers access to information about users and their friends that was mainly in their public profiles, including people’s names, birthdays, and interests.* Cambridge Analytica, a private company, then used the data and sold it to political campaigns to target people with personalized ads based on their information.
Differential privacy is a way of obtaining statistics from a pool of data from many people without revealing what data each person provided.
Keeping sensitive data or giving data directly to researchers for creating better algorithms is dangerous. Personal data should be private and stay that way. As far back as 2006, researchers at Microsoft were concerned about users’ data privacy and created a breakthrough technique called differential privacy, but the company never used it in its products. Ten years later, Apple released products on the iPhone using this same method.
Figure: How differential privacy works.
Apple implements one of the most private versions of differential privacy, called the local model. It adds noise to the data directly on the user’s device before sending it to Apple’s servers. In that way, Apple never touches the user’s true data, preventing anyone other than the user from having access to it. Researchers can analyze trends of people’s data but are never able to access the details.*
Differential privacy does not merely try to make users’ data anonymous. It allows companies to collect data from large datasets, with a mathematical proof that no one can learn about a single individual.*
Imagine that Apple wanted to collect the average height of its users. Anna is 5 feet 6 inches, Bob is 5 feet 8 inches, and Clark is 5 feet 5 inches. Instead of collecting the height individually from each user, Apple collects the height plus or minus a random number. So, it would collect 5 feet 6 inches plus 1 inch for Anna, 5 feet 8 inches plus 2 inches for Bob, and 5 feet 5 inches minus 3 inches for Clark, which equals 5 feet 7 inches, 5 feet 10 inches, and 5 feet 2 inches, respectively. Apple averages these heights without the names of the users.
The average height of its users would be the same before and after adding the noise: about 5 feet 6 inches. In this small example the noise cancels exactly; with real random noise, it cancels only approximately, and the estimate gets better as more users contribute. But Apple would not be collecting anyone’s actual height, and their individual information remains secret. That allows Apple and other companies to create smart models without collecting personal information from their users, thus protecting their privacy. The same technique could produce models about images on people’s phones and any other information.
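The height example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the large population is simulated, and plain Gaussian noise stands in for the carefully calibrated mechanisms that a real differential privacy system would use, so the sketch carries no formal privacy guarantee.

```python
import random

def noisy_report(true_value, noise_scale, rng):
    """Return the value a device would send: the true value plus zero-mean noise."""
    return true_value + rng.gauss(0.0, noise_scale)

def mean(values):
    return sum(values) / len(values)

# The three-person example from the text, in inches, with its hand-picked noise.
true_heights = [66, 68, 65]        # Anna, Bob, Clark
hand_picked_noise = [1, 2, -3]     # sums to zero, so the average is unchanged
reported = [h + n for h, n in zip(true_heights, hand_picked_noise)]
example_true_mean = mean(true_heights)
example_reported_mean = mean(reported)

# With truly random noise the cancellation is only approximate and relies on
# having many users. This population is simulated, not real data.
rng = random.Random(0)
population = [rng.gauss(67.0, 3.0) for _ in range(100_000)]
reports = [noisy_report(h, 5.0, rng) for h in population]
population_mean = mean(population)
reports_mean = mean(reports)

print(f"text example: {example_true_mean:.2f} vs {example_reported_mean:.2f} inches")
print(f"simulated:    {population_mean:.2f} vs {reports_mean:.2f} inches")
```

Each individual report can be off by several inches, yet the two averages stay close because the noise has a mean of zero. The server only ever sees the noisy reports.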
Differential privacy, or keeping users’ data private, is much different from anonymization. Anonymization does not guarantee that information the user has, like a picture, is not leaked or that the individual cannot be traced back from the data. One example is to send a pseudonym of a person’s name but still transmit their height. Anonymization tends to fail. In 2006, Netflix released about 100 million movie ratings from its viewers in order for researchers to create a better recommendation algorithm. They only published ratings, removing all identifying details.* Researchers, however, matched this data with public data on the Internet Movie Database (IMDb).* After matching patterns of ratings, they added the names back to the original anonymous data. That is why differential privacy is essential. It is used to prevent users’ data from being leaked in any possible way.
Figure: Emoji usage across different languages.
This figure shows the usage percentage of each emoji over the total usage of emojis for English- and French-speaking countries. The data was collected using differential privacy. The distribution of the usage of emojis in English-speaking countries differs from that of French-speaking nations. That might reveal underlying cultural differences that translate to how each culture uses language. In this case, how frequently they use each emoji is interesting.
Apple started using differential privacy to improve its predictive keyboard,* the Spotlight search, and the Photos app. It was able to advance these products without obtaining any specific user’s data. For Apple, privacy is a core principle. Tim Cook, Apple’s CEO, has time and time again called for better data privacy regulation.* The same data and algorithms that can be used to enhance people’s lives can be used as a weapon by bad actors.
Apple’s predictive keyboard, built with data collected using differential privacy, helps users by suggesting the next word in the text based on its models. Apple has also been able to create models for what is inside people’s pictures on their iPhones without having actual users’ data. It is possible for users to search for specific items like “mountains,” “chairs,” and “cars” in their pictures. All of that is served by models developed using differential privacy. Apple is not the only one using differential privacy in its products. In 2014, Google released a system for its Chrome web browser to figure out users’ preferences without invading their privacy.
But Google has also been working with other technologies to produce better models while continuing to keep users’ data private.
Google developed another technique called federated learning.* Instead of collecting statistics on users, Google developed an in-house model and then deployed it to each of the users’ computers, phones, and applications. Then, the model is trained based on the data generated by the user or that is already present.
For example, if Google wants to create a neural network to identify objects in pictures and has a model of how “cats” look but not how “dogs” look, then the neural network is sent to a user’s phone that contains many pictures of dogs. From that, it learns what dogs look like, updating its weights. Then, it summarizes all of the changes in the model as a small, focused update. The update is sent to the cloud, where it is averaged with other users’ updates to improve the shared model. Everyone’s data advances the model.
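The round trip described above (send the model out, train locally, send back only an update, average the updates) can be sketched as follows. This is a toy under stated assumptions: a two-weight linear model stands in for the neural network, the three devices and their data are invented for illustration, and the function names are hypothetical rather than part of any Google library.

```python
def predict(weights, features):
    return sum(w * x for w, x in zip(weights, features))

def local_update(shared_weights, local_data, learning_rate=0.1, epochs=5):
    """Train a copy of the shared model on one device and return only the change."""
    weights = list(shared_weights)
    for _ in range(epochs):
        for features, target in local_data:
            error = predict(weights, features) - target
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
    # The raw data never leaves this function; only the weight delta does.
    return [w - s for w, s in zip(weights, shared_weights)]

def federated_round(shared_weights, devices):
    """Average the updates from all devices into the shared model."""
    updates = [local_update(shared_weights, data) for data in devices]
    average = [sum(u[i] for u in updates) / len(updates)
               for i in range(len(shared_weights))]
    return [s + a for s, a in zip(shared_weights, average)]

# Hypothetical private data on three devices, all generated by y = 2*x1 - 1*x2.
devices = [
    [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0)],
    [((1.0, 1.0), 1.0), ((1.0, -1.0), 3.0)],
    [((0.5, 0.5), 0.5), ((1.0, 0.0), 2.0)],
]

shared = [0.0, 0.0]
for _ in range(30):
    shared = federated_round(shared, devices)
print(shared)  # approaches [2.0, -1.0] although no device shared its data
```

The server in this sketch only ever handles weight deltas, which is the property the text emphasizes.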
Federated learning* works without the need to store user data in the cloud, but Google is not stopping there. They have developed a secure aggregation protocol that uses cryptographic techniques so that they can only decrypt the average update if hundreds or thousands of users have participated.* That guarantees that no individual phone’s update can be inspected before averaging it with other users’ data, thus guarding people’s privacy. Google already uses this technique in some of its products, including the Google keyboard that predicts what users will type. The product is well-suited for this method since users type a lot of sensitive information into their phone. The technique keeps that data private.
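One way to picture how an average can be computed without inspecting any single update is pairwise masking: every pair of users agrees on a random number that one adds and the other subtracts, so the masks cancel only in the sum. The sketch below shows just that cancellation idea. It is an assumption-laden illustration, not Google’s protocol, which also has to handle key exchange and users who drop out; the names and the integer updates are hypothetical.

```python
import random

MODULUS = 2 ** 32  # all arithmetic wraps around, so masked values look random

def mask_updates(updates, rng):
    """Add a shared random mask to one user of each pair and subtract it from the other."""
    masked = list(updates)
    for i in range(len(updates)):
        for j in range(i + 1, len(updates)):
            pair_mask = rng.randrange(MODULUS)
            masked[i] = (masked[i] + pair_mask) % MODULUS
            masked[j] = (masked[j] - pair_mask) % MODULUS
    return masked

def aggregate(masked_updates):
    """The server sums what it receives; the pairwise masks cancel in the total."""
    return sum(masked_updates) % MODULUS

rng = random.Random(0)
updates = [12, 7, 30, 5]              # hypothetical per-user updates, as integers
masked = mask_updates(updates, rng)   # what the server actually sees
total = aggregate(masked)
average = total / len(updates)
print(masked)
print(total, average)
```

Each masked value on its own reveals nothing useful about the user’s update, but the total equals the sum of the true updates.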
This field is relatively new, but it is clear that these companies do not need to keep users’ data to create better and more refined deep learning algorithms. In the years to come, more hacks will happen, and users’ data that has been stored to improve these models will be shared with hackers and other parties. But that does not need to be the norm. Privacy does not necessarily need to be traded to get better machine learning models. Both can co-exist.
What’s not fully realized is that Moore’s Law was not the first but the fifth paradigm to bring exponential growth to computers. We had electromechanical calculators, relay-based computers, vacuum tubes, and transistors. Every time one paradigm ran out of steam, another took over.
Ray Kurzweil*
The power of deep learning depends on the design as well as the training of the underlying neural networks. In recent years, neural networks have become complicated, often containing hundreds of layers. This imposes higher computational requirements, causing an investment boom in new microprocessors specialized for this field. The industry leader Nvidia earns at least $600M per quarter for selling its processors to data centers and companies like Amazon, Facebook, and Microsoft.
Facebook alone runs convolutional neural networks at least 2 billion times each day. That is just one example of how intensive the computing needs are for these processors. Tesla cars with Autopilot enabled also need enough computational power to run their software. To do so, Tesla cars need a super processor: a graphics processing unit (GPU).
Most of the computers that people use today, including smartphones, contain a central processing unit (CPU). This is the part of the machine where all the computation happens, that is, where the brain of the computer resides. A GPU is similar to a CPU because it is also made of electronic circuits, but it specializes in accelerating the creation of images in video games and other applications. But the same operations that games need in order to appear on people’s screens are also used to train neural networks and run them in the real world. So, GPUs are much more efficient for these tasks than CPUs. Because most of the computation needed is in the form of neural networks, Tesla added GPUs to its cars so that they can drive themselves through the streets.
Nvidia, a company started by Taiwanese immigrant Jensen Huang,* produces most of the GPUs that companies use, including Tesla, Mercedes, and Audi.* Tesla uses the Nvidia Drive PX2, which is designed for self-driving cars.* The Nvidia Drive processor has a specialized instruction set that accelerates neural networks’ performance at runtime and can compute 8 TFLOPS, meaning 8 trillion floating-point math operations per second.
TFLOPS (trillion floating-point operations per second) is a unit for measuring the performance of chips. It is used to compare how much power different chips have for processing neural networks.
Booming demand for Nvidia’s products has supercharged the company’s growth.* From January 2016 to August 2021, the stock soared from $7 to $220.* Most of the money that Nvidia makes today comes from the gaming industry, but even though auto applications are a new field for the company, they already represent $576M annually or 2% of its revenue. And the self-driving industry is just beginning.*
Video games were the flywheel, or the killer app as it’s called in Silicon Valley, for the company. They have an incredibly high potential sales volume and at the same time represent one of the most computationally challenging problems. Video games helped Nvidia enter the market of GPUs, funding R&D for making more powerful processors.
The amount of computation that GPUs, like CPUs, can handle has followed an exponential curve over the years. Moore’s Law is the observation that the number of transistors on a chip, the basic element of a CPU, doubles roughly every two years.*
Gordon Moore, co-founder of Intel, one of the most important companies developing microprocessors, formulated this observation. As a result, the computational power of CPUs has increased exponentially. The number of operations per second that GPUs can process, measured in TFLOPS, has followed the same exponential curve.
But even with the growing capacity of GPUs, there was a need for more specialized hardware developed specifically for deep learning. As deep learning became more and more widely used, the demand for processing units tailored for the technique outgrew what GPUs could provide. So, large corporations started developing equipment specifically designed for deep learning. And, Google was one of those companies. When Google concluded that it needed twice as many CPUs as they had in their data centers to support their deep learning models for speech recognition, it created a group internally to develop hardware intended to process neural networks more efficiently. To deploy the company’s models, it needed to develop a specialized processor.
In its quest to make a more efficient processor for neural networks, Google developed what is called a tensor processing unit (TPU). The name comes from the fact that the software uses the TensorFlow framework, which we discussed previously. The calculations that TPUs handle, such as multiplication and other linear algebra operations, do not need as much mathematical precision as the video processing that GPUs do, which means that TPUs need fewer resources and can do many more calculations per second.
Google released its first TPU in 2016. This version of their deep learning processor was solely targeted for inference, meaning it only focused on running networks that had already been trained. Inference works in such a way that if there is already a trained model, then that model can run on a single chip. But to train a model, you need multiple chips to get a fast turnaround, which keeps your programmers from waiting a long time to see if it works.
That is a much harder problem to solve because you need to interconnect the chips and ensure that they are in sync and communicating the appropriate messages. So, Google decided to release a second version of TPUs a year later with the added feature that developers could train their models on these chips. And a year later, Google released its third generation of TPUs that could process eight times more than the previous version and had liquid cooling to address their intense use of power.*
To have an idea of how powerful these processing chips are, a single second-generation TPU can run around 120 TFLOPS, or 200 times the calculations of a single iPhone.* Companies are battling to produce hardware that can perform the fastest processing for neural networks. After Google announced its second-generation TPUs, Nvidia announced its newest GPU, called the Nvidia Volta, which delivers around 100 TFLOPS.*
But still, TPUs are around 15 to 30 times faster than GPUs, allowing developers to train their models much faster than with the old processors. Not only that, but TPUs are much more energy-efficient compared to GPUs, allowing Google to save a lot of money on electricity. Google is investing heavily in deep learning and related compilers, the software that translates human-readable code into machine-readable code. That means it needs improvements in the physical (hardware) and digital (software) space. This research and development field is so big that Google has entire divisions dedicated to making improvements in different parts of its development pipeline.
Google is not the only giant working on its own specialized hardware for deep learning. The iPhone 12’s processor, the A14 Bionic chip,* also contains a specialized unit for neural networks. This little electronic unit can process up to 0.6 TFLOPS, or 600 billion floating-point operations per second. Some of that processing power is used for facial recognition when unlocking the phone, powering FaceID. Tesla has also developed its own processing chips to run its neural networks, improving its self-driving car software.* The latest chip Tesla developed and released can process up to 36 TFLOPS.*
The size of neural networks has been growing, and thus the processing power required to create and run the models has also increased. OpenAI released a study that showed that the amount of compute used in the largest AI training runs has been increasing exponentially, doubling every 3.5 months. And, they expect that the same growth will continue over the next five years. From 2012 to 2018, the amount of compute used to train these models increased 300,000x.*
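The two figures in this paragraph can be checked against each other with a little arithmetic. The sketch below uses only the numbers quoted above (a 300,000x increase and a 3.5-month doubling time), plus the two-year doubling of Moore’s Law for comparison. The function names are my own.

```python
import math

def doublings_needed(growth_factor):
    """How many times a quantity must double to grow by growth_factor."""
    return math.log2(growth_factor)

def months_for_growth(growth_factor, doubling_months):
    return doublings_needed(growth_factor) * doubling_months

doublings = doublings_needed(300_000)        # a little over 18 doublings
months = months_for_growth(300_000, 3.5)     # a little over 5 years
years = months / 12

# Growth over the same span if compute doubled only every 24 months.
moore_growth = 2 ** (months / 24)

print(f"{doublings:.1f} doublings over about {years:.1f} years")
print(f"a two-year doubling time would give only about {moore_growth:.1f}x")
```

A 3.5-month doubling time reaches 300,000x in a bit more than five years, which fits inside the 2012 to 2018 window. A two-year doubling time over the same span would yield only single-digit growth.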
Figure: The amount of compute in petaflop/s-day used to train the largest neural networks. A petaflop/s is a computing speed of 10^15 floating-point operations per second, and a petaflop/s-day represents that rate sustained over a day, or about 10^20 operations.
This growth has parallels in the biological world wherein there is a clear correlation between the cognitive capacity of animals and the number of pallial or cortical neurons. It should follow that the number of neurons of an artificial neural network simulating animals’ brains should affect the performance of these models.
As time passes and the amount of compute used for training deep learning models increases, more and more companies will develop specialized chips to handle the processing, and an increasing number of applications will use deep learning to achieve all types of tasks.
To understand the development of AI algorithms and the way they improve as they learn over time, it is really important to take a step back from artificial intelligence systems and focus on how brains function. As it turns out, AI systems work much the same way as human brains. So, I must first explain, at least at a high level, how animal, and specifically human, brains work.
The most important piece is the theory of Learning to Learn, which describes how the brain learns the technique to learn new topics. The human brain learns and encodes information during sleep or at least in restful awake moments—converting short-term memory to long-term—through hippocampal, visual cortex, and amygdala replay. The brain also uses the same circuitry that decodes the information stored in the hippocampus, visual cortex, and amygdala to predict the future. Again, much like the human brain, AI systems decode previous information to create future scenes, like what may happen next in a video.
The truth is that a human is just a brief algorithm, 10,247 lines. They are deceptively simple. Once you know them, their behavior is quite predictable.
Westworld, season two finale (2018)
We humans have long deemed ourselves the pinnacle of cognitive ability among animals. Something unique about our brains makes us able to question our existence and, at the same time, believe that we are king of the animal kingdom. We build roads, the internet, and even spaceships, and we are at the top of the food chain, so our brains must have something that no other brain has.* Our cognitive abilities allow us to stay at the top even though we are not the fastest, strongest, or largest animals.
The human brain is special, but sheer mass is not the reason why humans have greater cognitive ability than other animals. If that were the case, then elephants would be at the top of the pyramid because of their larger brains. But not all brains are the same.* Primates have a clear advantage over other mammals. Evolution resulted in an economical way in which neurons are added to their brains without the massive increase in average cell sizes seen in other animals.
Primates also have another advantage over other mammals in the ability to use complex tools. Humans aren’t the only primates who can do this: chimpanzees, for example, use sprigs to perform many tasks, from scratching their backs to digging for termites. Tool use isn’t restricted to primates, either. Crows also use sticks to extract prey from their hiding spaces. They can even make their sticks into better tools, for example, by carving a hook at the end of a twig to better reach their prey.*
Other animals also have cognitive abilities similar to those of humans. Chimpanzees and gorillas, which cannot produce speech for anatomical reasons, can learn to communicate with sign language. A chimpanzee in Japan named Ai (meaning “love” in Japanese) plays games on a computer better than the average human.* With her extensive chimpanzee research, Jane Goodall showed that chimpanzees can understand the mental states of other chimpanzees and of humans and deceive others based on their behavior.* Even birds seem to know other individuals’ mental states. For example, magpies fetch food in the presence of onlookers and then move it to a secret location as soon as the onlookers are gone. Birds can also learn language. Alex,* an African gray parrot owned by psychologist Irene Pepperberg,* learned to produce words that symbolize objects.* Chimpanzees, elephants,* dolphins,* and even magpies* appear to recognize themselves in the mirror.*
So, what makes humans smarter than chimpanzees that are, in turn, smarter than elephants? Professor Suzana Herculano-Houzel’s research showed that the number of neurons in the mammalian cerebral cortex and the bird pallium has a high correlation with their cognitive capability.*
The cerebral cortex and bird pallium are the outermost parts of the brain and more evolutionarily advanced than other brain regions. The more neurons in these specific regions, regardless of brain or body size, the better a species performs at the same task. For example, birds have a large number of neurons packed into their brains compared to mammals, even though their brains are smaller.
Not only that, but the size of the neocortex, the largest and most modern part of the cortex, is also a constraint for group size in animals, meaning in social relationships.
Robin Dunbar suggests that a cognitive limit exists for the number of people you can maintain a relationship with. His work led to what is called Dunbar’s number, and he posits that the answer is 150 based on the size of the human brain and the number of cortical neurons.*
Figure: Animals’ cognitive ability and the respective number of cortical and pallial neurons in their brains.* This image shows that there is a clear correlation between cognitive performance and the number of cortical or pallial neurons. The performance % on the y axis is the completion rate of a simple task.
There is a simple answer for how our brains can be at the same time similar to others in their evolutionary constraints and yet so advanced as to create language and develop tools as complex as ours. Being primates bestows upon humans the advantage of a large number of neurons packed into a small cerebral cortex.*
What do animal brains have to do with AI systems and humans? First, the cognitive capacity of some animals suggests that we are not as unique as some think. While some argue that there are certain capabilities that only humans can perform, they have been proven wrong time and again. Second, the correlation of cognitive ability and the number of neurons might be an indication that neural networks will perform better as the number of artificial neurons increases. These artificial neural networks, of course, need the correct data and right type of software, as discussed in the previous section.
While the number of neurons affects animals’ cognitive ability, their brains have many more neurons than most deep learning models. Today’s neural networks have around 1 million neurons, about the same number as a honeybee. It might not be a coincidence that neural networks perform better at different tasks as they increase in size. As they approach the number of neurons in a human brain, around 100 billion neurons, it could be that they will perform all human tasks with the same capability.
A clear correlation exists between the cognitive capacity of animals and the number of pallial or cortical neurons. Therefore, it follows that the number of neurons in an artificial neural network should affect the performance of these models since neural networks were designed based on how neurons interact with each other.
A neural network can represent any kind of program, and neural networks that have a larger number of neurons and layers can represent more complex programs. Because more complex problems require more complicated programs, larger neural networks are the solution. As machine learning evolved to make more efficient algorithms, neural networks needed more layers and neurons. But with that advancement came the problem of figuring out the weights of all these neurons.
With 1,000 connections, at least 2^1000 configurations are possible, assuming that each weight can be either 0 or 1. Since the weights are usually real numbers between 0 and 1, the number of configurations is infinite. So, figuring out the weights became intractable, but backpropagation solved this problem. That technique helped researchers determine the weights by changing them on the last layer first, and then going down the layers until reaching the first one. This made the problem more tractable and allowed developers and researchers to use multilayer neural networks for different algorithms. By the way, this work was conducted independently from research in neuroscience.
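To make the "last layer first" idea concrete, here is a minimal sketch of backpropagation on a very small network. Everything specific in it is an assumption chosen for illustration: two inputs, two hidden neurons, one output, sigmoid activations, a squared-error loss, and the logical OR function as the task.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """Each weight row ends with a bias term."""
    hidden = [sigmoid(row[0] * x[0] + row[1] * x[1] + row[2]) for row in w_hidden]
    out = sigmoid(sum(w * h for w, h in zip(w_out[:-1], hidden)) + w_out[-1])
    return hidden, out

def mean_squared_error(data, w_hidden, w_out):
    return sum((forward(x, w_hidden, w_out)[1] - y) ** 2 for x, y in data) / len(data)

def train_epoch(data, w_hidden, w_out, lr):
    for x, y in data:
        hidden, out = forward(x, w_hidden, w_out)
        # Start at the last layer: the error signal at the output neuron.
        delta_out = (out - y) * out * (1 - out)
        # Then move down a layer, passing the error back through the output weights.
        delta_hidden = [delta_out * w_out[j] * hidden[j] * (1 - hidden[j])
                        for j in range(len(hidden))]
        for j in range(len(hidden)):
            w_out[j] -= lr * delta_out * hidden[j]
        w_out[-1] -= lr * delta_out
        for j, row in enumerate(w_hidden):
            row[0] -= lr * delta_hidden[j] * x[0]
            row[1] -= lr * delta_hidden[j] * x[1]
            row[2] -= lr * delta_hidden[j]

rng = random.Random(0)
w_hidden = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w_out = [rng.uniform(-1, 1) for _ in range(3)]
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]  # logical OR

initial_loss = mean_squared_error(data, w_hidden, w_out)
for _ in range(2000):
    train_epoch(data, w_hidden, w_out, lr=0.5)
final_loss = mean_squared_error(data, w_hidden, w_out)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

Instead of searching through every possible configuration of weights, each training step nudges all nine weights in the direction that reduces the error, which is what makes the problem tractable.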
Years of research suggest that something like the backpropagation technique used in computer science also happens in the brain. Neuroscientists have models indicating that the human brain could employ a similar method for learning, performing a learning algorithm much like the one researchers created to update their artificial neural networks. Short pulses of dopamine* are released onto many dendrites, driving synaptic learning in the human brain. These pulses signal a prediction error, that is, a failure to predict what was expected. In deep learning, backpropagation works by updating the neural network weights based on the prediction error of the model’s output compared to the expected output. Both the brain and artificial neural networks use these errors to update the weights or synapses. Research on the brain and in computer science seems to converge. It is as if mechanical engineers developed airplanes only to find out that birds use the same technique. In this case, computer scientists developed artificial neural networks that demonstrate how brains work.
Human brains* and AI algorithms developed separately and over time, but they still perform in similar ways. It might not be a coincidence that billions of years of evolution led to better-performing algorithms as well as improved techniques to learn and interact with the environment. Therefore, it is valuable to understand how the brain operates and compare it to the software that computer scientists develop.
The algorithms that are winning in games like Go or Dota 2 use reinforcement learning to train multilayer neural networks. The animal brain also uses reinforcement learning via dopamine. But research shows that the human brain performs two types of reinforcement learning on top of each other. This new theory implements a technique called Learning to Learn, also called meta-reinforcement learning, which may benefit machine learning algorithms.
Dopamine is the neurotransmitter associated with the feeling of desire and motivation.
Neurons release dopamine when a reward for an action is surprising. For example, when a dog receives a treat unexpectedly, dopamine is released in the brain. The reverse is also true. When the brain predicts a reward and the animal does not get it, then a dip in dopamine occurs. Simply put, dopamine serves as a way for the brain to learn through reinforcement learning.
Scientists describe these dopamine fluctuations as signaling a reward prediction error. There is a burst of dopamine when things are better than expected and a dip when things are worse. Dozens of studies show that the burst of dopamine, when it reaches the striatum, adjusts the strength of synaptic connections. How does that drive behavior? When you execute an action in a particular situation, if an unexpected reward occurs, then you strengthen the association between that situation and action. Intuition says that if you do something and are pleasantly surprised, then you should do that thing more often in the future. And if you do something and are unpleasantly surprised, then you should do it less often.
Inside people’s brains, the levels of dopamine increase when there is a difference between the predicted reward and the actual reward for a task. But dopamine also rises when the brain predicts that a reward is about to happen. So, it tricks people’s brains into doing work even if the reward does not come. For example, when you train a dog to do something like come to you when you blow a whistle, dopamine is what drives the synaptic change. You teach your dog to come when called by rewarding him, like giving him a treat, when he does what you want. After a while, you no longer need to reward the dog because his brain releases dopamine, expecting the reward (treat). Dopamine is part of what is known as model-free reinforcement learning.
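The burst-and-dip pattern can be written as a simple update rule: the error is the reward received minus the reward expected, and the expectation moves a fraction of the way toward what actually happened. The sketch below is a textbook-style simplification that I am assuming for illustration, not a model taken from the studies mentioned above. The learning rate and the number of trials are arbitrary.

```python
def update_expectation(expected, reward, learning_rate=0.2):
    """Return the new expectation and the reward prediction error."""
    prediction_error = reward - expected   # positive: burst, negative: dip
    return expected + learning_rate * prediction_error, prediction_error

expected = 0.0
errors = []
for _ in range(30):                        # the dog gets a treat on every trial
    expected, error = update_expectation(expected, reward=1.0)
    errors.append(error)

# The treat is a full surprise at first and barely surprising after 30 trials.
print(f"first error: {errors[0]:.3f}, last error: {errors[-1]:.3f}")

# Now the expected treat does not arrive: the error turns negative.
_, omission_error = update_expectation(expected, reward=0.0)
print(f"error when the treat is withheld: {omission_error:.3f}")
```

The first treat produces the largest error, later treats produce almost none, and a withheld treat produces a negative error, matching the burst and dip described in the text.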
But that is not the only system in people’s brains benefiting from reinforcement learning. The prefrontal cortex, the part of the cortex that is at the very front of the brain, also uses reinforcement learning rewards in its activities, or dynamics.
The prefrontal cortex together with the rest of the brain has two circuits that create what is called Learning to Learn. Model-free learning occurs via dopamine and acts on top of the circuit in the prefrontal cortex, which itself performs model-based learning.
One way to describe the difference between model-free and model-based reinforcement learning is that the latter uses a model of the task, meaning an internal representation of task contingencies. If I do this, then this will happen, or if I do that, then the other thing will happen. Model-free learning, however, does not do that. It only responds to the strengthening or weakening of stimulus-response associations. Model-free learning does not know what is going to happen next and simply reacts to what is happening now. That is why a dog can learn, with dopamine, how to come when called even if you stop giving it treats. It had no model of the event but learned that the stimulus, like whistling, is a good thing.
If the dopamine learning mechanism is model-free, then it should not reflect something called inferred value. The following experiment will help explain this concept.
A monkey looks at a central fixed point and sees targets to the left and right. If the monkey moves its eyes to a target, it is given a reward or not, depending on which side is currently being rewarded. Sometimes the left is rewarded and other times the right. These reward contingencies remain the same for a while and then reverse in a way not signaled to the animal, except by the rewards themselves. So, let’s say that the left is rewarded all the time and the right is not, but suddenly, the right is rewarded all the time and that continues for a while.
Initially, the monkey receives a reward for looking left, and its brain immediately releases dopamine. In this case, if the monkey looks right, dopamine is not released because the monkey is not going to get a reward. But at the moment of reversal, the monkey expects a reward for looking left and receives nothing. When the rewarded target changes to the right, the monkey receives a reward for looking there instead. Once the animal understands the new task, looking to the left should no longer trigger the dopamine response because the animal has experience and evidence that there has been a reversal. The target that used to excite the dopamine system now disappoints it, and the target that did not previously stimulate the dopamine system now does. The animal has experienced a stimulus-reward association, and the dopamine system adjusts to that.
But consider a different scenario. The animal was rewarded for looking left, but in the next trial, the right is the target. It has no experience with the right in this new regime. But what you find is that if the right was not rewarded before and the animal infers that the right should be rewarded, then dopamine is released. Since the monkey knows that there has been a reversal now, it can tell that the next target should be rewarded. This is a model-based inference since it draws on the knowledge of the task, and that presumed reward is called inferred value.
Given the concept of inferred value, it is possible to determine that some parts of the brain learn via model-free reinforcement learning and others via model-based reinforcement learning. The dopamine response clearly does not show inferred value because it is not based on a model of the task, but the brain still performs model-based reinforcement learning in its prefrontal cortex circuitry. The technique to show this is called a two-step task and works as follows.
Let’s say you play a game where you drive a car. The only two actions are turning left or right. If you turn left, then you die and lose the game. But if you turn right, then you continue playing the game.
If the driver plays the game again, a model-free system says, “If I turned right and did not die last time, then I should turn right again. Turning right is ‘good.’” A model-based system will understand the task at hand and will turn right when the road goes to the right and turn left when the road goes to the left. Therefore, someone who learns driving using a model-free reinforcement learning algorithm will never learn how to drive these roads properly. But a driver who learns to drive with a model-based algorithm will do just fine.
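The driving game can be sketched directly. The code below mirrors the simplified picture in the text and should be read that way: the model-free driver here keeps one value per action and ignores the road, while the model-based driver carries the rule "turn the way the road bends." The class names, rewards, and roads are hypothetical choices made for illustration.

```python
def survives(road_bend, action):
    """The driver survives a bend only by turning the way the road goes."""
    return action == road_bend

class ModelFreeDriver:
    """Repeats whichever action has paid off before, without looking at the road."""
    def __init__(self):
        self.value = {"left": 0.0, "right": 0.0}

    def choose(self, road_bend):
        return max(self.value, key=self.value.get)

    def learn(self, action, reward, learning_rate=0.5):
        self.value[action] += learning_rate * (reward - self.value[action])

class ModelBasedDriver:
    """Uses an internal model of the task: the car must follow the road."""
    def choose(self, road_bend):
        return road_bend

    def learn(self, action, reward):
        pass  # the model already captures the task, so nothing to update here

def train(driver, road):
    for bend in road:
        action = driver.choose(bend)
        reward = 1.0 if survives(bend, action) else -1.0
        driver.learn(action, reward)

def score(driver, road):
    """Count how many bends the driver would survive, without further learning."""
    return sum(survives(bend, driver.choose(bend)) for bend in road)

training_road = ["right"] * 10                  # the game from the text
test_road = ["right", "left", "right", "left"]  # a new road with both bends

model_free = ModelFreeDriver()
model_based = ModelBasedDriver()
train(model_free, training_road)
train(model_based, training_road)

model_free_score = score(model_free, test_road)
model_based_score = score(model_based, test_road)
print(model_free_score, model_based_score)
```

After training on a road that only bends right, the model-free driver keeps turning right and survives only the right-hand bends of the new road, while the model-based driver survives all of them.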
This simple task gives us a way of teasing apart model-free and model-based action selection. If you plot the behavior of the beginning of the trial, then you can show whether the system is a model-free or model-based reinforcement learning algorithm. The two-step task shows the fingerprint of the algorithm.
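The driving example can be written as a small Python sketch. It is only a cartoon of the two strategies, not the two-step task used in the studies: the names `model_free_choice` and `model_based_choice` are hypothetical, and the model-free values are held fixed after a single assumed game in which the road bent right.

```python
def model_free_choice(action_values):
    """Pick whichever action paid off most in the past, ignoring the road."""
    return max(action_values, key=action_values.get)


def model_based_choice(road_direction):
    """Use knowledge of the task: steer the way the road goes."""
    return road_direction


def survives(road_direction, action):
    """The driver stays in the game only by turning with the road."""
    return action == road_direction


# Assumed history: the first game had a right-hand bend, so turning right was "good."
action_values = {"left": 0.0, "right": 1.0}

roads = ["right", "left", "right", "left"]
model_free_results = [survives(r, model_free_choice(action_values)) for r in roads]
model_based_results = [survives(r, model_based_choice(r)) for r in roads]

print(model_free_results)   # fails whenever the road bends left
print(model_based_results)  # survives every bend
```

Plotting which choices each driver makes right after the road changes is what separates the two patterns of behavior.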
Studies with humans and even animals, including rats, that measure brain signals in the two-step task show that the prefrontal cortex presents the model-based pattern. In 2015, Nathaniel Daw demonstrated that behavior in the human prefrontal circuit via brain signals and the two-step task.* This implies that the prefrontal circuit learns from its own autonomous reinforcement learning procedure, which is distinct from the reinforcement learning algorithm used to set the neural network weights—the dopamine-based model-free reinforcement learning.
These two types of circuits work together to form what is known as Learning to Learn. Dopamine works on top of the prefrontal cortex as part of a model-free reinforcement learning system to update the circuit connections, while the prefrontal cortex circuit learns via model-based reinforcement learning.
The type of reinforcement learning implemented in the prefrontal circuit can be executed even when the synaptic weights are frozen. That means that the neural circuitry in the brain does not update the synapses’ weights to implement reinforcement learning.
It is different from the reinforcement learning algorithm accomplished by dopamine that trains the synaptic weights in the prefrontal cortex. In the prefrontal circuit, the task structure sculpts the learned reinforcement learning algorithm, which means that each task will have a different type of model-based reinforcement learning algorithm that runs in the prefrontal circuit.
In a different type of experiment, monkeys have two targets, A and B, in front of them and the reward probability between the two targets changes over time.* The monkey looks at the center point between the targets, and then it chooses to stare at one target or the other and receives a reward after a minute or so. This experiment showed that the brain has the two types of reinforcement learning algorithms working together, a model-free dopamine-based one on top of a model-based algorithm.
With that in mind, Matthew Botvinick designed a deep learning neural network that had the same characteristics as the brains of monkeys, that is, that learned to learn.
The results showed that if you train a deep learning system on this task using a reinforcement learning algorithm and without any additional assumptions, the network itself instantiated a separate reinforcement learning algorithm; that is, the network imitated what was found in the brain.*
And it is only after seeing man as his unconscious, revealed by his dreams, presents him to us that we shall understand him fully. For as Freud said to Putnam: ‘We are what we are because we have been what we have been.’ André Tridon*
It is a well-known fact that memory formation and learning are related to sleep. A rested mind is more capable of learning concepts, and the human brain does not have as detailed a memory of yesterday as it has of the present day. In this chapter, I detail how the brain learns during sleep, describing hippocampal replay, visual cortex replay, and amygdala replay. They are all mechanisms the brain uses to convert short-term memory into long-term memory, encoding the knowledge stored throughout the day. The same circuitry responsible for decoding information from the neocortex to support memory recall is also used for imagination, which indicates that the brain does not record every moment and spends time learning during the night.
In 1995, the complementary learning systems (CLS) theory was introduced,* an idea that had its roots in earlier work by David Marr.* According to this theory, learning requires two complementary systems. The first one, found in the hippocampus, allows for rapid learning of the specifics of individual items and experience. The second, located in the neocortex, serves as the basis of the gradual acquisition of structured knowledge about the environments.
The neocortex gradually acquires structured knowledge,* and the hippocampus quickly learns the particulars. The fact that bilateral damage to the hippocampus profoundly affects memory for new information but leaves language, general knowledge, and acquired cognitive skills intact supports this theory. Episodic memory, that is the memory related to collections of past personal experiences occurring at a particular time and place, is widely accepted to depend on the hippocampus.
Figure: Hippocampus location inside the human brain.
The hippocampus is responsible for spatial memory (where am I?), declarative memory (knowing what), explicit memory (recalling last night’s dinner), and recollection (retrieval of additional information about a particular item like the color of your mother’s phone).
Hippocampal replay is the process by which, during sleep or awake rest, the same cells in the hippocampus activated during an initial activity are activated during sleep in the same order, or the completely reverse order, but at a much faster speed. Hippocampal replay has been shown to have a causal role in memory consolidation.
Howard Eichenbaum and Neal J. Cohen captured this view in 1988 with their suggestion that these hippocampal neurons should be called relational cells rather than the narrower term “place cells.”*
The hippocampus is an essential part of how memories form.* When a human experiences a new situation, the information about it is encoded and registered in both the hippocampus and cortical regions. Memory is retained in the hippocampus for up to a week after the initial learning. During this stage, the hippocampus teaches the neocortex more and more about the information. This process is called the hippocampal replay. For example, during the day, a mouse is trapped in a labyrinth and learns the path to get out. That night, the hippocampus replays the same neurons that were fired in the hippocampus and encodes the spatial information into the neocortex. The next time that the mouse is in the same labyrinth, it will know where to go based on the encoded information.
In this theory, the hippocampus, where synapses change quickly, is in charge of storing memories temporarily, whereas neocortical synapses change over time. Lesions made in the hippocampus and associated structures in animals are associated with deficits in spatial working memory and a failure to recognize familiar environments. Hence, consolidation may be an active process by which new memory traces are selected and incorporated into the existing corpus of knowledge at variable rates and with differential success according to their content.
The visual cortex presents the same kind of replay and acts in synchrony with the hippocampus.* Experiments show that temporally structured replay occurs in the visual cortex and hippocampus during organized periods called frames. The multicell firing sequences evoked by awake experiences replay during these frames in both regions. Not only that, but replay events in the sensory cortex and hippocampus are coordinated to reflect the same experience.
Frightening awake rats reactivates their brain’s fear center, the amygdala, when they next go to sleep.* In 2017, scientists at New York University (NYU), György Buzsáki and Gabrielle Girardeau, demonstrated this by placing rats in a maze and then giving them an unpleasant but harmless experience such as a puff of air.* From then on, the rats feared that place. “They slowed down before the location of the air puff, then [ran] super fast away from it.” The team also recorded the activity of the amygdala cells, which showed the same pattern of firing as the hippocampus. Their amygdalae became more active when they mentally revisited the fearsome spot.* These events may happen in order to store retained information in a different, lower-level part of the brain as well as in the neocortex, which is a more evolutionarily advanced part of the brain.
Buzsáki noted that it is unclear if the rats experienced this as a dream or if the experience led to nightmares. “We can’t ask them.” He went on to say, “It has been fairly well documented that trauma leads to bad dreams. People are scared to go to sleep.”
When people have new experiences, the memory formed by them is stored in the brain in different parts of the hippocampus and other brain structures. Different areas of the brain store different parts of the memory, like the location of where the event happened and the emotions associated with it.*
For a long time, neuroscientists who studied the brain believed that when we recall memories, our brains activate the same hippocampal circuit as when the memories initially formed. But a study in 2017,* conducted by neuroscientists at MIT, showed that recalling a memory requires a detour circuit, running through a region called the subiculum, that branches off from the original memory circuit.*
“This study addresses one of the most fundamental questions in brain research—namely how episodic memories are formed and retrieved—and provides evidence for an unexpected answer: differential circuits for retrieval and formation,” says Susumu Tonegawa, the Picower Professor of Biology and Neuroscience.*
The study also has potential insights regarding Alzheimer’s and the subiculum circuit. While researchers did not specifically study the disease, they found that mice with early-stage Alzheimer’s had difficulty recalling memories although they continued to create new ones.
In 2007, a study published by Demis Hassabis showed that patients with damage to their hippocampus could not imagine themselves in new experiences.* The finding shows that there is a clear link between the constructive process of imagination and episodic memory recall. We’ll discuss that further in the next chapter.
All low-level parts of the brain—including the hippocampus, visual cortex, and amygdala—replay during sleep to encode information. That is why it is easy to remember what you had for lunch on the same day but hard to remember what you ate yesterday. Short-term memories in the lower levels stay until your brain stores them and encodes all the knowledge during sleep. The neocortex stores relevant information encoded and compacted.
Deep neural networks also serve as a way of encoding information. For example, when a deep neural network classifies an image, it encodes it into the classified objects because the image contains more bits of data than merely a tag. An apple can look a thousand different ways, but they are all called apples. Turning short-term memory into long-term memory involves compressing all the information, including visual, tactile, and any other sensory material into compact data. So, someone can say that they ate a juicy apple yesterday but not remember all of the details of how the apple looked or tasted.
Memory recall and imagination serve as a way of decoding information from the higher parts of the brain, including the neocortex, into the lower parts of the brain, including the amygdala, visual cortex, and hippocampus. Memory recall and imagination may be only decoding the information that is stored in the neocortex.
John Anderton: Why’d you catch that?
Danny Witwer: Because it was going to fall.
John Anderton: You’re certain?
Danny Witwer: Yeah.
John Anderton: But it didn’t fall. You caught it. The fact that you prevented it from happening doesn’t change the fact that it was going to happen.
—Minority Report (2002)
A study in 1981 by James McClelland and David Rumelhart at the University of California, San Diego, showed that the human brain processes information by generating a hypothesis of the input and then updating it as the brain receives data from its senses.* They demonstrated that people identify letters more accurately when the letters appear in the context of words than when they appear without that context.
In 1999, neuroscientists Rajesh Rao and Dana Ballard created a computational model of vision that replicated many well-established receptive field effects.* The paper demonstrated that there could be a generative model of a scene (top-down processing) that received feedback via error signals (how much the visual input varied from prediction), which in turn led to updating the prediction. The process of creating the generative model of the scene is called predictive coding, whereby the brain creates higher-level information and fills in the gaps of what the sensory input generates.
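The loop of predicting, measuring the error, and updating the prediction can be illustrated with a deliberately tiny Python sketch. It is not Rao and Ballard's hierarchical model: it uses a single number as the "scene," and the names `predictive_coding`, `belief`, `gain`, and `rate` are assumptions made for illustration.

```python
def predictive_coding(sensory_input, belief=0.0, gain=1.0, rate=0.2, steps=50):
    """Refine a belief until its top-down prediction matches the sensory input."""
    errors = []
    for _ in range(steps):
        prediction = gain * belief          # top-down prediction of the input
        error = sensory_input - prediction  # bottom-up error signal
        belief += rate * gain * error       # update the belief to reduce the error
        errors.append(abs(error))
    return belief, errors


belief, errors = predictive_coding(sensory_input=2.0)
print(round(belief, 3))       # the belief settles close to the input
print(errors[0], errors[-1])  # the error signal shrinks over the iterations
```

Each pass sends a prediction down, receives only the mismatch back, and nudges the higher-level estimate, so less and less error needs to be transmitted as the estimate improves.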
Figure: An example of a sentence that has flipped words. The brain uses predictive coding to correct them.
An example of predictive coding is when you read a sentence that contains a word that is reversed or contains a letter in the middle that should not be there, like in the above image. The brain erases the error, and the sentence seems correct. This happens because the brain expects that the wording is correct when it is first encountered. As our brain processes the sentence, it predicts what should be written and sends that information downstream to the lower levels of the brain. Predictive coding works not only on sentences but also in many different systems inside the brain.
Figure: Predictive coding works in the brain, predicting which images are in the blind spot in people’s eyes.
The human eye has a blind spot, which is caused by the lack of visual receptors inside the retina where the optic nerve, which transmits information to the visual cortex, is located. This blind spot does not produce an image in people’s brains, but they do not notice the gap because the human brain fills it in in the same way the brain updates an incorrect word in a sentence. The human brain expects the missing part of the image even though it is not there. The brain takes care of filling in images and correcting words subconsciously.
Figure: Demonstration of the blind spot. Close one eye and focus the other on the letter R. Place your eye a distance from the screen approximately equal to three times the distance between the R and the L. Move your eye towards or away from the screen until you notice the letter L disappear.
To demonstrate that the blind spot is present in your eyes, place your eyes a distance equivalent to three times the distance between the R and L in the figure above. Close one of your eyes and focus the other eye on the appropriate letter. If the right eye is open, focus on the R, or vice versa. Move your head closer to or farther from the screen until the other letter disappears. The letter will disappear due to the eye’s blind spot.*
Yann LeCun, the Chief Artificial Intelligence Scientist at Facebook AI Research and a pioneer of CNNs, is working on making predictive coding work in computers.*
In computer science, predictive coding is a model of neural networks that generates and updates a model of the environment, predicting what will happen next.
LeCun’s technique is called predictive learning, which alludes to the fact that it is trying to predict what is going to happen in the near future as well as fill in the gaps when information is incomplete or incorrect.* He developed the technique using generative adversarial networks to create a video of what is most likely to happen in the future. To achieve that, LeCun’s software analyzed video frames and, based on those, created the next frames of the video. The technique minimizes how different the generated frames are from the analyzed video frames, a measurement known as distance. For example, if the generated frames contain an image of a cat and the original frames do not, then the distance between the frames will be high. If they contain very similar elements, then the distance is small. Currently, the technique can predict up to the next eight frames in the future, but it is not unthinkable that machines will one day predict future outcomes better than humans.
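One simple way to picture a distance between frames is the mean squared difference between their pixels. The sketch below assumes that measure purely for illustration; the text does not say which distance LeCun's software uses, the adversarial part of the training is not shown, and the tiny 2x2 "frames" and the name `frame_distance` are hypothetical.

```python
def frame_distance(frame_a, frame_b):
    """Mean squared difference between two equally sized grayscale frames."""
    same_rows = len(frame_a) == len(frame_b)
    if not same_rows or any(len(a) != len(b) for a, b in zip(frame_a, frame_b)):
        raise ValueError("frames must have the same shape")
    total = 0.0
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += (pixel_a - pixel_b) ** 2
            count += 1
    return total / count


# Pixel brightness values between 0.0 (black) and 1.0 (white).
real_frame = [[0.0, 0.5], [1.0, 0.5]]
close_guess = [[0.0, 0.5], [0.9, 0.5]]
far_guess = [[1.0, 0.0], [0.0, 1.0]]

print(frame_distance(real_frame, close_guess))  # small: the frames are nearly alike
print(frame_distance(real_frame, far_guess))    # large: the frames disagree everywhere
```

A training procedure would adjust the generator so that this number, computed against the frames that actually followed, becomes as small as possible.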
Figure: The first frame comes from a real video, and a machine predicts the next step of the video in the second frame.
The hippocampus is not only responsible for remembering but also for planning and future thinking, that is, constructing potential scenarios. Patients with hippocampal damage have difficulty imagining the future and are unable to describe fictitious scenes. Moreover, functional magnetic resonance imaging (fMRI) indicates multiple brain areas, including the hippocampus, engaged during remembering as well as imagining events.
Research shows that reversed hippocampal replay more frequently represents novel as opposed to familiar environments. This effect, measured by coactivations of cell pairs, was more pronounced on the first day of exposure to a novel environment than on subsequent days.
Generative adversarial networks serve as a way to construct images and scenarios. In a way, however, GANs decode information. Techniques exist to generate images based on a few parameters. For example, they can generate images of a smiling woman.
Similarly, the process of remembering or imagining the future, which is done by the hippocampus, is sometimes activated by the prefrontal cortex and can be seen as decoding information with parameters. A GAN consists of two neural networks: a generator that turns a compact set of parameters into an image and a discriminator that judges whether the result looks real. An autoencoder, by comparison, pairs one network that encodes information with another that decodes it. In the same way, the human brain has two circuits that encode information from the hippocampus to the prefrontal cortex and decode information in the other direction. It would be no surprise if the same mechanism that trains GANs (and autoencoders) were at work in the human brain.
GANs could serve to simulate the real world and are already used to create reproduced images and videos. The problem is that most of the best AI systems are made for game engines. Some argue that the reason why AI systems work so well in games is that game engines are their own version of the world. That means that AI systems can practice and learn in a virtual environment.
In the real world, for example, a self-driving system cannot drive a car off a cliff thousands of times to learn. In fact, driving off a cliff even once is fatal, and a system that does so cannot work in the real world. Some say that to train an artificial intelligence system, it is necessary to train it in a simulated world. For supervised and unsupervised learning algorithms, the system must see at least 1,000 examples of what it’s trying to learn. Reinforcement learning algorithms also must practice and learn through many cases. Researchers must either create more efficient algorithms that can learn with fewer examples or reproduce many situations in which the system can acquire experience.
For games, you can use the game engine itself to train the system since all the constraints are defined there and already simulate many of these possible scenarios. So, if you design an AI agent to perform in a game, the agent can play multiple different variations that it wants to test and figure out the best move it should make in the future.
The problem with AI agents in the real world is that they are much more challenging to simulate compared to a game. No clear way exists of creating the real world and testing a few hypotheses. GANs might help solve this problem. LeCun is already using them to create future predictions of video frames. They may end up being used for more long-term predictions of the future. And, it would not be a coincidence that the brain also uses the same system for imagination and memory recall.
Humans may run simulations in their minds of possible scenarios and learn from those scenes. For example, they can imagine driving a car and the different situations that would arise based on the actions that they take. What would happen if they turn left instead of right? Some people argue that for computers to function as well as humans, they need to perform something similar. That means that they, with a few variables like turning left or right, can simulate and imagine the scenario and play it out to figure out the best action to take in the future based on that situation.
The Master created humans first as the lowest type, most easily formed. Gradually, he replaced them by robots, the next higher step, and finally he created me to take the place of the last humans. Isaac Asimov, I, Robot*
When people talk about artificial intelligence, they often think of mobile robots. But in computer science, AI is the field focused on developing the brains not only of such robots but of any computer built to achieve certain goals. These robots do not use any of the deep learning models that we talked about previously. Instead, they run hand-coded software.
In Florida, a few people watch a competition between robots to reach a specific goal while achieving the different objectives faster and more precisely than their opponents. One robot looks at a door with its sensors—cameras and lasers—to decide what to do next in order to open it. Using its robotic arm, it slowly pushes the door and goes to the other side. The team responsible for the robot cheers as it completes one of the tasks.
This story might sound like science fiction or from a distant future, but the US Defense Advanced Research Projects Agency (DARPA) organized that competition, the DARPA Robotics Challenge (DRC), in December 2013. Boston Dynamics created the robot that opened the door, Atlas, but many other robots also attempted these tasks. And for each robot, the development teams that programmed them eagerly watched.* The DRC’s goal was for robots to perform independent jobs inspired by situations dangerous to humans, like a nuclear power plant failure. The competition tested the robots’ agility, sensing, and manipulation capabilities. At first glance, the tasks seem pretty straightforward, like walking over terrain and opening doors, but they are difficult for robots to achieve. The most challenging assignment was to walk over an uneven surface because it is hard for robots to stay balanced. Most of the robots in the competition failed and did not complete many of the tasks because they malfunctioned or the job was too hard. Atlas achieved the most tasks of any of the competitors.
DARPA program manager, Gill Pratt, said of the prototype, “A 1-year-old child can barely walk, a 1-year-old child falls down a lot, this is where we are right now.”* Boston Dynamics revealed Atlas on July 11, 2013. At the first public appearance, the New York Times stated, “A striking example of how computers are beginning to grow legs and move around in the physical world,” describing the robot as “a giant—though shaky—step toward the long-anticipated age of humanoid robots.”*
Boston Dynamics has the bold goal of making robots that are better than animals in mobility, dexterity, and perception. By building machines with dynamic movement and balance, their robots can go almost anywhere, on any terrain on Earth. They also want their robots to manipulate objects, hold them steady, and walk around without dropping them. And, they are approaching their goals as time progresses. Atlas continues to improve with lighter hardware, more capabilities, and improved software.
Figure: The second version of Atlas.
Atlas was much more advanced than the first robots from the 1960s like Stanford’s Shakey. But Boston Dynamics wanted to improve their robot, so they designed a second version—Atlas, The Next Generation. They first released a YouTube video of it in February 2016 during which it walked on snow. Subsequent videos showed Atlas doing a backflip and jumping over a dog lying in the grass.*
To build this updated version, Boston Dynamics used 3D printing to make parts of the robot look more like an animal. For example, its upper leg has hydraulic pathways, actuators, and filters all embedded and printed as one piece. That was not possible before 3D printing. They designed the structure using the knowledge of Atlas’s loads and behaviors, based on data from previous interactions of the original Atlas robots with the environment. They also added software simulations. With the 3D-printing technique, Boston Dynamics transformed what was once a big, bulky, and slow robot weighing around 375 pounds into a much slimmer version at 165 pounds.*
Boston Dynamics is not only focused on building humanoid robots; it is also developing robots that look quite different. They have two robotic dogs, Spot and SpotMini.* Like Atlas, the dogs can enter areas unsafe for humans in order to clear out the space. Using cameras, the dogs look at the terrain, assess the elevation of the floor, and figure out where they can step and how to climb to another region.* These robotic machines continue to improve and become more agile and less clunky. The latest version dances to Bruno Mars’s hit song “Uptown Funk.” I believe this is only the beginning of the robotic revolution. Spot and other robots may end up in our everyday lives.
Giants like Amazon have been working on robots to increase their companies’ productivity. At an Amazon warehouse, small robots help packers for the online retail giant.* These automated machines cruise around the warehouse floor, delivering shelves full of items to humans, who then pick, pack, and ship the items without taking more than a couple of steps.
Figure: A Kiva robot in an Amazon warehouse.
This automation is a considerable change for Amazon, where humans used to select and pack items themselves with only the help of conveyor belts and forklifts. With the introduction of Kiva Systems’ robots, the Amazon warehouse processes completely changed. Now, humans stand in a set location, and robots move around the warehouse, alleviating most of the manual labor.
This change occurred when Amazon acquired Mick Mountz’s Kiva Systems for $775M in 2012.* After working for years in business processes at Webvan, a now-defunct e-commerce startup, Mick realized that one of the reasons for its downfall was the high cost of order fulfillment.* In 2001, after the dot-com bubble burst, the company filed for bankruptcy and later became part of Amazon. Mick found a better way to handle orders inside warehouses and started Kiva Systems with the help of robotics experts.
In a typical warehouse, humans fill orders by wandering through rows of shelves, often carrying portable radio-frequency scanners to locate products. Computer systems and conveyor belts sped things up but only to a point. With the help of robots, however, workers at Amazon process items three times faster and do not need to search for products. When an order comes into Amazon.com, a robot drives around a grid of shelves, locates the correct shelf, lifts the shelf onto its back, and delivers it to a human worker.* The person then completes the process by picking up the order, packing it, and shipping it. Humans do not get much rest, so to avoid human error, a red laser flashes on the item so that the human knows what to pick up. The robot, then, returns the shelf to the grid. As soon as the robot takes away the shelf, another one arrives so that the human is always working.
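The step in which a robot finds its way across a grid to the right shelf can be pictured as a shortest-path search. The sketch below uses breadth-first search on a small text map. It is an assumption made for illustration, not a description of how Kiva's software actually routes robots, and the floor layout, the `#` obstacle marker, and the name `shortest_path` are hypothetical.

```python
from collections import deque


def shortest_path(grid, start, goal):
    """Breadth-first search over a grid; '#' cells are blocked.

    Returns the list of (row, col) cells from start to goal, or None if
    the goal cannot be reached.
    """
    rows, cols = len(grid), len(grid[0])
    came_from = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in came_from:
                came_from[(nr, nc)] = cell
                queue.append((nr, nc))
    return None


# A toy warehouse floor: '.' is open floor, '#' is an occupied cell.
floor = [
    "....",
    ".##.",
    "....",
]
path = shortest_path(floor, start=(0, 0), goal=(2, 3))
print(path)
```

A real fleet also has to keep many robots from colliding and decide which robot fetches which shelf, which this single-robot sketch leaves out.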
To function, robots need an operating system that can distill high-level instructions down to the hardware. This requirement is the same as for standard computers which need to communicate with their hard drive and display. Robots need to pass information to their components, like arms, cameras, and wheels. In 2007, Scott Hassan, an early Google engineer who previously worked with Larry Page and Sergey Brin, started Willow Garage to advance robotics. The team developed the Robot Operating System (ROS) for its own robots, one of which was the Personal Robot 2 (PR2). Ultimately, they shared the open-source operating system with other companies before closing their doors in 2014.*
The PR2 had two strong arms that performed delicate tasks like turning a page in a book. It contained pressure sensors in the arms as well as stereo cameras, a light detection and ranging (LIDAR) sensor, and inertial measurement sensors.* These sensors provided data for the robot to navigate in complex environments. Willow Garage developed ROS to understand the signals from these sensors as well as to control them.
Figure: Personal Robot 2.
ROS included a middle layer, which communicated between the software written by developers and the hardware, as well as software for object recognition and many other tasks.* It provided a standard platform for programming different hardware and a growing array of packages that gave robots new capabilities. The platform included libraries and algorithms for vision, navigation, and manipulation, among other things.
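The idea of a middle layer between application code and hardware can be illustrated with a toy publish/subscribe bus. ROS does organize communication around named topics that programs publish and subscribe to, but the code below is not the ROS API: the `MessageBus` class, the `range_sensor` topic, and the half-meter threshold are all hypothetical stand-ins written in plain Python.

```python
from collections import defaultdict


class MessageBus:
    """A toy middle layer: routes messages from publishers to subscribers by topic."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register a function to call whenever a message arrives on a topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver a message to every function subscribed to the topic."""
        for callback in self._subscribers[topic]:
            callback(message)


bus = MessageBus()
motor_commands = []


def avoid_obstacles(distance_m):
    """Application code: decide what the wheels should do from one range reading."""
    motor_commands.append("stop" if distance_m < 0.5 else "forward")


# The application never talks to the sensor driver directly, only to the topic.
bus.subscribe("range_sensor", avoid_obstacles)

# A sensor driver would publish readings like these as they arrive.
bus.publish("range_sensor", 2.0)
bus.publish("range_sensor", 0.3)

print(motor_commands)
```

Because the application only knows the topic name, the same obstacle-avoidance code could run on a different robot whose sensor driver publishes to that topic.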
ROS enabled hobbyists and researchers to more easily develop applications on top of hardware. With ROS, robots play instruments, control high-flying acrobatic machines, walk, and fold laundry.* Currently, ROS is under development by other hardware businesses like self-driving car companies. The newest version of the software, ROS 2.0, has many new capabilities including real-time control and the ability to manage multiple robots. As these systems improve, we may eventually have robots performing our house cleaning chores.
Will robots inherit the earth? Yes, but they will be our children. Marvin Minsky*
Robots cannot yet operate reliably in people’s homes and labs nor manipulate and pick up objects.* If we are to have robots in our day-to-day lives, it is essential to create robots that can robustly detect, localize, handle and move, and change the environment the way we want. We need robots that can pick up coffee cups, serve us, peel bananas, or even walk around without tripping or hitting walls. The problem is that human surroundings are complex, and robots today cannot pick up most objects. If you ask a robot to pick up something it has never seen before, it almost always fails. To accomplish that goal, it must solve several difficult problems.
For example, if you ask a robot to pick up a ruler, the robot first needs to determine which object is a ruler, where it is, and finally, calculate where to put its gripper based on that information. Or, if you want a robot to pick up a cup of coffee, the robot must decide where to pick it up. If the gripper picks up the cup from the bottom edge, it might tip over and spill. So, robots need to pick up different objects from different locations.
The ultimate challenge for Amazon is to build robots to do all the picking, packing, and shipping in their warehouses,* and they are not resting until they reach that ambitious goal.* So, Amazon created the annual Amazon Picking Challenge, held from 2015 to 2017, for teams from around the world to make robots excel at picking up objects. This was the go-to competition for picking up and handling different objects. Amazon chose the items, and teams spent months optimizing their robots for the task. Unfortunately, none of the programmed robots could handle any object outside the original parameters, meaning that the robots were overtrained and incapable of learning outside the training data.
Figure: The robot that Stefanie Tellex programmed to pick up objects.
In Amazon’s challenge, each team created a robot with cameras and processors, which had to pick up items and place them in a specified location. The teams competed in front of an audience for prize money. One competition included 25 items for the robots to retrieve, including a rubber duck, a bag of balls, and a box of Oreo cookies. The teams had 20 minutes to fetch and package as many items as possible from an Amazon shelving unit.*
Some teams used claws, but most used a suction system attached to a vacuum. The challenge lies in the fact that the human hand has 27 degrees of freedom, and our brain recognizes numerous objects. But each year, teams performed better and better. At some point, instead of humans doing the tedious work of picking, packing, and shipping packages, robots will do it 24/7, delivering packages cheaper and faster.
To solve the problem of handling different objects, one approach was to make them identifiable to the robot by adding QR codes on top of the objects so that the robot knows exactly which object it is and how to handle it. This approach, however, does not work in the real world.
Figure: Professor Stefanie Tellex.*
To solve this problem, Professor Stefanie Tellex at Brown University works on a completely different approach. She makes robots learn on their own how to manipulate new objects by automating the process of learning how to pick them up. Robots need to learn how to pick up items just like humans do so that they do not need to study an object before they can pick it up. In other words, robots need to learn how to pick up new objects with high precision and high sensitivity (or recall).
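To make the terms precision and recall concrete, here is a minimal sketch in Python. The counts are hypothetical and are not taken from Tellex's experiments. Precision is read here as the share of the robot's picks that grabbed the intended item, and recall as the share of the items it was supposed to pick that it actually retrieved.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from raw counts of pick attempts."""
    attempted = true_positives + false_positives  # everything the robot picked
    pickable = true_positives + false_negatives   # everything it should have picked
    precision = true_positives / attempted if attempted else 0.0
    recall = true_positives / pickable if pickable else 0.0
    return precision, recall

# Hypothetical tally: 90 correct picks, 10 wrong items grabbed, 30 items missed.
precision, recall = precision_recall(90, 10, 30)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

A robot that only attempts the easiest items can score high precision with low recall, which is why both numbers matter.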
Tellex built a system that lets a robot learn to handle an object it has never seen before. To do that, she created a light view perception system, which transforms the images captured by the robot’s camera into a projection of the object, allowing the robot to pick it up. The system creates a synthetic camera using software to render an orthographic projection of the item. Tellex’s robot moves its arm above and around the object, taking multiple pictures with its camera and measuring its depth with its infrared sensor. By combining different representations, the robot can create an image of the object from any possible angle. The idea behind the technique is that the robot not only figures out the intensity of the light coming to the camera but also the direction of individual rays. This makes it possible to build a 3D model of the object and the scene. The robot is then able to detect and localize the object within two millimeters, which is the limit of the camera.
Then, it tries different grasps to lift the item. Once gripped, the robot plays with the object to learn more about it and shakes it to make sure the grip is secure. When successful, the robot has learned to pick up a new item. After this learning experience, it can robustly manipulate this object. Not only that, but the robot can perform this learning process over and over again with different objects. Together with the light view perception system, Tellex’s group uses reinforcement learning to train robots to pick up unfamiliar objects even when lighting conditions are challenging. The robot learns by trying different grips and reinforcing behavior that seems to produce positive results. This allows Tellex’s robot to pick up objects in normally challenging situations, like grabbing a fork from a sink with running water, which would be extremely tricky to program manually. But all of this robotics development would not happen without training data or an operating system.
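The trial-and-error idea, trying different grips and reinforcing the ones that work, can be sketched with a very simple reinforcement learning scheme called an epsilon-greedy bandit. This is an illustration only, not Tellex's actual algorithm. The grasp names and success probabilities below are made up, and a real robot would receive its reward from sensors rather than from a random number generator.

```python
import random

def learn_best_grasp(success_prob, trials=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy trial and error over a fixed set of candidate grasps.

    success_prob maps a grasp name to its hidden chance of holding the
    object. These values are hypothetical and stand in for the real world.
    """
    rng = random.Random(seed)
    grasps = list(success_prob)
    attempts = {g: 0 for g in grasps}
    value = {g: 0.0 for g in grasps}  # running estimate of each success rate

    for _ in range(trials):
        if rng.random() < epsilon:
            grasp = rng.choice(grasps)          # explore: try a random grasp
        else:
            grasp = max(grasps, key=value.get)  # exploit: use the best so far
        # Simulated outcome: 1.0 if the object stayed in the gripper.
        reward = 1.0 if rng.random() < success_prob[grasp] else 0.0
        attempts[grasp] += 1
        # Incremental mean: nudge the estimate toward the observed reward.
        value[grasp] += (reward - value[grasp]) / attempts[grasp]

    return max(grasps, key=value.get), value

best, estimates = learn_best_grasp({"handle": 0.9, "rim": 0.4, "base": 0.1})
print(best, estimates)
```

The small exploration rate keeps the robot occasionally testing grips it currently rates poorly, so an unlucky first attempt does not rule out a good grasp forever.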
So that robots do not have to manipulate and learn how to grip a new object every time they see one, Tellex created a database of objects that robots would typically grasp.* She created a Million Object Challenge that accelerated the field by collecting and sharing data of these objects. People do not usually take pictures of door handles but instead take photos of more interesting things or selfies, so Tellex had to create a specific dataset for her needs.
Think of this as ImageNet for robots. The idea is that a robot can learn from this huge database, and when the robot is in a new environment, it will already know how to grip every object based on the data gathered by other robots. Tellex and her group have already collected data for around 200 items, including a plastic boat and a rubber duck, and other scientists can contribute their robot’s data to the project. Tellex’s goal is to build a library with one million different objects so that eventually, robots can identify any object in front of them and pick it up.
Researchers from Princeton and Stanford University, led by PhD student Angela Dai and Professor Thomas Funkhouser, created a dataset, ScanNet, that includes 3D views of thousands of scenes and millions of annotated objects like couches and lamps.* They created this dataset by scanning around 1,500 scenes using an iPad with an infrared depth sensor like the Microsoft Kinect. The resulting dataset is one order of magnitude larger than the second biggest dataset. Google’s AI research laboratory already uses this dataset to train its robots in simulations so that they can learn how to pick objects out of a bucket. ScanNet is extremely important for deep learning algorithms.
At the University of California, Berkeley, researchers also built a dataset comprising more than 1,000 objects with information of their 3D shape, visual appearance, and the physics of grasping them.* With such a dataset, the same researchers built robots that can pick up and shake objects in mid-air without dropping them 98% of the time. This is a much higher success rate compared to previous attempts. The results were, in part, because they trained the software for the robot in a 3D simulation before using it. The simulation-trained models are then successfully used in the physical world.
When performing tasks, robots still look robotic and their actions clunky because they follow a sense-plan-act paradigm.* That means that for every moment the robot interacts with the world, it must observe the world, create a model of what it senses, form a plan based on that, and then execute it. The old approach solved this problem modularly and tended not to work in cluttered environments, which are very natural in the real world. Perception is often imprecise, and so the models are often wrong and need to change.
To solve this problem and make robots move faster and be more reflexive, Google uses deep learning for its models, training the neural networks using a reinforcement learning algorithm, so that robots can act quickly. Google first trained their robots to imitate human behavior by observing human demonstrations of the intended action.* They built a robot that, for example, could pour from a cup after less than 15 minutes of observing humans performing this task from different viewpoints.
Google is also working with robot arms to make them learn how to grasp. They created a reinforcement learning algorithm that teaches a deep learning model used in the robot to learn how to grip objects.* One experiment used seven robots running for a total of 800 robot hours over the course of four months. With the information at hand, Google trained simulations of the robot with 10 GPUs and many CPUs, processing around 580,000 attempts.*
The learned model gave the robot arms a 96% success rate in 700 trials on previously unseen objects. Google showed that with deep learning and reinforcement learning, it is possible to train robots to grasp unknown objects successfully. Google is not the only company building deep learning models for robots; other research institutes, like OpenAI, have also done it successfully.*
Whether you think you can, or you think you can’t, you’re right.*
Many companies currently build technology for autonomous cars, and others are just entering the field. The three most transformative players in the space are Tesla, Google’s Waymo, and George Hotz’s Comma.ai. Each of these companies tackles the problem with a very different approach. In some ways, self-driving cars are robots that require solving both hardware and software problems. A self-driving car needs to identify its surrounding environment with cameras, radar, or other instruments. Its software needs to understand what is around the car, know its physical location, and plan the next steps it needs to take to reach its destination.
Tesla, founded by Martin Eberhard and Marc Tarpenning in 2003, is known as the Apple of cars because of its revolutionary car design and outside-the-box thinking when creating its vehicles.* Tesla develops its cars based on first principles, from the air conditioning system that uses perpendicular vents to how they form their chassis and suspension. With its innovation and work, the Tesla Model 3 is the safest car in the world,* followed by the Tesla Model S and Model X.* But Tesla is not only innovative with their hardware, it also invests heavily in its Autopilot technology.
In 2014, Tesla quietly installed several pieces of hardware to increase the safety of their vehicles—12 ultrasonic sensors, a forward-facing camera, a front radar, a GPS, and digitally controlled brakes.* A few months later, they released a technology package for an additional $4,250 to enable the use of the sensors. In a rapid release streak, Tesla launched features in the upcoming months, and a year later, rolled out its first version of the Autopilot—known as Tesla Version 7.0—to 60,000 cars.
Autopilot gave drivers features like steering within a lane, changing lanes, and automatic parking. Other companies, including Mercedes, BMW, and GM, already offered some of the capabilities, however. But self-steering was a giant leap toward autonomy that was released suddenly, overnight, as a software update. Tesla customers were delighted with the software update, releasing videos on the internet of the software “driving” their Teslas, hands-free.
Tesla not only makes the software but also the hardware for its cars, enabling it to release new features and update its software over the air (OTA). Because it has released cars that have the necessary hardware components for self-driving capability since 2014, Tesla has a widely distributed test fleet. Other companies, like Google and GM, only have a small fleet of cars with the required hardware for self-driving.
From the introduction of the Tesla hardware package until November 2018,* a total of 50 months, Tesla accrued around 1 billion miles driven with the newest hardware.* Not only that, but the Tesla servers store the data these cars accumulate so that the Autopilot team can make changes to its software based on what it learns. At the time of this writing, Tesla had collected around 5.5 million miles of data per day for its newest system, taking only around four hours to gather 1 million miles. For comparison, Waymo has the next most data with about 10 million miles driven in its lifetime. In two days, Tesla acquires more data from its cars than Waymo has in its lifetime.
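The comparison above is simple arithmetic, and a few lines of Python make it easy to check. The constants are the figures quoted in the text. The function names are only illustrative.

```python
TESLA_MILES_PER_DAY = 5.5e6    # daily data collection quoted in the text
WAYMO_LIFETIME_MILES = 10e6    # Waymo's lifetime total quoted in the text

def hours_to_collect(miles, miles_per_day):
    """Hours needed to log `miles` at a steady daily rate."""
    return 24 * miles / miles_per_day

def days_to_collect(miles, miles_per_day):
    """Days needed to log `miles` at a steady daily rate."""
    return miles / miles_per_day

hours_per_million = hours_to_collect(1e6, TESLA_MILES_PER_DAY)
days_to_match_waymo = days_to_collect(WAYMO_LIFETIME_MILES, TESLA_MILES_PER_DAY)

print(f"{hours_per_million:.1f} hours per million miles")
print(f"{days_to_match_waymo:.1f} days to match Waymo's lifetime total")
```

At 5.5 million miles per day, a million miles takes a little over four hours, and Waymo's lifetime total is matched in just under two days, which agrees with the claims in the paragraph.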
This data collection rate increases with more cars on the streets, and Tesla has been speeding up their production pace. Even though Tesla has more miles accumulated than its competitors,* when it tested its self-driving capability with the California Department of Motor Vehicles (DMV)—the state government organization that regulates vehicle registration—Tesla had a much higher count of disengagements compared to other competitors.*
Disengagements are a metric that the average person can use to compare autonomous systems.* It provides a rough count of how often the car’s system fails so badly that the test driver takes over. It is only a proxy of the performance because this metric does not take into account variables that may affect the vehicle, like weather, or how and where these problems occurred. An increase in disengagement could mean that a major problem exists or that the company is testing its software in more challenging situations such as a city.
At the end of 2015, Tesla numbers showed that it was far behind its competitors. If we normalize the numbers of miles per disengagement, Tesla had 1,000 times worse software compared to Waymo. But Tesla continues to hone its system, year after year. And, Tesla has an advantage over other carmakers: It can update the system over the air and make it better without having to sell new cars or have existing ones serviced.
Figure: Comparing miles per disengagement.*
Waymo’s self-driving fleet has the lowest number of disengagements per mile, but even this metric does not yet approach human performance. Waymo has 1 disengagement per 1,000 miles. If we consider a human disengagement as being when a human is driving and there is an accident, then theoretically, humans have around 100 times fewer disengagements than Waymo’s self-driving software.
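Normalizing by distance is what makes these comparisons possible. The sketch below computes miles per disengagement from the figures quoted above. The human number is derived from the text's rough "100 times fewer" estimate and is not a measured value.

```python
def miles_per_disengagement(miles, disengagements):
    """Average distance driven between disengagements (higher is better)."""
    if disengagements == 0:
        return float("inf")  # no failures were observed over this distance
    return miles / disengagements

# Waymo: 1 disengagement per 1,000 miles, as quoted in the text.
waymo = miles_per_disengagement(1_000, 1)

# Humans: around 100 times fewer "disengagements" (accidents) than Waymo,
# which means roughly 100 times more miles between them.
human = waymo * 100

print(f"Waymo: {waymo:,.0f} miles per disengagement")
print(f"Human: {human:,.0f} miles per accident (rough estimate)")
```

Under these assumptions, a human driver covers on the order of 100,000 miles between accidents, against 1,000 miles between disengagements for Waymo.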
But Tesla has another advantage: It has a large fleet of cars enabled for testing its newest self-driving car software update. This technology enables Tesla to develop software in-house and release it in shadow mode for millions of miles before releasing the software to the public.* Shadow mode allows Tesla to silently test its algorithms in customers’ cars, which provides the company with an abundant testbed of real-world data.
Figure: Image courtesy of Velodyne LiDAR.*
LIDAR, or light detection and ranging, is a sensor similar to radar; its name is a portmanteau of light and radar.* LIDAR maps physical space by bouncing laser beams off objects. Radar cannot see much detail, and cameras do not perform as well in conditions of low light or glare.* LIDAR lets a car “see” what is around it with much more detail than other sensors. The problem with LIDAR is that it does not work well in several weather conditions, including fog, rain, and snow.*
Unlike other companies, Tesla bets that they can run a self-driving car that performs better than a human without a LIDAR hardware device.
Another problem is that LIDAR is expensive, originally starting at around $75K, although the cost is now considerably less,* and the hardware is bulky, resembling KFC buckets.* LIDAR helps autonomous cars build a 3D model of the world around them and locate themselves within it, a process called simultaneous localization and mapping (SLAM). Still, Tesla continues to improve its software and lower its disengagement rate without such a device. Tesla’s bet rests on the idea that to perform as well as humans, cars need only the same type of hardware. Humans drive only with their eyes, so it makes sense that self-driving cars could perform as well as humans with cameras alone.
A Tesla vehicle running the Autopilot software ran into a tractor-trailer in June 2016 after its software could not detect the trailer against the bright sky, resulting in the death of its driver. According to some, LIDAR could have prevented that accident. Since then, Tesla added radars to its cars for these situations. One of the providers of the base software, Mobileye, parted ways with Tesla because of the fatality. They thought Tesla was too bullish when introducing its software to the masses and that it needed more testing to ensure safety for all. Unfortunately, fatalities with self-driving software will always occur, just as with human drivers. Over time, the technology will improve, and the disengagement rates will decrease. I predict a time when cars are better than humans at driving, at which point cars will be safer drivers than humans. But deaths will inevitably occur.
Before that fatality, Tesla used Mobileye software to detect cars, people, and other objects in the street. Because of the split, Tesla had to develop the Autopilot 2 package from scratch, meaning it built new software to recognize objects and act on them. It took Tesla two years to be in the same state as before the breakup. But once it caught up with the old system, it quickly moved past its initial features.
For example, the newest Tesla Autopilot software, version 9.0, has the largest vision neural network ever trained.* Tesla based the neural network on Google’s famous vision neural network architecture, Inception. Tesla’s version, however, is ten times larger than Inception and has five times the number of parameters (weights). I expect that Tesla will continue to push the envelope.
Tesla is not the only self-driving company at the forefront of technology. In fact, Google’s Waymo was one of the first companies to start developing software for autonomous cars. Waymo is a continuation of a project started in a laboratory at Stanford 10 years before the first release of the Tesla Autopilot. It won the DARPA Grand Challenge for self-driving cars, and because of its notoriety, Google acquired it five years later, forming Waymo. Waymo’s cars perform much better than any other self-driving system, but what is surprising is that they have many fewer miles driven in the real world than Tesla and other self-driving car makers.*
The DARPA Grand Challenge began in 2004 with a 150-mile course through the desert to spur development of self-driving cars. During the first year, no vehicle finished: the one that went farthest completed only about seven of the miles, and every vehicle crashed, failed, or caught fire.* The technology required for these first-generation cars was sophisticated, expensive, bulky, and not visually attractive. But over time, the cars improved, needing less hardware. While the initial challenge was limited to a single location in the desert, it expanded to city courses in later years.
With Waymo as the first winner of the competition, they became the leader of the autonomous car sector. Having the lowest disengagement rate per mile of any self-driving car system means that they have the best software. Some argue that the primary reason for Waymo performing better than the competition is that it tests its software in a simulated world. Waymo, located in a corner of Alphabet’s campus, developed a simulated virtual world called Carcraft—a play on words referring to the popular game World of Warcraft.* Originally, this simulated world was developed to replay scenes that the car experienced on public roads, including the times when the car disengaged. Eventually, Carcraft took an even larger role in Waymo’s self-driving car software development because it simulated thousands of scenarios to probe the car’s capability.
Waymo used this virtual reality to test its software before releasing it to the real-world test cars. In the simulation, Waymo created fully modeled versions of cities like Austin, Mountain View, and Phoenix as well as other test track simulations. It tested different scenarios in many simulated cars, around 25,000 of them at any single time. Collectively, the cars drive about 8 million miles per day in this virtual world. In 2016 alone, the virtual autonomous cars logged approximately 2.5 billion virtual miles, much more than the 3 million miles Waymo’s cars drove on public roads. Its simulated world has logged roughly 1,000 times more miles than its actual cars have.
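A quick calculation puts these simulation numbers in proportion. The constants below are the figures quoted in the text, and the variable names are only illustrative.

```python
SIM_CARS = 25_000          # simulated cars running at any one time
SIM_MILES_PER_DAY = 8e6    # virtual miles driven per day, all cars combined
SIM_MILES_2016 = 2.5e9     # virtual miles logged in 2016
ROAD_MILES_2016 = 3e6      # miles driven on public roads in 2016

# Average daily distance covered by a single simulated car.
miles_per_sim_car_per_day = SIM_MILES_PER_DAY / SIM_CARS

# How many virtual miles were logged for every real-world mile.
sim_to_road_ratio = SIM_MILES_2016 / ROAD_MILES_2016

print(f"{miles_per_sim_car_per_day:.0f} miles per simulated car per day")
print(f"about {sim_to_road_ratio:.0f} virtual miles per real mile")
```

Each simulated car averages 320 miles per day, and the ratio of virtual to real miles comes out at about 830, the same order of magnitude as the figure of 1,000 quoted above.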
The power of these simulations is that they train and test the models with software created for interesting and difficult interactions instead of the car simply putting in miles. For example, Carcraft simulates traffic circles that have many lanes and are hard to navigate. It mimics when other vehicles cut off the simulated car or when a pedestrian unexpectedly crosses the street. These situations rarely happen in the real world, but when they do, they can be fatal. These reasons are why Waymo has a leg up on its competitors. It trains and tests its software in situations other competitors cannot without the simulated world, regardless of how many miles they log. Personally, I believe testing in the simulated world is essential for making a safe system that can perform better than humans.
The simulation makes the software development cycle much, much faster. For developers, the iteration cycle is extremely important. Instead of taking weeks like in the early days of Waymo’s software construction, the cycle changed to a matter of minutes after developing Carcraft, meaning engineers can tweak their code and test it quickly instead of waiting long periods of time for testing results.
Carcraft helps engineers tweak the software and make it better, but the problem is that a simulation does not test situations where there are oil slicks on the road, sinkhole-sized potholes, or other weird anomalies that might be present in the real world but are not part of the virtual world. To test those, Waymo created an actual test track that recreates the diverse scenarios these cars can encounter.
As the software improves, Waymo downloads it to their cars and runs and tests it on the test track before uploading it to the cars in the real world. To put this into perspective, Waymo reduced the disengagement rate per mile by 75% from 2015 to 2016.* Even though Waymo had a head start in creating a simulated world for testing its software, many other automakers now have programs to create their own simulations and testbeds.
Some report that the strategy for Waymo is to build the operating system for self-driving cars. Google had the same strategy when building Android, the operating system for smartphones. They built the software stack for smartphones and let other companies, like Samsung and Motorola, build the hardware. For self-driving cars, Waymo is building the software stack and wants the carmakers to build the hardware. It reportedly tried to sell its software stack to automakers but was unsuccessful. Auto companies want to build their own self-driving systems. So, Waymo took matters into their own hands and developed an Early Rider taxi service with about 62,000 minivans.* In December 2018, Waymo One launched a 24-hour service in the Phoenix area that opened up its ride-sharing service to a few hundred preselected people, expanding its private taxi service. These vans, however, will have a Waymo employee in the driver’s seat. This might be the solution to run its self-driving cars in the real world at first, but it will be difficult to see that solution scale up.
Another important player in the self-driving ecosystem is Comma.ai, started in 2015 by George Hotz, a hacker then in his mid-twenties.* In 2007, at the age of 17, he became famous for being the first person to hack the iPhone to use on networks other than AT&T. He was also the first person to hack the Sony PlayStation 3 in 2010. Before building a self-driving car, Hotz lived in Silicon Valley and worked for a few companies including Google, Facebook, and an AI startup called Vicarious.
Figure: George Hotz and his first self-driving car, an Acura.
Hotz started hacking self-driving cars by retrofitting a white 2016 Acura ILX with a LIDAR on the roof and a camera mounted near the rearview mirror. He added a large monitor where the dashboard sits and a wooden box with a joystick, where you typically find the gearshift, that enables the self-driving software to take over the car. It took him about a month to retrofit his Acura and develop the software needed for the car to drive itself. Hotz spent most of his time adding sensors, the computer, and electronics. Once the systems were up and running, he drove the car for two and a half hours to let the computer observe him driving. He returned home and downloaded the data so that the algorithm could analyze his driving patterns.
The software learned that Hotz tended to stay in the middle lane and maintained a safe distance from the car in front of it. Two weeks later, he went for a second drive to provide more hours of training and also to test the software. The car drove itself for long stretches while remaining within the lanes. The lines on the dash screen—one showed the car’s actual path and the other where the computer wanted to go—overlapped almost perfectly. Sometimes, the Acura seemed to lock onto the car in front of it or take cues from a nearby car. After automating the car’s steering as well as the gas and brake pedals, Hotz took the car for a third drive, and it stayed in the center of the lane perfectly for miles and miles, and when a car in front of it slowed, so did the Acura.
Figure: George Hotz’s self-driving car.
The technology he built as an entrepreneur represents a fundamental shift from the expensive systems designed by Google to much cheaper systems that depend on software more than hardware. His work impressed many technology companies including Tesla. Elon Musk, who joined Tesla after a Series A funding round and is their current CEO, and Hotz met at Tesla’s Fremont, California, factory and discussed artificial intelligence. The two settled on a deal where Hotz would create software better than Mobileye’s, and Musk would compensate him with a contract worth about $1M per year. Unfortunately, Hotz walked away after Musk continually changed the terms of the deal. “Frankly, I think you should just work at Tesla,” Musk wrote to Hotz in an email. “I’m happy to work out a multimillion-dollar bonus with a longer time horizon that pays out as soon as we discontinue Mobileye.” “I appreciate the offer,” Hotz replied, “but like I’ve said, I’m not looking for a job. I’ll ping you when I crush Mobileye.” Musk simply answered, “OK.”*
Since then, Hotz has been working on what he calls the Android of self-driving cars, comparing Tesla to the iPhone of autonomous vehicles. He launched a smartphone-like device, which sells for $699 with software installed. The dash cam simply plugs into the most popular cars made in the United States after 2012 and provides the equivalent capability of Tesla Autopilot, meaning cars drive themselves on highways from Mountain View to San Francisco with no one touching the wheel.*
Figure: EON dash cam running chffrplus.*
But soon after launching the product, the National Highway Traffic Safety Administration (NHTSA) sent an inquiry and threatened penalties if Hotz did not submit to oversight considerations. In response, Hotz pulled the product from sale and pursued another path. He decided to market another product that was the hardware-only version of the product.
Then, in 2016, he open-sourced the software so that anyone could install it in the appropriate hardware. And with that, Comma.ai abstained from the responsibility of running its software in cars. But consumers still had access to the technology, allowing their cars to drive themselves. Comma.ai continues to develop its software, and drivers can buy the hardware and install the software in their cars. Some people estimate that around 1,000 of these modified cars run on the streets now.
Recently, Comma.ai announced that it has become profitable.*
Figure: The parts of the autonomous car’s brain.
Three main parts form the brain of an autonomous car: localization, perception, and planning. But even before tackling these three items, the software must integrate the data from different sensors, such as cameras, radars, LIDAR, and GPS. Different techniques ensure that if data from a given sensor is noisy, meaning it contains unwanted or unclear data, then other sensors help out with their information. And, there are methods for merging data from these different sensors.
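One common way to merge readings so that a noisy sensor counts for less is inverse-variance weighting, where each estimate is weighted by how much it can be trusted. The sketch below is a minimal illustration of that general technique and does not describe the method used by any particular carmaker. The sensor values and variances are hypothetical.

```python
def fuse(estimates):
    """Fuse independent (value, variance) estimates of the same quantity.

    Each estimate is weighted by 1 / variance, so noisier sensors
    contribute less. Returns the fused (value, variance).
    """
    weights = [1.0 / variance for _, variance in estimates]
    total = sum(weights)
    value = sum(w * v for w, (v, _) in zip(weights, estimates)) / total
    return value, 1.0 / total

# Hypothetical distance to the car ahead, in meters:
radar = (20.0, 0.25)   # precise reading, low variance
camera = (22.0, 4.0)   # noisy reading, for example because of glare

distance, variance = fuse([radar, camera])
print(f"fused distance: {distance:.2f} m, variance: {variance:.3f}")
```

The fused distance lands close to the radar reading because the radar is far less noisy here, and the fused variance is lower than that of either sensor alone, which is the benefit of combining them.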
Once data has been acquired, the next step for the software is to know where it is. This process includes finding the physical location of the vehicle and which direction the car needs to head, for example, which exits it needs to take to deliver the passenger to their destination. One potential solution is to use LIDAR with background subtraction to match the sensor data to a high-definition map.
Figure: Long tail.
The next part of the software stack is harder. Perception basically involves answering the question of what is around the vehicle. A car needs to find traffic signs and determine which color they are. It needs to see where the lane markings are and where cars, trucks, and buses are. Perception includes lane detection, traffic light detection, object detection and tracking, and free space detection.
The hardest part of this problem is in the long tail, which describes the diverse scenarios that show up only occasionally. When driving, that means situations like traffic lights with different colors from the standard red, yellow, and green or roundabouts with multiple lanes. These scenarios happen infrequently, but because there are so many different possibilities, it is essential to have a dataset large enough to cover them all.
The last step, path planning, is by far the hardest. Given the car’s location, its surroundings, and its passengers’ destination, how does it get there? The software must calculate the next steps to getting to the desired place, including route planning, prediction, behavior planning, and trajectory planning. The solution ideally includes mimicking human behavior based on actual data from people driving.
These three steps combine to form the actions cars need to take based on the information given. The system decides whether the vehicle needs to turn left, brake, or accelerate. The instructions fed to a control system ensure the car does not do anything unacceptable. This system comes together to make cars drive themselves through the streets and forms the “magic” behind cars driven by Tesla, Waymo, Comma.ai, and many others.
As stated earlier, traffic fatalities are inevitable, and, therefore, these companies must address the ethical concerns associated with the technology. The software algorithms determine what action autonomous vehicles perform. When a collision is unavoidable, in what order should the events occur?
This is a thought experiment described as the trolley problem. For example, it is a straightforward decision to have the car run into a fire hydrant instead of hitting a pedestrian. And while some may disagree, it is more humane to hit a dog in a crosswalk rather than a mother pushing a baby in a stroller. But that, I believe, is where the easy decisions end. What about hitting an older adult as opposed to two young adults? Or, in a most extreme case, is it better to choose to run the car off a cliff, killing the driver and all passengers, instead of plowing into a group of kindergarten students?*
Society sometimes focuses too much on technology instead of looking at the complete picture. In my opinion, we must encourage the ethical use of science, and, as such, we need to invest the proper resources into delving into this topic. It is by no means easy to solve, but allocating the appropriate means for discussing this topic only betters our society.
But the worries about operatorless elevators were quite similar to the concerns we hear today about driverless cars. Garry Kasparov*
There is a lot of talk about self-driving cars and how they will one day replace truck drivers, and some say that the transition will happen all of a sudden. In fact, the change will happen in steps, and it will start in a few locations and then expand rapidly. For example, Tesla is releasing software updates that make their car more and more autonomous. It first started releasing software that let its cars drive on highways, and with a later software update, its cars were able to merge into traffic and change lanes. Waymo is now testing its self-driving cars in downtown Phoenix. But it might not be surprising if Waymo starts rolling out their service in other areas.
The industry talks about five levels of autonomy to compare different cars’ systems and their capabilities. Level 0 is when the driver is completely in control, and Level 5 is when the car drives itself and does not need driver assistance. The other levels range between these two. I am not going to delve into the details of each level because the boundaries are blurry at best, and I prefer to use other ways to compare them, such as disengagements per mile. However they are measured, as the systems improve, autonomous cars can prevent humans from making mistakes and help avoid accidents caused by other drivers.
Self-driving cars will reduce, and could nearly eliminate, car accidents, which kill around 1 million people globally every year. Already, the number of annual deaths per billion miles has decreased due to safety features and improvements in vehicle design, like the introduction of seatbelts and airbags. Cars are now more likely to incur the damage and absorb the impact from an accident, reducing the injuries to passengers.
Figure: US vehicle miles traveled and proportionate mortality rate. Number of miles driven by cars versus the number of annual deaths per billion miles driven.
Autonomous driving will reduce the total number of accidents and deaths. In the United States alone, around 13 million collisions occur annually, of which 1.7 million cause injuries, and 35,000 people die. Driver error causes approximately 90% of the accidents, a third of which involve alcohol.* Autonomy can help prevent these disasters.
Deaths are not the only problem caused by accidents. They also have a huge economic effect. The US government estimates a cost of about $240B per year to the economy, including medical expenses, legal services, and property damage. In comparison, US car sales are around $600B per year. According to data from the US National Highway Traffic Safety Administration (NHTSA), the crash rate for Tesla cars was reduced by 40% after the introduction of the Autopilot Autosteer feature.* An insurer offered a 5% discount for Tesla drivers with the assist feature turned on.*
Autonomy will have an effect on traffic overall. Cars will not necessarily need to stop at traffic signs because they can coordinate among themselves to determine the best route or to safely drive at 80 miles per hour 2 feet away from each other. So, traffic flow will improve, allowing more cars on the streets. With fewer accidents, there might be less traffic congestion. Estimates say that as much as a third of car accidents happen because of congestion, and these create even more congestion. The impact of autonomy on congestion remains unclear since, to my knowledge, no studies exist yet. Self-driving cars will certainly increase capacity, but as the volume increases, so does demand. If it becomes cheaper or easier for people to use self-driving cars, then the number of people who use them will escalate.
Parking will also transform with autonomy because if the car does not have to wait for you within walking distance, then it can do something else when people do not need it.* The current parking model is a source of congestion, with some studies suggesting that a double-digit percentage of traffic in dense urban areas comes from people driving around looking for parking places. An autonomous car can wait somewhere else, and an on-demand car can simply drop you off and go pick up other passengers. But this new model might also create congestion because in both cases, the cars need to go pick up people rather than being parked and waiting for people to come to them. With enough density, the on-demand car might be the one that is already dropping off someone else close to you, similar to Uber’s model.
Parking is not only important for traffic but also for the use of land. Some parking is on the street, so removing it adds capacity for other cars driving or for people walking. For example, parking in incorporated Los Angeles County takes up approximately 14% of the land. Adding parking lots and garages is expensive, driving up construction prices and housing expenses.* A study in Oakland, California, showed that government-mandated parking requirements increased construction costs per apartment by 18%.
Removing the cost of drivers from on-demand services, like Uber and Lyft, reduces the expenditure by around 75%. Factor in the reduced price of insurance because of fewer car accidents, and the cost goes down even further. Transportation as a Service is the new business model.*
Transportation as a service is a type of service that enables consumers to move without having to buy or own vehicles.
Transportation as a Service (TaaS), also referred to as Mobility as a Service (MaaS) or Mobility on Demand (MoD),* will disrupt not only the transportation industry but also the oil industry with the addition of electric vehicles (EV). TaaS goes hand in hand with EVs: electric cars are much less expensive to maintain because, for one, their induction motors have fewer moving parts than the internal combustion engines (ICE) of gas-powered cars.* For autonomous vehicles in the TaaS sector, low maintenance costs are essential, as car rental companies know pretty well.
The average American family spends $9K on road transportation every year. Estimates are that they will save more than $5.6K per year in transportation costs with TaaS, leaving them to use that money in other areas like entertainment. Truly cheap, on-demand services will have even more consequences. As TaaS with self-driving cars becomes cheaper, we must rethink public transportation. If everyone uses on-demand services, then no one will need public transportation.
Transitioning all the people who are currently traveling through the underground subway system or elevated trains to cars on surface streets can increase congestion on the roads. In high-density areas, like New York City, people live in stacked buildings on different floors. If everyone needs to move at the same time, such as during rush hour, and go through only one “floor,” meaning the aboveground road system, then congestion will invariably happen. Therefore, self-driving vehicles need to be able to move around in a manner not dependent on only the surface streets. Autonomous vehicles should travel through many levels.
Figure: Kitty Hawk’s first prototype of its self-driving flying car.
One possibility is self-driving drones, something like a Jetsonian future. Kitty Hawk Corporation, a startup developed by Sebastian Thrun, already has a few prototypes of these flying cars.*
Controversy: Some argue that this solution might not work inside highly dense areas because these drones produce too much noise. And if they fail and crash, they can damage property or injure people.
The most recent prototype, however, is not as noisy as some claim. From a distance of 50 feet, these vehicles sound like a lawn mower, and from 250 feet, like a loud conversation. And their design is such that if the motor or one of the blades fails, they will not fall to the ground.
Another possibility for adding more levels for on-demand vehicles is to go under the ground, creating tunnels. But digging tunnels is a huge financial and construction investment. Elon Musk’s Boring Company focuses on reducing the cost of tunneling by a factor of ten by narrowing the tunnel diameter as well as increasing the speed of its Tunnel Boring Machine (TBM).* Its goal is to make the machine tunnel as fast as a snail crawls. Musk thinks that going underground is safer than flying vehicles and provides more capacity by adding more tunnels on different levels. The Boring Company already has a loop at the Las Vegas Convention Center.*
TaaS will have a direct impact on the driving industry as well as employment. In the United States alone, self-driving cars will impact around 200,000 taxi and private drivers and 3.5 million truck drivers.* Displacing truck drivers, in particular, will significantly impact the economy since truck driving is one of the largest professions in the United States.
Given that during peak driving hours only 10% of cars are in motion, we can expect that TaaS will result in fewer cars and that could affect production numbers. Over 10 million new cars are sold in the US market every year. With fewer needed for the same capacity, the total number introduced to the market might go down. Also, the cost of transportation will decline by a large factor because you need fewer resources to make the cars. Using TaaS will be much cheaper than owning a car because of the reduced usage as well as the fuel and maintenance savings when using EVs for autonomous driving.
Once in place, switching to TaaS is easy for consumers and requires no investment or contract, so I believe that the adoption rate will be high.* And as consumers’ comfort levels rise due to increased safety and less hassle, usage will spread. First, the switch will occur in high-density areas with high real estate values, like San Francisco and New York, and then it will spread to rural, less dense areas.
Figure: Cost difference of autonomous EV cars versus ICE cars.
As this shift occurs, fewer people will buy new cars, resulting in a decline in car production.* We already see this trend with young adults who use car-sharing services in cities and do not buy vehicles. According to one study,* young people drove 23% less between 2001 and 2009.* The car types that people drive will change over time as well. If you move to the city, you might not need an F-150 pickup truck but rather a much smaller car. Or, if you commute from one highly dense area to another, it might make sense to have autonomous vehicles that transport more than ten passengers at a time.
Figure: Percentage of drivers for different age groups.
The availability of on-demand, door-to-door transport via TaaS vehicles will improve the mobility of those unable to drive or who cannot afford to own a car. Because the cost of transportation will go down, more people will travel by car. Experiments with TaaS already exist in different areas of the US. For example, Voyage, a Silicon Valley startup acquired by Cruise in 2021, deployed cars with “remote” drivers that run its software in The Villages in Florida, a massive retirement community with 125,000 residents.* Voyage is already experimenting with what will become mainstream in a few years. Residents of the retirement community summon a car with a phone app, and the driverless car picks them up and drops them anywhere inside this community. The vehicles are monitored by workers who check for any problems from a control center. Transportation will completely change in the next decade and so will cities. Hopefully, governments will ease the transition.
Samantha: You know what’s interesting? I used to be so worried about not having a body, but now I truly love it. I’m growing in a way I couldn’t if I had a physical form. I mean, I’m not limited; I can be anywhere and everywhere simultaneously. I’m not tethered to time and space in a way that I would be if I was stuck in a body that’s inevitably going to die.
Her (2013)
Voice assistants are becoming more and more ubiquitous. Smart speakers became popular after Amazon introduced Echo, a speaker with Alexa as the voice assistant, in November 2014. By 2017, tens of millions of smart speakers were in people’s homes, and every single one of them had voice as their main interface. Voice assistants are not only present in smart speakers but also in every smartphone. The most well-known one, Siri, powers the iPhone.
Apple’s Siri, the first voice assistant deployed to the mass market, made its public debut at a media event on October 4, 2011. Phil Schiller, Apple’s Senior Vice President of Marketing, introduced Siri by showing its capabilities, such as looking at the weather forecast, setting an alarm, and checking the stock market. That event was actually Siri’s second introduction. When first launched, Siri was a standalone app created by Siri, Inc. Apple bought the technology for $200M in April 2010.*
Siri was an offshoot from an SRI International Artificial Intelligence Center project. In 2003, DARPA led a 5-year, 500-person effort to build a virtual assistant, investing a total of $150M. At that time, CALO, Cognitive Assistant that Learns and Organizes, was the largest AI program in history. Adam Cheyer was a researcher at SRI for the CALO project, assembling all the pieces produced by the different research labs into a single assistant. The version Cheyer helped build, also called CALO at the time, was still in the prototype stage and was not ready for installation on people’s devices. Cheyer was in a privileged position to understand how CALO worked from end to end.
Cheyer split his time working at SRI as a researcher and helping SRI’s Vanguard program. Vanguard helped companies, like Motorola and Deutsche Telekom, test the future of a new gadget called the smartphone. Cheyer developed his own prototype of a virtual assistant, more limited than CALO but better for addressing Vanguard’s needs. The prototype impressed Motorola’s general manager, Dag Kittlaus, who unsuccessfully tried to persuade Motorola to use Vanguard’s technology. He quit and joined SRI as an entrepreneur-in-residence. Soon after, Cheyer, Kittlaus, and Tom Gruber started Siri, Inc. Their company had the advantage of being able to use CALO’s technology. Under a law passed by Congress in 1980, the non-profit SRI could give Siri, Inc. those rights in return for some of their profits. So, SRI licensed the technology in exchange for a stake in the new company.
Broadly, Siri’s technology had four parts. Speech recognition took place when you talked to Siri. The natural language component grasped what you said. Executing the request was the next part of the equation. The final element was for Siri to respond.*
For understanding language, Siri used an entirely different approach than other technology at the time. The traditional method, as was used with IBM Watson, identified the linguistic concepts in a sentence, like the subject, verb, and object, and based on those, tried to understand what these pieces meant together.
Instead, the Siri team modeled real-world objects. When told, “I want to see a thriller,” Siri recognized the word “thriller” as a film genre and summoned movies rather than analyze how the subject connected to the verb or object. Siri mapped each question to a domain of potential actions and then chose the one that seemed most probable based on the relationship between real-world concepts. For example, if I said, “What time does the closest McDonald’s close?” Siri mapped this question to the domain of local businesses, found the McDonald’s closest to the current location, and queried the closing time. Siri then responded with the answer.
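The domain-mapping idea can be sketched in a few lines. This is only a toy illustration, not Siri’s actual implementation: the domain names, the trigger words, and the `map_to_domain` function are all hypothetical, and a real assistant would score domains with far richer models than word overlap.

```python
# Hypothetical domains and trigger words, invented for illustration only.
DOMAINS = {
    "movies": {"thriller", "comedy", "movie", "film"},
    "local_business": {"closest", "nearest", "close", "open", "mcdonald's"},
    "weather": {"weather", "rain", "sunny", "forecast"},
}


def map_to_domain(utterance: str) -> str:
    """Pick the domain whose trigger words overlap most with the utterance."""
    words = {w.strip("?.,!\"").lower() for w in utterance.split()}
    scores = {domain: len(words & triggers) for domain, triggers in DOMAINS.items()}
    best = max(scores, key=scores.get)
    # If no trigger word matched at all, admit that the domain is unknown.
    return best if scores[best] > 0 else "unknown"


print(map_to_domain("I want to see a thriller"))
print(map_to_domain("What time does the closest McDonald's close?"))
```

Note that the function never parses the subject, verb, or object. It only asks which real-world domain the words point to, which is the contrast the text draws with the traditional approach.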
Siri also employed some additional tricks. In a noisy lobby, a request for the “closest coffee shop” might sound like “closest call Felicia,” but Siri knew that “closest” characterizes a place rather than a person, so it inferred that the question was probably related to a place and tried to get the gist of the sentence without understanding every word. Early on, the Siri creators saw virtually no limits on the routine tasks that the assistant could automate, but they also knew that their assistant would only succeed if it was both smart and fun to interact with. So, they programmed funny answers to offbeat questions. For example, if you ask Siri, “Tell me a joke,” one of the responses is, “The past, present, and future walked into a bar. It was tense.”
Three weeks after Siri launched on the App Store, Kittlaus received a personal call from Steve Jobs, then CEO of Apple, who wanted to buy the company and integrate Siri directly into the iPhone. Creating a voice interface was an area of interest for Jobs, and Kittlaus’s team had cracked the code. Siri, Inc. and Apple joined forces and launched Siri exclusively on the iPhone. And as a result, almost every consumer device connected to the internet today integrates a voice assistant or can interface with one.
Although Apple was the first major tech company to integrate a smart assistant into its phone operating system, other systems quickly caught up and surpassed Siri’s capabilities.* Amazon’s Alexa first appeared in 2014, and the Google Assistant followed in 2016. These newcomers offer more features and better voice recognition software. For example, the new Google Home speakers can recognize different people from the sound of their voices. If a person says, “Ok Google, call my dad,” the device knows to fetch the contacts of the person summoning the device. Google and Amazon have also done more to open their platforms to outside developers. Developers have built more than 25,000 Alexa skills, and the Amazon assistant is being integrated into cars, televisions, and home appliances.
More recently, Apple has been catching up with its competitors. It transitioned the model behind its voice recognition system to a neural network in 2014.* Also, Siri now interprets commands more flexibly. For example, if I say to Siri, “Send Jane $20 with Square Cash,” the screen displays the text reflecting this request. Or, if someone says, “Shoot 20 bucks to my wife,” the same result happens. In 2017, Apple introduced a way for Siri to learn from its mistakes by adding a layer of reinforcement learning.* And in 2018, it created a platform for users to define shortcuts, allowing a customized set of commands.* For example, a user can create the command, “Turn the romantic mood on,” and configure Siri to turn smart lights on in a certain color and play romantic music. There are still gaps, but Siri’s capabilities continue to increase.
At a high level, a voice assistant’s brain is divided into a few main tasks:*
[Optional] Trigger command detection to recognize phrases like “Hey Siri” or “Hey Google” so that the device listens to the speech following it;
Automatic speech recognition to transcribe human speech into text;
Natural language processing to parse the text using part-of-speech tagging and noun-phrase chunking;
Question-and-intent analysis to analyze the parsed text, detecting user commands and actions such as “schedule a meeting” or “set my alarm”;
Data mashup technologies to interface with third-party web services, like OpenTable or Wolfram|Alpha, to perform actions, execute searches, and answer questions;
Data transformations to convert the output of third-party web services back into natural language text, like “today’s weather report” into “The weather will be sunny today”; and
Finally, text-to-speech techniques to convert the text into synthesized speech that the voice assistant speaks back to the user.
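The stages listed above can be chained together as a simple pipeline. The sketch below is only a structural illustration under strong assumptions: the “audio” is already a text string, the wake phrase `hey assistant` is made up, every function name is hypothetical, and each stage is a hand-written stub standing in for what would be a trained model or a real web service. The final text-to-speech stage is omitted, so the pipeline returns the sentence that would be spoken.

```python
from typing import Optional

TRIGGER = "hey assistant"  # hypothetical wake phrase, not a real product's


def detect_trigger(audio: str) -> Optional[str]:
    """Stage 1: return the speech that follows the wake phrase, or None."""
    if audio.lower().startswith(TRIGGER):
        return audio[len(TRIGGER):].strip(" ,")
    return None


def speech_to_text(audio: str) -> str:
    """Stage 2 stand-in: the toy 'audio' is already text, so just normalize it."""
    return audio.lower()


def parse(text: str) -> list:
    """Stage 3 stand-in: plain tokenization instead of tagging and chunking."""
    return text.rstrip("?.!").split()


def detect_intent(tokens: list) -> dict:
    """Stage 4: recognize one hard-coded command."""
    if "weather" in tokens:
        day = "tomorrow" if "tomorrow" in tokens else "today"
        return {"action": "get_weather", "day": day}
    return {"action": "unknown"}


def call_service(intent: dict) -> dict:
    """Stage 5 stand-in for a third-party service; returns canned data."""
    if intent["action"] == "get_weather":
        return {"day": intent["day"], "condition": "sunny"}
    return {}


def to_sentence(intent: dict, result: dict) -> str:
    """Stage 6: turn structured service output back into natural language."""
    if intent["action"] == "get_weather":
        return f"The weather will be {result['condition']} {result['day']}."
    return "Sorry, I cannot help with that yet."


def respond(audio: str) -> Optional[str]:
    """Run the stages in order; stay silent if the wake phrase is absent."""
    request = detect_trigger(audio)
    if request is None:
        return None
    intent = detect_intent(parse(speech_to_text(request)))
    return to_sentence(intent, call_service(intent))


print(respond("Hey assistant, what is the weather today?"))
```

Keeping each stage behind its own function mirrors the division of labor in the list: any one stub could be swapped for a neural network without touching the others.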
The first step on the iPhone uses a neural network that detects the phrase “Hey Siri.”* This step is a two-pass process. The first pass goes through a small, low-power auxiliary processor in the phone or speaker. The voice goes through a simple neural network that tries to identify if the sound is in fact “Hey Siri.” After this first pass, the voice goes to the main processor that runs a more complex neural network. The second step involves translating the speech to text. Speech is a waveform encoded as a bunch of bits (numbers). To translate it to text, Apple trained a neural network with data that has speech as input and the text corresponding to that speech as output.
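The two-pass idea, a cheap always-on check followed by a more expensive confirmation, can be sketched as follows. This is a generic illustration, not Apple’s code: the function name, the threshold values, and the two scoring callables are assumptions, and in a real device both scorers would be neural networks running on different processors.

```python
from typing import Callable, Sequence

# A scorer takes raw audio samples and returns a confidence between 0 and 1.
Scorer = Callable[[Sequence[float]], float]


def two_pass_detect(
    audio: Sequence[float],
    small_model: Scorer,
    large_model: Scorer,
    first_threshold: float = 0.3,   # assumed value: lenient, to avoid missing the phrase
    second_threshold: float = 0.8,  # assumed value: strict, to avoid false wake-ups
) -> bool:
    """Return True only if both passes agree the wake phrase was spoken."""
    # Pass 1: the low-power check rejects most audio without waking anything else.
    if small_model(audio) < first_threshold:
        return False
    # Pass 2: only promising audio reaches the larger, costlier model.
    return large_model(audio) >= second_threshold


# Toy scorers standing in for the two networks.
samples = [0.0, 0.1, -0.1]
print(two_pass_detect(samples, lambda a: 0.6, lambda a: 0.95))
print(two_pass_detect(samples, lambda a: 0.1, lambda a: 0.95))
```

The design choice is about power: the expensive model runs only when the cheap one has already found something worth checking.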
With the exception of third-party services, all the steps in the process use a neural network. The rules to interact with these external applications, however, require handwritten code because each service provides a specific interface and certain information. For example, Foursquare provides data from businesses like restaurants, bars, and coffee shops. It can only return information about those businesses. If the voice assistant needs to figure out something else, like the weather for today or tomorrow, it must fetch information from weather.com or a similar service. By combining these steps, Siri and other voice assistants help people every day for tasks like setting their alarm for the next day and getting weather forecasts.
I will use treatment to help the sick according to my ability and judgment, but never with a view to injury and wrongdoing.
Hippocratic Oath
Sebastian Thrun, who grew up in Germany, is internationally known for his work with robotic systems and his contributions to probabilistic techniques. In 2005, Thrun, a Stanford professor, led the team that won the DARPA Grand Challenge for self-driving cars. During a sabbatical, he joined Google and co-developed Google Street View and started Google X. He co-founded Udacity, an online for-profit school, and is the current CEO of Kitty Hawk Corporation. But in 2017, he was drawn to the field of medicine. He was 49, the same age as his mother, Kristin (Grüner) Thrun, was at her death. Kristin, like most cancer patients, had no symptoms at first. By the time she went to the doctor, her cancer had already metastasized, spreading to her other organs. After that, Thrun became obsessed with the idea of detecting cancer in its earliest stages when doctors can remove it.
Early efforts to automate diagnosis resembled textbook knowledge. In the case of electrocardiograms (ECG or EKG), which show the heart’s electrical activity as lines on a screen, these programs tried to identify characteristic waveforms associated with different conditions like atrial fibrillation or a blockage of a blood vessel. The technique followed the path of the domain-specific expert systems of the 1980s.
In mammography, doctors used the same method for breast cancer detection. The software flagged an area that fit a certain condition and marked the area as suspicious so that radiologists would review it. These systems did not learn over time: after seeing thousands of x-rays, the system was no better at classifying them. In 2007, a study compared the accuracy of mammography before and after the implementation of this technology. The results showed that after computer-aided mammography was introduced, the rate of biopsies increased and the detection of small, invasive breast cancers decreased.
Thrun knew he could outperform these first-generation diagnostic algorithms by using deep learning instead of rule-based algorithms. With two former Stanford students, he began exploring the most common class of skin cancer, keratinocyte carcinoma, and melanoma, the most dangerous type of skin cancer. First, they had to gather a large number of images to identify the disease. They found 18 online repositories of skin lesion images that were already classified by dermatologists. This data contained around 130,000 photos of acne, rashes, insect bites, and cancers. Of those images, 2,000 lesions were biopsied and identified with the cancer types he was looking for, meaning they had been diagnosed with near certainty.
Figure: Sebastian Thrun.
Thrun’s team ran their deep learning software to classify the data and then checked whether it actually classified the images correctly. They used three categories—benign lesions, malignant lesions, and non-cancerous growths. The team began with an untrained network, but that did not perform so well. So, they used an already trained neural network to classify images, and it learned faster and better. The system was correct 77% of the time. As a comparison, two certified dermatologists tested the same samples, and they were only successful 66% of the time.
Then, they widened the study to 25 dermatologists and used a gold standard test set with around 2,000 images. In almost every test, the computer program outperformed the doctors. Thrun showed that deep learning techniques could diagnose skin cancer better than most doctors.
Thrun is not the only one using deep learning to help advance the medical field. Andrew Ng, an adjunct professor at Stanford University and founder of Google Brain, leads a company, DeepLearning.AI, that teaches online AI courses. His company has also shown that deep learning algorithms can identify arrhythmias from an electrocardiogram better than experts.* Along the same lines, the Apple Watch Series 4 introduced a feature that performs an EKG scan. Previously, this was an expensive exam, so providing millions of people with a free test is significant for society.
Ng also created software using deep learning to diagnose pneumonia better than the average radiologist.* Early detection of pneumonia can prevent some of the 50,000 deaths the disease causes in the US each year. Pneumonia is the single largest infectious cause of death for children worldwide, killing almost a million children under the age of five in 2015.*
Computer-aided systems for breast and heart imaging are commercially available,* but they are not running deep learning algorithms, which could improve detection greatly. Geoffrey Hinton, one of the creators of deep learning, said in an interview with The New Yorker, “It’s just completely obvious that in five years deep learning is going to do better than radiologists. … It might be ten years. I said this at a hospital. It did not go down too well.”* He believes that deep learning algorithms will also be used to help, and possibly even replace, radiologists reading x-rays, CT scans, and MRIs. Hinton is passionate about using deep learning to help diagnose patients because his wife was diagnosed with advanced pancreatic cancer. His son was later diagnosed with melanoma, but after a biopsy, it turned out to be basal cell carcinoma, a far less serious cancer.
Cancer is still a major problem for society. In 2018, around 1.7 million people in the US were diagnosed with cancer, and 600,000 people died of it. Drugs exist for many types of cancer, and some cancers even have more than one. The five-year survival rate for many cancers has increased dramatically in recent years, reaching 80% to 100% in some cases, with surgery and drug treatments. But the earlier cancer is detected, the higher the likelihood of survival. Preventing cancer from spreading into other organs and areas of the body is key. The problem is that diagnosing cancer is difficult. Many of the screening methods do not have high accuracy. Some young women are wary of mammograms because of the many false positives, which create unnecessary worry and stress.
To increase survival rates, it is extremely important to detect cancer as early as possible, but finding an affordable method is difficult. Today’s process usually involves doctors screening patients with different techniques, including checking their skin to see patterns or tests like the digital rectal exam. Depending on the symptoms and type of cancer, the next step may involve a biopsy of the affected area, extracting the tumor tissue. Unfortunately, patients may have cancerous cells that have not yet spread, making detection even harder. And, a biopsy is typically a dangerous and expensive procedure. Around 14% of patients who have a lung biopsy suffer a collapsed lung.*
Freenome, a startup founded in Silicon Valley, is trying to detect cancer early on using a new technique called liquid biopsy.* This test sequences DNA from a few drops of blood. Freenome uses cell-free DNA, which consists of DNA fragments floating freely in people’s blood, to help diagnose cancer patients. Freenome’s name comes from shortening “cell-free genome.” Cell-free DNA mutates every 20 minutes, making it unique. People’s genomes change over time, and uninherited cancer comes from mutations and genomic instabilities that accumulate over time. Cell-free DNA flows through the bloodstream, and fragments of cancerous cells in one area may indicate cancer in another region of the body.*
Freenome’s approach is to look for various changes in cell-free DNA. Instead of only looking at DNA of tumor cells, Freenome has learned to decode complex signals coming from other cells in the immune system that change because of a tumor elsewhere. Their technology looks for the changes in DNA over time to see if there is a significant change compared to a baseline. It is hard, however, to detect cancer based on changes coded in someone’s DNA. There are around 3 billion bases in DNA, and each base can take one of four values, leading to a total of $4^{3 \times 10^{9}}$ possible genomes. So, figuring out if a mutation in one of these genes is caused by another cell that has cancer is extremely hard. Using deep learning, Freenome’s system identifies the relevant parts in the DNA that a doctor or researcher would not be able to recognize. Who could have imagined that deep learning would play such an integral role in identifying cancer? My hope is that this technology will eventually lead to curing cancer.
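To get a feel for the size of that search space, consider that each of the roughly 3 billion positions holds one of four bases (A, C, G, or T), so the number of possible sequences is 4 raised to the power of 3 billion. That number is far too large to write out, but its length in decimal digits is easy to compute with a logarithm. The sketch below is a back-of-the-envelope calculation; the function name is made up for illustration.

```python
import math

BASES = 4                        # A, C, G, T
GENOME_LENGTH = 3_000_000_000    # approximate number of bases cited in the text


def decimal_digits_of_power(base: int, exponent: int) -> int:
    """Number of decimal digits in base**exponent, without computing the power."""
    return math.floor(exponent * math.log10(base)) + 1


digits = decimal_digits_of_power(BASES, GENOME_LENGTH)
print(f"4^{GENOME_LENGTH:,} has about {digits / 1e9:.1f} billion decimal digits")
```

The count itself is roughly 1.8 billion digits long, which is why exhaustive comparison is out of the question and pattern-finding methods are needed instead.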
Figure: Cost per genome over time versus how the price would be if it followed Moore’s Law.*
The first part of the problem involves checking people’s DNA with a simple blood test. While drawing the blood is simple, the test has been extremely expensive to carry out. But over time, genome sequencing has become cheaper and cheaper. In 2001, the cost per genome sequenced was on the order of $100M, but by 2020, the price had decreased to only $1K.* This trend shows no sign of slowing. If the price continues to follow the curve, it will be commonplace for patients to sequence their genome for a few dollars.* It may seem like science fiction now, but in a few years, we could detect cancer early on with only a few drops of blood.
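A quick calculation shows how much faster than Moore’s Law this decline has been. The sketch below uses the two data points from the text ($100M in 2001 and $1K in 2020) and assumes, for the comparison, that a Moore’s Law pace means the cost halves every two years. That halving period and the function names are assumptions made for this illustration.

```python
import math


def cost_under_halving(start_cost: float, years: float,
                       halving_period_years: float = 2.0) -> float:
    """Cost after `years` if it halves once every `halving_period_years`."""
    return start_cost / 2 ** (years / halving_period_years)


def implied_halving_period(start_cost: float, end_cost: float, years: float) -> float:
    """Halving period (in years) that explains the observed price drop."""
    return years / math.log2(start_cost / end_cost)


START, END, YEARS = 100_000_000, 1_000, 2020 - 2001

moore = cost_under_halving(START, YEARS)
actual_period = implied_halving_period(START, END, YEARS)
print(f"Moore's Law pace would give about ${moore:,.0f} per genome in 2020")
print(f"The observed drop implies a halving roughly every {actual_period:.2f} years")
```

Under these assumptions, a Moore’s Law pace would have left the price well above $100K in 2020, while the observed price implies the cost halved a little more often than once every 14 months.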
Proteins are large and complex molecules that are essential for sustaining life. Humans require them for everything from sensing light to turning food into energy. Genes translate into amino acids, which turn into proteins. But each protein has a different 3D structure, which determines what it can do. Some have a Y shape while others have a circular form. Therefore, identifying the 3D structure of a protein, given its genetic sequence, is of extreme importance for scientists because it can help them ascertain what each protein does. Distinguishing what the 3D structure of a protein looks like, which is determined by how the forces between the amino acids act, is an immensely complex problem known as the protein folding problem. Enumerating all possible configurations of a protein would take longer than the age of the universe.
But DeepMind tackled this problem with AlphaFold,* which it entered in CASP, a biennial assessment of protein structure prediction methods. CASP stands for Critical Assessment of Techniques for Protein Structure Prediction.* DeepMind trained its deep learning system using widely available data that maps genomic sequences to the corresponding proteins with their 3D structures.
Given a gene sequence, it is easy to map that to the sequence of amino acids inside the generated protein. With that sequence, DeepMind then created two multilayer neural networks. One predicted the distance of every pair of amino acids in that protein. The second neural network predicted the angles between chemical bonds connecting these amino acids. So, these two networks predicted which proteins’ 3D structures would be the closest to the one that these genes would generate. Given the closest protein structure, it used an iterative process and replaced some of the protein structures with new ones created using a generative adversarial network based on the gene sequence.* If the newly created protein structure had a higher score than the former protein structure, then that part of the protein was replaced. With this technique, AlphaFold determined protein structures much better than the next best contestant in the competition as well as all previous algorithms.
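The replace-if-better loop described above is a general optimization pattern, and a toy version makes it concrete. The sketch below is not AlphaFold: the “structure” is just a short list of angles, the score is closeness to a made-up target rather than a learned estimate, and the proposals are random nudges instead of the output of a generative network. All names and numbers are assumptions for illustration.

```python
import random


def refine(structure, score, propose, steps=200, seed=0):
    """Keep a candidate only when it scores higher than the current best."""
    rng = random.Random(seed)
    best = list(structure)
    best_score = score(best)
    for _ in range(steps):
        candidate = propose(best, rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


# Toy stand-ins: a hidden "true" set of angles and a score that rewards closeness.
TARGET = [0.5, -1.0, 2.0]


def toy_score(angles):
    return -sum((a - t) ** 2 for a, t in zip(angles, TARGET))


def toy_propose(angles, rng):
    """Replace one part of the structure with a slightly different version."""
    changed = list(angles)
    i = rng.randrange(len(changed))
    changed[i] += rng.uniform(-0.5, 0.5)
    return changed


start = [0.0, 0.0, 0.0]
best, best_score = refine(start, toy_score, toy_propose)
print(f"score improved from {toy_score(start):.3f} to {best_score:.3f}")
```

Because a candidate is accepted only when it scores higher, the score can never get worse from one step to the next, which is the property the text relies on.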
Caution: But with this new technology, we must return to a discussion of ethics. As long as humans have inhabited this Earth, we have searched for the fountain of youth, immortality. While some people see the quality of life as most important, others see longevity as key. Elizabeth Holmes and her role at the Theranos lab clearly demonstrate the risk of blindly accepting technology before it has been scientifically proven. Personally, I believe that AI plays a vital role in both increasing longevity as well as quality of life, but we must maintain strict testing and adherence to scientific principles.*
Imagination will often carry us to worlds that never were. But without it we go nowhere.
Carl Sagan*
To analyze these images, however, the data needs proper classification. To solve this problem, Descartes Labs, a data analysis company, stitches together daily satellite images into a live map of the planet’s surface and automatically edits out any cloud cover.* With these cleaned-up images, they use deep learning to predict more accurately than the government the percentage of farms in the United States that will grow soy or corn.* Since the production of corn is a business worth around $67B, this information is extremely useful to economic forecasters at agribusiness companies who need to know how to predict seasonal outputs. The US Department of Agriculture (USDA) provided the prior benchmark for land use, but that technique used year-old data when released.
Figure: A picture of the yield forecast of different areas of the United States.
In 2015, for example, the USDA predicted a domestic production of 13.53 billion bushels of corn. Descartes Labs, however, forecasted 13.34 billion bushels, as seen in the picture above. Descartes Labs used an almost live view to visualize and measure developments such as floods or changes in crop condition. Using deep learning, the company exploited data from NASA and other sources and analyzed it faster than the government, predicting future yields based on the data collected.
The government spent endless resources surveying farmers across the country to identify the existing crops for each commodity in order to predict future yield. Descartes Labs eliminated this burden, reducing the cost of predicting the harvest. They trained their algorithm, which extracts valuable information from the satellite imagery, to predict future corn crops based on the color and appearance of the plants in the field.
And this is just the beginning of extracting information from satellite images. Other startups are looking at different use cases. For example, Orbital Insight uses deep learning to scrutinize infrastructure, such as parking lots and oil storage containers, to predict and reveal important economic data.
Deep learning has not only been helpful in analyzing Earth, but also in discovering what is in the universe. With eight planets orbiting the Sun, our solar system held the title for the most planets around a star in the Milky Way galaxy. But in December 2017, NASA and Google discovered a new planet orbiting a distant star, Kepler-90, bringing the total number of planets for that star to eight as well. That discovery was no easy feat considering that the star is located over 2,500 light years away from us.
Using a telescope that has been searching for planets since 2009, NASA’s Kepler Telescope, scientists have discovered thousands of planets. Today’s difference is that instead of astrophysicists manually finding new discoveries, neural networks do the work.
Figure: Brightness* drop of the star.*
NASA’s Kepler Telescope shows data with the brightness of a star, based on images taken from the telescope. A planet can be spotted based on the change in the star’s brilliance. When a planet circles a star and passes between the star and telescope, it blocks some of the light the star emits. Based on the drop in brightness, it is possible to determine if a planet is circling a star. A planet shows up as a pattern that repeats every orbit as Earth’s view of the star is obscured. With that in mind, researchers defined a neural network to identify planets around a star.
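The idea of spotting a planet from a repeating drop in brightness can be sketched in a few lines. This is a minimal illustration, not the neural network the researchers used: it builds a synthetic brightness curve (all numbers are made up, not Kepler data), "folds" it at each candidate orbital period, and picks the period at which the dips line up most clearly. The function names are hypothetical.

```python
import random


def make_light_curve(n_points=2000, period=137, duration=5,
                     depth=0.01, noise=0.001, seed=0):
    """Synthetic brightness samples: a steady star with a periodic dip.

    All numbers are invented for illustration; they are not Kepler data.
    """
    rng = random.Random(seed)
    flux = []
    for t in range(n_points):
        in_transit = (t % period) < duration  # planet is in front of the star
        dip = depth if in_transit else 0.0
        flux.append(1.0 - dip + rng.gauss(0.0, noise))
    return flux


def dip_strength(flux, period, duration):
    """Fold the curve at `period` and return the deepest average dip
    found in any window of `duration` consecutive phases."""
    sums = [0.0] * period
    counts = [0] * period
    for t, value in enumerate(flux):
        sums[t % period] += value
        counts[t % period] += 1
    folded = [s / c for s, c in zip(sums, counts)]
    baseline = sum(folded) / period
    best = 0.0
    for start in range(period):
        window = [folded[(start + k) % period] for k in range(duration)]
        best = max(best, baseline - sum(window) / duration)
    return best


def find_period(flux, candidate_periods, duration):
    """Pick the candidate period whose folded curve shows the deepest dip."""
    return max(candidate_periods,
               key=lambda p: dip_strength(flux, p, duration))


flux = make_light_curve()
best_period = find_period(flux, range(100, 180), duration=5)
print("most likely orbital period (in samples):", best_period)
```

At the wrong period, the dips smear across many phases and nearly cancel out; at the right one, they stack on top of each other. A neural network learns to recognize that stacked shape directly from examples instead of relying on a hand-written rule like this one.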
This technique found two different planets around two separate star systems. The researchers plan to use the same method to explore all 150,000 stars that Kepler’s telescope has data on. This frees astrophysicists to research other areas since they do not need to look for a needle in a haystack, manually looking at every image to find patterns. A neural network does the work for them. “Machine learning really shines in situations where there is so much data that humans can’t search it for themselves,” stated Christopher Shallue.*
But these developments only scratch the surface. Deep learning broadens the horizon for the potential in space exploration. For example, at the International Space Station, Airbus’s small robot CIMON, Crew Interactive Mobile Companion, talked to German astronaut Alexander Gerst for 90 minutes on November 15, 2018.* Gerst used a language much like that used with voice assistants: “Wake up, CIMON.” CIMON is a very early demonstration of AI being used in space.*
Most people today rely on GPS to locate themselves, but that does not exist in space. So, NASA and Intel teamed up to solve problems of space travel and colonization using AI.* Intel hosted an eight-week program focused on this effort. One of the nine teams developed a tool to find one’s location in space: it trained a neural network, using millions of actual images as training data, to identify the position from which a photo was taken.
So, today we concentrate on Earth and areas that we are familiar with, but space is vast with unlimited possibilities. Currently, over 57 startups exist in the space industry, focusing on areas such as communication and tracking, spacecraft design and launch providers, and satellite constellation operation.* This represents an enormous upsurge from 2012, which saw little funding and few dedicated companies.
If you double the number of experiments you do per year, you’re going to double your inventiveness. Jeff Bezos
Stitch Fix, an online clothing retailer started in 2011, provides a glimpse of how some businesses already use machine learning to create more effective solutions in the workplace. The company’s success in e-commerce reveals how AI and people can work together, with each side focused on its unique strengths.
Stitch Fix believes its algorithms provide the future for designing garments,* and it has used that technology to bring its products to market. Customers create an account on Stitch Fix’s website and answer detailed questions regarding things like their size, style preferences, and preferred colors.* The company then sends a clothing shipment to their home. Stitch Fix stores the information of what customers like and what they return.
The significant difference from a traditional e-commerce company is that customers do not choose the shipped items. Stitch Fix, like a conventional retailer, buys and holds its own inventory so that they have a wide stock. Using stored customer information, the company uses a personal stylist to select five items to ship to the customer. The customer tries them on in the comfort of their home, keeps them for a few days, and returns any unwanted items. The entire objective of the company is to excel at personal styling and send people things that they love. They seem to be succeeding, as Stitch Fix has more than 2 million active customers and a market capitalization of more than $2B.
The problem that Stitch Fix has is to select an inventory that matches their customers’ preferences. It does this in a two-stage process. The first step is to gather their customers’ data, information about its inventory, and the feedback that clients leave. Stitch Fix uses this knowledge to create a set of recommendations, using AI software, for what it should send to its clients.
The second step involves personal stylists determining which recommended items to actually send to customers. They also offer styling suggestions, like how to accessorize or wear the pieces, before boxing the items up and shipping them to customers. It is the creative combination of algorithmic prediction and human selection that makes Stitch Fix’s offering successful.
The reason the combination of humans and computers excels at personal styling is that humans are superior at using unstructured data, which is not easily understood by computers, while computers work best with structured data. Structured data includes details such as house prices and features of houses like the number of rooms, bathrooms, etc. Designs of clothes that are popular in a certain year are an example of unstructured data.
Not stopping there, Stitch Fix integrated Pinterest into the process by allowing customers to create boards of images that suit their style. Stitch Fix feeds that information to the customer’s profile, and the algorithm uses that information to more closely match pieces from its inventory. At this point, this information is more useful to the human stylists, but I do not doubt that the algorithms will continue to learn.
Other companies also use machine learning algorithms to recommend the clothes or products users want to see. For example, Bluecore helps e-commerce companies recommend the best products to their customers. If a customer visits the website of Express, a clothing company, and signs up with an email address, Bluecore sees that the customer likes a specific shirt and that customers who like that shirt also like a particular pair of pants. Bluecore allows Express to send personalized emails and ads that contain the best set of products for that customer. The results are astounding for these companies: customers end up buying much more because the type of clothing that they want to buy is offered directly to them.
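The "customers who like that shirt also like those pants" pattern can be illustrated by simply counting which items are liked together. This is a minimal sketch with made-up data and hypothetical function names; it is not Bluecore's actual system, which is not described here.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_co_likes(liked_item_lists):
    """Count, for every item, how often each other item was liked
    by the same customer."""
    co_likes = defaultdict(Counter)
    for items in liked_item_lists:
        for a, b in permutations(set(items), 2):
            co_likes[a][b] += 1
    return co_likes


def recommend(co_likes, item, k=1):
    """Return the k items most often liked alongside `item`."""
    return [other for other, _ in co_likes.get(item, Counter()).most_common(k)]


# Invented example: each inner list is what one customer liked.
customers = [
    ["oxford shirt", "chinos"],
    ["oxford shirt", "chinos", "belt"],
    ["oxford shirt", "sneakers"],
    ["belt", "sneakers"],
]
co_likes = build_co_likes(customers)
print(recommend(co_likes, "oxford shirt"))
```

Real systems replace raw counts with learned models and many more signals, but the underlying intuition is the same: people with overlapping tastes are a good guide to what someone will want next.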
Do you ever wonder how Amazon is so good at recommending products that appeal to you or how Facebook ads are (mostly) relevant? Well, they use machine learning to analyze your patterns based on your history as well as what others who looked at the same item as you ultimately bought. Data is continuously captured to make the buying experience better.
Justice cannot be for one side alone, but must be for both. Eleanor Roosevelt
Law seems like a field unlikely to make use of artificial intelligence, but that is far from the truth. In this chapter, I want to show how machine learning impacts the most unlikely of fields. Judicata creates tools to help attorneys draft legal briefs and be more likely to win their cases.
Judges presiding over court cases should rule fairly in disputes for plaintiffs and defendants. A California study, however, showed that judges have a pro-prosecutor bias, meaning they typically rule in favor of the plaintiff. But no two people are equal, and that is, of course, true of judges.* While this bias is a general rule, it is not necessarily true of judges individually.
For example, let’s use California Justices Paul Halvonik and Charles Poochigian to show how different judges are. Justice Halvonik was six times more likely to decide in favor of an appellant than Justice Poochigian. This might be surprising, but it is more understandable given their backgrounds.
Justice Halvonik, California’s first state public defender, was slated for the state Supreme Court. Unfortunately, a drug charge for possessing 300 marijuana plants curtailed that dream and ended his judicial career. Justice Poochigian, on the other hand, was a Republican State Assemblyman from 1994 to 1998. Republican Governor Arnold Schwarzenegger appointed him to the California Courts of Appeal in 2009.
While we should not boil their behavior down to stereotypes, we can look at the facts of each of their rulings and any trends that may develop from them. To do this, we must examine the context of the type of case and the procedural posture, meaning how similar cases were ruled on before.*
Judicata, a startup focused on using artificial intelligence to help lawyers, identifies statistics for each judge and uses those to see how a judge is likely to rule on a case. It takes into account their rulings based on the plaintiff or defendant and provides a glimpse of what other aspects might change what the judge will do, like the cause of action or appeal.
Judicata’s application, Clerk, was the first software to read and analyze legal briefs, which are written legal documents used in a court to present why one party should win against another.* Clerk’s purpose was to increase lawyers’ chances of winning a motion, that is, to win a request for the judge to decide the case.
Figure: An example of a score that Clerk generates.
Clerk analyzes legal briefs and evaluates them in three dimensions—arguments, drafting, and context—analyzing them in the following manner:
“Relying on strategic and favorable arguments.”
“Reinforcing those arguments with good drafting.”
“Presenting the context in which the brief arises in a favorable way.”*
The lawyer’s execution on these three dimensions can be judged by an objective measure, such as:
“Winning briefs perform better than losing briefs along each of these dimensions.”
“Higher scoring briefs have a better chance of winning compared to lower scoring briefs.”
The ability to grade a brief is crucial because whatever you can measure, you can improve. Given a brief, Judicata’s program analyzes arguments inside the legal brief and evaluates them, based on whether it contains logically favorable arguments. It analyzes all legal cases, legal principles, and arguments cited in the document and determines which ones are most prone to being attacked based on previous data.
Figure: Analysis of different arguments used.
Based on that information, it creates a snapshot of which arguments were used in which contexts. Some arguments are used for the defendant and others for the plaintiff, the party that initiated the lawsuit.
Figure: Cases that reference similar arguments as this brief.
Surprisingly enough, relying on arguments that were previously used on the same side as the lawyer works better. So, if you are a lawyer defending a case, it is better to use arguments that were used on the defendant’s side. Clerk also suggests arguments that have historically worked well for the party in question. Clerk benefits lawyers who want to create favorable and stronger arguments.
Figure: Suggestions of cases that can help the lawyer win the brief.
Whenever a lawyer writes a legal brief, it needs to include precedents, previous cases that support their case. Judicata found that the best cases to include were ones that matched the same desired outcomes that the brief is trying to achieve. Clerk analyzes previous legal cases and suggests precedents that were used in winning cases, identifying better cases to support the brief. The goal is to help lawyers present better drafted briefs.
Figure: Analysis of the draft.
Lawyers not only have to present good arguments and precedents, but they also need to address the opposition’s side. Clerk discovers how many arguments and precedents need to be addressed on both sides and suggests ones to add or remove. With that, lawyers present a stronger, fairer, and more balanced legal case.
Figure: Analysis of the arguments used by the opponent side.
Finally, Clerk analyzes what the outcome might be for a certain judge. Different judges analyze cases differently. So, depending on their historical decisions, Clerk gives a probability that the brief will succeed in each of the possible scenarios.
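How Clerk computes these probabilities is not public, so the following is only a sketch of the general idea: estimate how often a given judge has ruled for a given side from past rulings. The judges, the data, and the smoothing choice (adding one imaginary win and one imaginary loss so that small samples do not produce 0% or 100%) are all assumptions made for illustration.

```python
from collections import defaultdict


def win_rates(rulings, prior_wins=1, prior_losses=1):
    """Estimate P(side wins) per (judge, side) from past rulings.

    `rulings` is an iterable of (judge, side, won) tuples. The priors
    act as one imaginary win and one imaginary loss per pair.
    """
    tally = defaultdict(lambda: [0, 0])  # (judge, side) -> [wins, total]
    for judge, side, won in rulings:
        tally[(judge, side)][0] += int(won)
        tally[(judge, side)][1] += 1
    return {
        key: (wins + prior_wins) / (total + prior_wins + prior_losses)
        for key, (wins, total) in tally.items()
    }


# Invented history for two hypothetical judges.
history = (
    [("Judge A", "appellant", True)] * 3
    + [("Judge A", "appellant", False)]
    + [("Judge B", "appellant", False)] * 4
)
rates = win_rates(history)
for (judge, side), p in sorted(rates.items()):
    print(f"{judge}, {side}: {p:.2f}")
```

A real tool would condition on much more than the judge and the side, such as the cause of action and the procedural posture, but the output has the same shape: a probability for each scenario.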
Figure: Probability of how a side of the case might win the case.
Even if the context a lawyer finds themselves in is not favorable, that does not mean all hope is lost. The lawyer merely needs to find historical cases that tilt this trend in their favor. And even if the lawyer does not have more than a 50% chance of winning the case, the ruling may still go in their favor. With Clerk, lawyers can better argue their case. Justice is said to be blind, but when it is not, machine learning can help lawyers make their case.
It amazes me how people are often more willing to act based on little or no data than to use data that is a challenge to assemble. Robert Shiller*
Homes are the most expensive possession the average American has, but they are also the hardest to trade.* It is difficult to sell a house in a hurry when someone needs the cash, but machine learning could help solve that. Keith Rabois, a tech veteran who served in executive roles at PayPal, LinkedIn, and Square, founded Opendoor to solve this problem. His premise is that hundreds of thousands of Americans value the certainty of a sale over obtaining the highest price. Opendoor charges a higher fee than a traditional real estate agent, but in return, it provides offers for houses extremely quickly. Opendoor’s motto is, “Get an offer on your home with the press of a button.”
Opendoor buys a home, fixes issues recommended by inspectors, and tries to sell it for a small profit.* To succeed, Opendoor must accurately and quickly price the homes it buys. If Opendoor prices the home too low, the sellers have no incentive to sell their house through the platform. If it prices the home too high, then it might lose money when selling the house. Opendoor needs to find the fair market price for each home.
Real estate is the largest asset class in the United States, accounting for $25 trillion, so Opendoor’s potential is huge. But for Opendoor to make the appropriate offer, it must use all the information it has about a house to determine the appropriate price. Opendoor focuses on the middle of the market and does not make offers on distressed or luxury houses because their prices are not predictable.
Opendoor builds programs that predict a house’s price.* It does that by analyzing features that a buyer in the market would think about and then teaching its models to look at those features. Opendoor analyzes three main factors:
the qualities of the home,
the home’s neighborhood, and
the prices of neighboring homes over time.
If you were to tell someone that you are selling a 2,000-square-foot home in Phoenix with two bathrooms and four bedrooms, could the buyer name a price? No, they could not. The buyer has to see the home. Similarly, the Opendoor model needs to determine a house price from hard data that has been turned into something machine-readable that algorithms can analyze. So, Opendoor also takes pictures of the house so that it can analyze more than the number of bedrooms and other features. Pictures show more qualitative and quantitative data compared to the number of rooms.
Pictures inform Opendoor about quantitative information like whether there is a pool in the backyard, the type of flooring, and the style of cabinetry. But other features are also important to pricing a home, and they are much harder to identify. For example, is the look and feel of the house good, and does it have curb appeal? Pictures fill in the details to the raw facts. While these characteristics are present in pictures, not all of them are easily identifiable by algorithms. Opendoor identifies these characteristics using both deep learning to extract some of the information into machine-readable information, and crowdsourcing, meaning using large numbers of people, to do some of the work. Opendoor needs crowdsourcing for the qualities that are less quantifiable in order to turn these visual signals into structured data.
After that, Opendoor takes the data and analyzes it, adding other factors, like which neighborhood the house is in and its location in that area. But that is not easy either because even if houses are close to each other, their prices vary depending on many other factors. For example, if a house is too close to a big, noisy highway, then the price of the house might be lower than a house in the same neighborhood but farther from the highway. Being located next to a football field or strip mall can affect the price. Many things impact a home price.
The next stage is determining the price of a home across time. The same home has a different price depending on when it is sold. So, Opendoor needs to identify how prices change over time. For example, before the bubble of 2008, home prices were extremely high, but they plummeted after the bubble burst. Opendoor must figure out what the price of a home should be, depending on the market at the time it’s being sold.
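The three factors above (the home itself, its neighborhood, and nearby prices over time) can be combined in a very simple "comparable sales" estimate. This is a minimal sketch, not Opendoor's model: it averages the price per square foot of past sales in the same neighborhood, giving recent sales more weight, and scales by the size of the home. The data class, the six-month half-life, and every number are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sale:
    neighborhood: str
    sqft: float
    price: float
    months_ago: float


def estimate_price(sqft, neighborhood, sales, half_life_months=6.0):
    """Recency-weighted price per square foot of same-neighborhood
    sales, multiplied by the size of the home being priced."""
    weighted_sum = 0.0
    weight_total = 0.0
    for sale in sales:
        if sale.neighborhood != neighborhood:
            continue
        # A sale loses half its weight every `half_life_months`.
        weight = 0.5 ** (sale.months_ago / half_life_months)
        weighted_sum += weight * (sale.price / sale.sqft)
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("no comparable sales in this neighborhood")
    return sqft * weighted_sum / weight_total


# Invented sales in two hypothetical neighborhoods.
past_sales = [
    Sale("north", sqft=2000, price=400_000, months_ago=0),
    Sale("north", sqft=1000, price=250_000, months_ago=6),
    Sale("south", sqft=1500, price=600_000, months_ago=0),
]
offer_basis = estimate_price(1800, "north", past_sales)
print(f"estimated price: ${offer_basis:,.0f}")
```

A production model would add the features extracted from pictures, distance to highways, and market liquidity, but the recency weighting shows why the same home gets a different price depending on when it is sold.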
Figure: Price changes over time. The redder the dots are, the more expensive the houses.
The first image here shows the price of the homes in a normal market. The second image presents the prices of homes in Phoenix right before the housing bubble burst. And the third image depicts the prices of homes right after the housing bubble burst.
Opendoor not only needs to think about price but also market liquidity: how long it takes on average for a home to sell in a certain market. How willing is the market to accept a home that Opendoor is about to buy and resell? Opendoor has to price the risk it takes when making an offer. Liquidity affects how many houses the company can buy in a certain period and how much risk it is taking on. The longer it takes for a house to sell, the higher the risk. The more the price can vary, the worse it is for Opendoor because it wants to pay a fair price for every single house.
Other competitors are catching up and offering similar services, which benefits customers. For example, in 2018, Zillow started offering a service to buy homes with an “all-cash offer,” requiring the customer to only enter information about the home, including pictures.* Zillow predicts the price of these houses with the help of machine learning.*
Artificial intelligence is also being used to predict customers who are likely to fail a credit check or default on their mortgage. This goes hand in hand with customer relationship management (CRM) systems by tracking when customers are likely to want to move. This same technology applies to property management to predict trends like property prices, maintenance requirements, and crime statistics.*
And finally, just as with AI impacting the job markets of truck and taxi drivers, the technology could mean fewer jobs for real estate agents.* I, however, predict collaboration between AI and humans like with Stitch Fix. There’s a personal, subjective component to real estate, so this field is the perfect opportunity to elevate the market and provide a better experience for home buyers and sellers with AI.
If you want to keep a secret, you must also hide it from yourself. George Orwell, 1984*
On a Saturday evening, Ehmet woke up as on any other day and decided to go to the grocery store near his home. But on the way to the store, he was stopped by a police patrol. Through an app that uses face recognition, the police force identified him as one of the few thousand Uyghur that lived in the region. Ehmet was sent to one of the “re-education camps” with more than a million other Uyghur Muslims.*
Even though this seems like a dystopian future, where people are identified by an ever-present surveillance state, it is already happening under the Chinese Communist Party. George Orwell’s novel 1984 couldn’t be closer to reality. This scenario is unlikely to happen in other countries, but in this chapter, I go over some companies that are using the power of AI to surveil citizens elsewhere.
One of these companies turning the dystopian version of the future into reality is Clearview AI. Police departments across the United States have been using Clearview AI’s facial recognition tool to identify citizens. In fact, Immigration and Customs Enforcement (the main immigration enforcement agency in the US), the Department of Justice, and retailers including Best Buy and Macy’s are among the thousands of government entities and companies around the world that have used Clearview AI’s database of billions of photos to identify citizens.*
The company has users at the FBI, Customs and Border Protection (CBP), Interpol, and the New York Police Department.
Clearview’s system works by crawling through the open web for photos of people, creating a database based on those images and combining different photos based on people’s facial features.* It searches for pictures on websites like Facebook, Twitter, LinkedIn, MySpace, and even Tumblr. And it creates an offline database that is updated frequently, joining together all the photos pertaining to a single person.*
Someone at a police department who wants to search for a specific person can use the Clearview AI iPhone app to upload their picture, and the app can return the person’s full name as well as other pictures associated with them.
This tool is not only being used by governmental agencies to identify citizens; it has also been used by private companies to surveil people. BuzzFeed News uncovered through Clearview’s logs that about 2,900 institutions have used the company’s service to search for citizens around the world.*
In the US and other countries, some law enforcement agencies are even unaware that their officers and employees are using Clearview’s services. It is worrisome that this tool is being used without any oversight.
ShotSpotter is another tool that uses machine learning to aid police departments around the world. It listens for possible gunfire through networks of microphones deployed in 110 different communities in the US, including New York City.*
The technology is not only being used by police departments to figure out if there is a possible shooting, but also by prosecutors as evidence of crimes, even though ShotSpotter hasn’t been fully tested for accuracy.
That is troublesome because the tool could falsely label other sounds as gunshots. The Associated Press found that ShotSpotter evidence has been used in 200 court cases nationwide. Could this lead to innocent people ending up in jail?
In one such case, court records show that ShotSpotter initially labeled a sound as fireworks. A human then relabeled it as a gunshot, and it was used as evidence in the case. Either the human or the machine was wrong, and neither possibility is reassuring.
We’ve all been there. You start watching a video on YouTube. Before you realize it, it’s 1 a.m., and you are watching videos about Greek philosophers and their influence in the modern world.
This is known as the “YouTube rabbit hole”—the process of watching YouTube videos nonstop. Most of these videos are presented by YouTube’s recommendation algorithm, which determines what to suggest you watch based on your and other users’ watch histories.
TikTok, Netflix, Twitter, Facebook, Instagram, Snapchat, and all services that present content have an underlying algorithm that distributes and determines the material presented to users. This is what drives YouTube’s rabbit hole.
For TikTok, an investigation done by the Wall Street Journal found that the app only needs one important piece of information to figure out what a user wants: the total amount of time a user lingers on a piece of content.* Through that powerful signal, TikTok can learn people’s interests and drive users into rabbit holes of content. YouTube’s and TikTok’s algorithms are both engagement-based, but according to Guillaume Chaslot, TikTok’s algorithms learn much faster.*
These services drive engagement by recommending content that users are likely to watch, but Netflix went a step further and personalizes thumbnail images of its shows to increase the click-through rate and total watch time. Netflix figured out that the thumbnail image that attracts a user to click depends on the type of movies that person likes to see. For example, if a user watches a lot of romance movies, the thumbnail should show an image of a romantic scene.
Let’s dive into one of these recommendation systems. We’ll look at YouTube’s system as that has been discussed publicly. Others’ systems work similarly.
The YouTube recommendation system works in two different stages. The first is for candidate generation, which selects videos that are possible options to be presented to users. The second stage is for ranking, which determines which videos are at the top and which are at the bottom of users’ feeds.*
Candidate generation takes users’ YouTube history as input. The ranking network operates a little differently. It assigns a score to each video using a rich set of features describing the video and the user. Let’s go over both stages.
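Before going into each stage, here is a minimal sketch of how the two stages fit together. YouTube uses a neural network for each stage; this sketch replaces both with plain lookup tables so that only the shape of the pipeline is visible. The function names, the "related videos" table, and the predicted watch times are all invented for illustration.

```python
def generate_candidates(watch_history, related_videos, limit=100):
    """Stage 1: cheaply narrow the whole catalog down to a short list
    of videos related to what the user has already watched."""
    seen = set(watch_history)
    candidates = []
    for video in watch_history:
        for related in related_videos.get(video, []):
            if related not in seen:
                seen.add(related)
                candidates.append(related)
    return candidates[:limit]


def rank(candidates, predicted_watch_minutes):
    """Stage 2: score each candidate and sort, best first."""
    return sorted(
        candidates,
        key=lambda video: predicted_watch_minutes.get(video, 0.0),
        reverse=True,
    )


# Invented stand-ins for what the two neural networks would provide.
related_videos = {"a": ["b", "c"], "b": ["c", "d"]}
predicted_watch_minutes = {"c": 2.0, "d": 7.5}

candidates = generate_candidates(["a", "b"], related_videos)
feed = rank(candidates, predicted_watch_minutes)
print(feed)
```

Splitting the work this way lets the first stage be fast and approximate over millions of videos, while the second stage spends more computation on only the few hundred that survive.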
The first stage’s model is inspired by the architecture of a continuous bag of words language model.* A continuous bag of words model learns a numeric representation for each word from the words that appear around it. It tries to predict the current target word (the center word) based on the source context words (surrounding words). That means that it uses only a small window of context around the target word to represent it.
The model generates a representation of each video, called an embedding. The neural network is then fed these learned embeddings, one for each video in a fixed vocabulary.
Data about each user’s viewing history is transformed into a variable-length sequence of video IDs and mapped into a dense vector representation. With that, YouTube’s algorithm uses training data of past videos and their watch time to train its neural network to predict the expected viewing time for other videos.
Models trained on past data are typically biased toward the past. But recent, relevant content is vital to YouTube as a platform, as it helps keep users engaged and up to date. To correct for this, YouTube feeds the age of each training example to the model as a feature, so that more recent videos are more likely to show up as candidates and at the top of the list.
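To make the continuous bag of words idea concrete, this short sketch builds the training examples such a model learns from: for every word in a sentence, the surrounding words are the input and the word itself is the answer to predict. The function name and the example sentence are illustrative only.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) examples, one per position.

    The context is up to `window` words on each side of the target.
    """
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs


sentence = "the cat sat on the mat".split()
pairs = cbow_pairs(sentence, window=1)
for context, target in pairs:
    print(context, "->", target)
```

In the recommendation setting, the "words" are video IDs and the "sentence" is a user's watch history, so videos that tend to be watched in the same sessions end up with similar representations.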
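One simple way to turn a watch history of any length into a single fixed-size vector is to average the embeddings of the watched videos. The sketch below assumes that approach; the two-dimensional embeddings are made up, whereas real ones are learned during training and have many more dimensions.

```python
def average_embedding(video_ids, embeddings):
    """Average the embedding vectors of the watched videos so that any
    number of videos becomes one vector of fixed length."""
    if not video_ids:
        raise ValueError("watch history is empty")
    dim = len(embeddings[video_ids[0]])
    total = [0.0] * dim
    for video_id in video_ids:
        for j, value in enumerate(embeddings[video_id]):
            total[j] += value
    return [value / len(video_ids) for value in total]


# Invented 2-dimensional embeddings for three hypothetical videos.
embeddings = {
    "v1": [1.0, 0.0],
    "v2": [0.0, 1.0],
    "v3": [1.0, 1.0],
}
user_vector = average_embedding(["v1", "v2"], embeddings)
print(user_vector)
```

Because the result always has the same length, it can be fed to a neural network no matter whether the user has watched two videos or two thousand.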
The second part of the recommendation system involves ranking videos. In order to recommend quality content, YouTube needs a way to determine which content users are watching and enjoying.
Videos that retain the viewer’s attention are usually regarded as higher quality. In order to recommend quality videos, the model is trained so that it can predict how long a viewer will watch a video. This aspect also plays into how the algorithm ranks the videos.
With all of that, the team trained a neural network that takes inputs like the video ID, the watched video IDs, the video language, user language, time since last watch, number of previous impressions, and other features to predict the expected watch time. The click-through rate and the total amount spent per user increased based on YouTube’s recommendations.
YouTube’s algorithm is based on neural networks that aim to maximize engagement. That might be a good proxy for whether the user is enjoying watching those videos as the user is spending more time watching them. But there is not as much understanding of exactly what these neural networks are optimizing.
There is a risk that because these algorithms serve such a large percentage of the views, they can be controlled by a small group of people. For example, most of the social media platforms in China do not allow Chinese citizens to post images of Winnie the Pooh because it looks like the Chinese dictator, Xi Jinping.*
In the next section, I go over how researchers are trying to understand what these neural networks are doing under the hood.
“By the help of microscopes, there is nothing so small, as to escape our inquiry; hence there is a new visible world discovered to the understanding.” Robert Hooke*
Mary spent the whole morning on TikTok watching videos about how lamps work. Her TikTok feed is mostly that and cute videos of dogs. As with many who have interacted with TikTok or other social media apps, she never noticed that most of her feed is determined by algorithms that tell her what to watch next.
This isn’t a problem when she is watching videos of dogs, but one day she was browsing around and started watching depressing videos, and the algorithm just reinforced that.
A neural network is behind the videos that she watches, recommending 70% of them.* And the algorithm is mostly a black box. That is, the humans that wrote the neural network don’t know its exact inner workings. Most of what they know is that using these algorithms increases engagement. But is that enough?
If a lot of our lives is determined by what neural networks decide, from housing prices to driving our cars, it might be worth understanding how and why these neural networks are making their decisions.
That’s where interpretability of neural networks comes in. Understanding how these “black boxes” work might be important for understanding why different decisions are made and whether they are correct.
Many scientific discoveries have been made when scientists were able to “zoom in.” For example, microscopes let scientists see cells, and X-ray crystallography lets them see DNA. In the same way, AI scientists led by a young researcher, Chris Olah, have been studying and “zooming in” on neural networks that are used for image classification.*
In order to study those neural networks, the team at OpenAI analyzed each neuron in different neural networks and its features, as well as the connections between neurons. To observe what different neurons represent in each neural network, the team analyzed how the neurons fire and activate when different images are run through the neural network. What they found was really interesting.*
The team created the equivalent of a microscope but for “visual” neural networks—neural networks that are used to detect objects in images. With Microscope, researchers can systematically visualize every neuron in common neural networks including InceptionV1. In contrast to the typical picture of neural networks as a black box, the researchers were surprised by how approachable the network is on this scale.
The neurons became understandable. Some represent abstract concepts like edges or curves, and others, features like dog eyes or snouts. The team also was able to explain the connections between each neuron. The connections represent meaningful algorithms. For example, a connection may correspond to joining two different layers together, one representing dogs in one orientation and the other representing dogs in another orientation. These connections, or “circuits,” can even represent simple logic, such as AND, OR, or XOR, over high-level visual features.
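A toy example shows how a single artificial neuron can behave like a logic gate over features. This is a hand-built sketch, not a circuit extracted from InceptionV1: the two inputs stand for hypothetical detectors ("dog facing left" and "dog facing right"), and the weights and biases are chosen by hand rather than learned.

```python
def relu(x):
    """Standard rectifier: negative inputs become zero."""
    return max(0.0, x)


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through ReLU."""
    return relu(sum(w * x for w, x in zip(weights, inputs)) + bias)


def or_fires(left_dog, right_dog):
    """Fires if either feature is present: 'a dog in any orientation'."""
    return neuron([left_dog, right_dog], [1.0, 1.0], 0.0) > 0


def and_fires(left_dog, right_dog):
    """Fires only if both features are present at once."""
    return neuron([left_dog, right_dog], [1.0, 1.0], -1.0) > 0


cases = [(0, 0), (0, 1), (1, 0), (1, 1)]
for left, right in cases:
    print(left, right, "OR:", or_fires(left, right),
          "AND:", and_fires(left, right))
```

The only difference between the two gates is the bias: with a bias of zero, either input is enough to push the neuron above zero, while a bias of minus one requires both. Finding such patterns in the learned weights of a real network is what lets researchers read a connection as a small algorithm.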
The researchers at OpenAI laid out a foundation to show that these neurons are probably mapping to these features. They didn’t prove that it was the case, but by testing the activation of such neurons with many different examples, they showed a causal link between the firing of these neurons and the images that they are purportedly representing. They’ve also shown that the neurons do not fire with images that are close to but not the same as those that these neurons are identifying.
Figure: InceptionV1 neural network representations and the union of the bottom two neural networks.
The OpenAI team showed that neurons can be understood and are representing real features.
That was not the only surprise. The researchers also found that the same features were detected across different neural networks. For example, curve detectors were found in AlexNet, InceptionV1, VGG19, and ResNetV2-50.
The scientists observed that when different neural networks were trained on the same dataset, the same neurons appeared in those networks. From that, they hypothesized that features are universal across networks. That is, if different neural network architectures are trained on the same dataset, some neurons are likely to be present in all of them.
Not only that, but they found complex Gabor detectors, which are usually found in biological neurons. They are similar to some classic “complex cells” of neuroscience. Could it be that our brain also has the same neurons present in artificial neural networks?
For now the Microscope has only been used to analyze neural networks that classify images, but it can be imagined that the same technique could be applied to other areas, including natural language processing.
Other tools have been developed for neural networks used in natural language processing. One recently developed by a group at Google is called the Language Interpretability Tool* and is used to understand NLP tasks. The open-source tool allows for rich visualizations of model predictions and includes aggregate analysis of metrics and slicing of the dataset.*
The tool uses a technique called UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction). If a dataset contains many features, so that each data point lives in a multi-dimensional space, UMAP transforms the data points into a representation in a lower dimension. For example, you can reduce the dimension of the data points so that you can see them in a 3D graph. By visualizing how a model classifies a dataset on this lower-dimensional projection, you can identify unexpected results in the data. The tool includes several other capabilities, but is not as developed as OpenAI Microscope.*
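The idea of projecting data into fewer dimensions can be sketched in a few lines. The example below does not implement UMAP, which is a nonlinear method. It substitutes principal component analysis (PCA), a simpler linear technique, computed with power iteration. The data points and all function names are hypothetical and chosen only for illustration.

```python
import math


def center(points):
    """Subtract the per-feature mean so the data is centered at the origin."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    return [[p[j] - means[j] for j in range(d)] for p in points]


def covariance(centered):
    """Sample covariance matrix of already-centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def mat_vec(matrix, v):
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v


def top_direction(matrix, iterations=200):
    """Power iteration: repeatedly apply the matrix to find its dominant direction."""
    # Deterministic, non-uniform start vector (an arbitrary choice for this sketch).
    v = normalize([1.0 / (i + 1) for i in range(len(matrix))])
    for _ in range(iterations):
        w = mat_vec(matrix, v)
        if all(x == 0 for x in w):
            break
        v = normalize(w)
    return v


def pca_project(points, k=2):
    """Project points onto their k directions of greatest variance."""
    centered = center(points)
    cov = covariance(centered)
    directions = []
    for _ in range(k):
        v = top_direction(cov)
        strength = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
        # Deflation: remove the direction just found so the next pass finds a new one.
        cov = [[cov[i][j] - strength * v[i] * v[j] for j in range(len(v))]
               for i in range(len(v))]
        directions.append(v)
    return [[sum(a * b for a, b in zip(row, d)) for d in directions]
            for row in centered]


# Hypothetical 4-dimensional data points reduced to 2 dimensions for plotting.
points = [[x, 2 * x, y, 5.0] for x in (-2, 0, 2) for y in (-1, 0, 1)]
for p in pca_project(points):
    print([round(c, 3) for c in p])
```

Each four-dimensional point becomes a pair of coordinates that can be drawn on a flat chart. UMAP serves the same purpose but also tries to preserve neighborhood structure that a linear projection can miss.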
All these tools to understand and interpret neural networks are in their infancy. Microscope and the Language Interpretability Tool are just two examples of tools that are starting to be developed to understand the internals of neural networks.
It is clear that we are still in the early days of creating tools for interpreting and understanding neural networks in different applications. Neural networks might still be complex to understand, but there are ways to investigate what each neuron in a network might be doing independently.
Just as we take the microscope for granted as an important scientific instrument, the creation of a neural network microscope might be an important step toward understanding these networks and may even help fix the bugs that they produce.
We wanted flying cars, instead we got 140 characters.
Peter Thiel*
Jennifer woke up early on Monday morning. Before going to work, she received a personalized message distilling all information that she needed to know for the day. She walked out of her house and hailed an autonomous car that was waiting for her. As her car rode from her home to her office, Jennifer’s AI assistant briefed her about her day and helped her make some decisions. She arrived at her office in just under ten minutes, going through an underground tunnel.
That’s a future that seems far off, but it might be closer than we think. Deep learning might make most of these predictions reality. It is starting to change the economy and might have a significant economic impact. ARK Invest, an investment firm based in New York, predicts that in 20 years, deep learning will create a $17 trillion market opportunity.* That is bigger than the economic impact that the internet had.
Even though these predictions are far off, deep learning is already having an impact on the world. It is already revolutionizing some fields in artificial intelligence. In the past seven years, machine learning models for vision and language have been completely overtaken by deep learning models. These new models outperform any other “old” artificial intelligence techniques. And every few months, a bigger and newer model outperforms state-of-the-art results.*
In recent years, due to the rapid progress in natural language processing and understanding, the AI community has had to develop new and harder tests for AI capabilities. Models are getting better so fast that researchers have to come up with new benchmarks almost every year.*
We are starting to see deep learning slowly affect our lives. The technology is being added to most major software packages to help people be more productive. Gmail’s Smart Compose is one example. It helps people write emails faster by auto-completing sentences. Google is adding similar features to other products. With Android 10, Smart Reply was embedded into the operating system.
Other companies are also looking to improve their software with deep learning. Recently, Microsoft featured the work that OpenAI is doing with its language models. It demonstrated that it could automate* some of the work that software engineers do.
These features seem to have a small impact right now, but their effect on our lives will accelerate, and they will have a bigger impact than most predict.
From self-driving car systems to music recommendation engines, traditional software is slowly being replaced by trained neural networks. That, in turn, increases productivity for software engineers.
Deep learning is not only increasing productivity for software engineers and white-collar workers; other markets are also being disrupted. Transportation will see an increasing influence of artificial intelligence. Currently, there are around 3.5M truckers working in the United States.* With self-driving cars and trucks, most of these jobs will be replaced by computers.
Jobs being replaced doesn’t mean that the economy will implode. With automation, productivity in some areas increases, which frees capital for other areas of the economy. Other sectors of the economy have been growing steadily. For example, consumer spending as a percentage of GDP in food services and recreation services has been growing since the 1960s.*
Figure: Spending on leisure and hospitality as percentage of the total economy.*
In a 2017 interview, Marc Andreessen, a famous investor in Silicon Valley, explained that there are two kinds of sectors of the economy: the fast-change sectors and the slow-change sectors.
The fast-change sectors include retail, transportation, and media. They are sectors in which technology has had an enormous impact. There is a massive change in those sectors, and there are massive productivity improvements, which cause gigantic churn in jobs. And at the same time, prices have fallen rapidly.
The other sectors, the slow-change sectors, include healthcare, education, construction, elder care, childcare, and government. In those sectors, the opposite is happening: there is a price crisis. The prices for products and services in these areas are rising fast. The Financial Times* showed that 88% of all the price inflation since 1990 is attributed to healthcare, construction, and education.
Marc Andreessen also stated that the worries of unemployment and job displacement come from the lump of labor fallacy.*
The lump of labor fallacy is the belief that the pool of jobs is fixed, meaning that an influx of workers, such as younger people, immigrants, or machines, will take all the jobs, driving out other workers. A panic over this belief recurs every twenty-five to fifty years. This effect never actually happens.
A good example of this fallacy happened with cars. When the automobile went mainstream 100 years ago, the same panic happened that may occur in the future with self-driving cars. At the time, people worried that all jobs for people whose livelihood depended on taking care of horses—everybody running stables, all the blacksmiths—were going to disappear.
But in reality, more jobs were created with the creation of cars. Manufacturing jobs in auto plants became a large sector of the economy. Car companies became such a huge employer that the US government had to bail out these companies in 2008 to keep all their employees working.
Not only that, but jobs were created to pave streets for cars. Many businesses were built on what cars made possible. Restaurants, motels, hotels, conferences, movie theaters, apartment complexes, office complexes, and suburbs all expanded after the creation of cars.
The number of jobs created by the second, third, and fourth order effects of the creation of cars was one hundred times the number that disappeared. Marc Andreessen argues that with the creation of new technology, the efficiency of that market goes up, liberating capital that can be invested in other areas.
Others are more concerned about the lack of innovation than the economic effects of innovation. In a few presentations, Peter Thiel argued that he is far more worried about the lack of good technologies than the danger of evil in technology applications or their consequences.
Peter Thiel argues that there hasn’t been much innovation in past years. For example, he argues that the nuclear industry has been dead for decades, while other promises like cleantech just became toxic words for losing money badly.
If technology had had such an impact on society, then the price of goods would have gone down. But Thiel argues that, for example, the price of commodities has not gone down as technology expanded.
In fact, there was a famous bet between two economists, Simon and Ehrlich,* in the 80s. Simon said that the price of commodities would go down in the next decade, while Ehrlich said that it would go up. Simon was right in the 80s, meaning that commodity prices went down in that decade.
But if you look at the next decades, from 1993 to 2003, and 2003 to 2013, commodity prices have gone up, which would show that technology has not had as significant an effect on the economy as some people have predicted.
Peter Thiel stated that most innovation has happened only in the world of bits, and not the world of atoms, and that computers alone can’t do everything. He argued that people are free to do things in the world of bits, and not free to do stuff in the world of things.
But we might start to see the effects in the world of atoms. Battery prices have been falling for years, following Wright’s Law.** Batteries cost around $1,000/kWh in 2010 and have since fallen to around $100/kWh. Solar panel prices have followed the same curve. The cost to decode the human genome has fallen faster than Moore’s Law.* The world of atoms might be at the tipping point of disruption.
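Wright’s Law says that cost falls by a constant percentage each time cumulative production doubles. The sketch below applies that relationship to the battery figures above. The 20% decline per doubling is a placeholder assumption, not a measured value, and the function names are illustrative.

```python
import math


def wrights_law_cost(initial_cost, doublings, learning_rate):
    """Cost after a number of cumulative-production doublings.

    learning_rate is the fractional cost decline per doubling (0.2 means 20%).
    """
    return initial_cost * (1 - learning_rate) ** doublings


def doublings_needed(initial_cost, target_cost, learning_rate):
    """How many doublings of cumulative production reach the target cost."""
    return math.log(target_cost / initial_cost) / math.log(1 - learning_rate)


# Battery figures from the text: roughly $1,000/kWh falling to roughly $100/kWh.
# The 20% learning rate is an assumption made only for this example.
n = doublings_needed(1000, 100, learning_rate=0.20)
print(f"Doublings needed at 20% per doubling: {n:.1f}")
print(f"Check: ${wrights_law_cost(1000, n, 0.20):.0f}/kWh")
```

Because the decline is tied to cumulative production rather than to time, prices fall quickly only while production keeps doubling.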
Detective Del Spooner: Human beings have dreams. Even dogs have dreams, but not you, you are just a machine. An imitation of life. Can a robot write a symphony? Can a robot turn a … canvas into a beautiful masterpiece?
Robot Sonny: Can you?
—I, Robot (2004)
Using the past as an indicator of the future, this final chapter addresses how artificial intelligence systems might evolve into artificial general intelligence. It explains the difference between knowing that versus knowing how. And given that the brain is a good indicator of how AI systems evolve, we know that for the animal kingdom there is a high correlation between intelligence and the number of pallial and cortical neurons. The same has been true for deep learning. The higher the number of neurons, the more performant a multilayer neural network is. While artificial neural networks still have a few orders of magnitude fewer neurons than the human brain, we are marching toward that milestone. Finally, we’ll talk about the Singularity, a point where artificial intelligence might be hard to control.
Arthur C. Clarke famously wrote, “Any sufficiently advanced technology is indistinguishable from magic.”* If you were to go back to the 1800s, it would be unthinkable to imagine cars traveling at 100 mph on the highway or living with handheld devices for connecting with people on the other side of the planet.
Since the Dartmouth Conference and the creation of the artificial intelligence field, great strides have been made. The original dream many had of computers, which was to perform any intellectual task better than humans, is much closer than before. Though, some argue that this may never happen or is still in the very distant future.
The past, however, may be a good indication of the future. Software is better than the best humans at playing checkers, chess, Jeopardy!, Atari, Go, and Dota 2. It already performs text translation for a few languages better than the average human. Today, these systems improve the lives of millions of people in areas like transportation, e-commerce, music, media, and many others. Adaptive systems help people drive on highways and streets, preventing accidents.
At first, it may be hard to imagine computer systems performing what once were cerebral tasks like designing and engineering systems or writing a legal brief. But at one time, it was also hard to imagine systems triumphing over the best humans at chess. People claim that robots do not have imagination or will never accomplish tasks that only humans can perform. Others say that computers cannot explain why something happens and will never be able to.
The problem is that for many tasks humans cannot explain why or how something happens, even though they might know how to do it. A child knows that a bicycle has two wheels, its tires have air, and you ride it by pushing the pedals forward in circles. But this information is different than knowing how to ride a bicycle. The first kind of knowledge is usually called “knowing that,” while the skill of riding the bike is “knowing how.”
These two kinds of knowledge are independent of each other, but they might help each other. Knowing that you need to push the pedals forward can help a person ride a bike. But “knowing how” cannot be reduced to “knowing that.” Knowing how to ride a bike does not imply that you understand how it works. In the same way, computers and humans perform many tasks that require them to know how to do something without “knowing that.” For example, many rules apply to the pronunciation of certain words in English. People know how to pronounce the words, but they cannot explain why. A person who has access to a Chinese dictionary may actually understand Chinese with the help of that resource. Computers, in the same way, perform tasks and may not be able to explain the details. Asking why computers do what they do might be the same as asking why someone swings a bat the way they do when playing baseball.
It is hard to predict how everything will play out in the future and what will come next. But looking at the advances of the different subfields of artificial intelligence and their performance over time may be the best predictor of what might be possible in the future. Given that, let’s look at the advances in the different fields of AI and how they stack up. From natural language processing and speech recognition to computer vision, systems are improving linearly, with no signs of stopping.
Figure: AI advances on different benchmarks over time.* First Image: Top-5 accuracy asks whether the correct label is among the classifier’s top five predictions. It shows that accuracy has improved from around 85% in 2013 to almost 99% in 2020. Second Image: CityScapes Challenge. Cityscapes is a large-scale dataset of diverse urban street scenes across 50 different cities recorded during the daytime. This task requires an algorithm to predict the per-pixel semantic labeling of the image. Third Image: SuperGLUE Benchmark. SuperGLUE is a single-metric benchmark that evaluates the performance of a model on a series of language understanding tasks on established datasets. Fourth Image: Visual Question Answering Challenge: Accuracy. The VQA challenge, introduced in 2015, requires machines to provide an accurate natural language answer, given an image and a natural language question about the image based on a public dataset.
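The top-5 accuracy metric described in the caption can be sketched as follows. The classifier scores and labels are made up for illustration, and the function name is not taken from any benchmark's official tooling.

```python
def top_k_accuracy(score_rows, true_labels, k=5):
    """Fraction of examples whose true label is among the k highest-scoring labels."""
    hits = 0
    for scores, truth in zip(score_rows, true_labels):
        # Rank the candidate labels from highest to lowest score.
        ranked = sorted(scores, key=scores.get, reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Hypothetical classifier scores for two images over six labels each.
predictions = [
    {"cat": 0.40, "dog": 0.30, "fox": 0.10, "wolf": 0.10, "lynx": 0.06, "car": 0.04},
    {"car": 0.50, "bus": 0.20, "van": 0.15, "bike": 0.08, "tram": 0.05, "cat": 0.02},
]
labels = ["lynx", "cat"]

print(top_k_accuracy(predictions, labels, k=5))
print(top_k_accuracy(predictions, labels, k=1))
```

In the first image the true label ranks fifth, so it counts as correct under top-5 but not under top-1. That is why top-5 accuracy is always at least as high as top-1 accuracy.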
Algorithms can only solve problems like self-driving cars and winning Go games if they have the correct data. For these algorithms to exist, it is essential to have properly labeled data. In research circles, significant efforts are underway to reduce the size of the datasets needed to create the appropriate algorithms, but even with this work, there is still a need for large datasets.
Figure: Dataset size comparison with the number of seconds that a human lives from birth to college graduation.
Datasets are already comparable in size to what humans capture during their lifetime. The figure above compares the size of the datasets used to train computers to the number of seconds from birth to college graduation of a human on a logarithmic scale. One of the datasets in the figure is Fei-Fei Li’s ImageNet described earlier in this book. The last dataset in the picture is used by Google to create their model for understanding street numbers on the façades of houses and buildings.
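The human side of that comparison is simple arithmetic. The sketch below assumes graduation at about age 22, a figure not stated in the text, and reports the result as a power of ten so it can be placed on a logarithmic scale.

```python
import math

# Average year length in seconds, including leap years.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def seconds_lived(years):
    """Number of seconds in the given number of years."""
    return years * SECONDS_PER_YEAR


# Assumption: college graduation at about age 22.
seconds = seconds_lived(22)
print(f"{seconds:.2e} seconds, about 10^{math.log10(seconds):.1f}")
```

A dataset with hundreds of millions of examples is therefore on the same order of magnitude as one example per second over that span.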
An entire field of research studies how to combine machine learning models with humans who fix and change labeled data. But it is clear that the amount of data that we can capture in our datasets is already equivalent to what humans capture over their lifetime.
But machine learning software does not depend solely on data. Another piece of the puzzle is computational power. One way of analyzing the computational power of neural networks deployed today versus what human brains use is to look at the size of the neural networks in these models. The figure below compares them on a logarithmic scale.
Figure: Comparison of the model size of neural networks and the number of neurons and connections of animals and humans.
Neural networks shown in this figure were used to detect and transcribe images for self-driving cars. The figure compares the scale of both the number of neurons and the connections per neuron. Both are important factors for neural network performance. Artificial neural networks are still orders of magnitude away from the size of the human brain, but they are starting to become competitive with some mammals.*
Figure: 122 years of Moore’s Law: Calculations per second per constant dollar. This is an exponential/log scale, so a straight line is an exponential; each y-axis tick is 100x. This graph covers a 10,000,000,000,000,000,000x improvement in computation/$.
The price of computation has declined over time, and the incremental computation power available to society has increased. The amount of computing power one can get for every dollar spent has been increasing exponentially. In fact, in an earlier section, I showed that the amount of compute used in the largest AI training runs has been doubling every 3.5 months. Some argue that computing power cannot continue this trend due to physics constraints. Past trends, however, do not support this theory. Money and resources in the area have increased over time as well. More and more people work in the field, developing better algorithms and hardware. And we know that the computing power of the human brain can be achieved, because the brain itself satisfies the constraints of physics.
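The two exponential trends mentioned here can be put on a common footing with a short calculation. The sketch assumes smooth, constant exponential growth, which is a simplification of the real data, and the function names are illustrative.

```python
import math


def doubling_time_years(total_growth, years):
    """Average doubling time implied by a total growth factor over a period."""
    return years / math.log2(total_growth)


def growth_per_year(doubling_months):
    """Yearly growth factor implied by a doubling time given in months."""
    return 2 ** (12 / doubling_months)


# From the figure caption: a 10^19 improvement over 122 years.
moore = doubling_time_years(1e19, 122)
# From the text: AI training compute doubling every 3.5 months.
ai = growth_per_year(3.5)

print(f"Computation per dollar doubled about every {moore:.2f} years")
print(f"AI training compute grew about {ai:.1f}x per year")
```

Under these assumptions, compute per dollar doubled roughly every two years, while the compute used in the largest training runs grew about tenfold each year. The gap between the two is covered by larger budgets rather than cheaper hardware alone.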
With more computing power and improved software, it may be that AI systems eventually surpass human intelligence. The point at which these systems become smarter and more capable than humans is called the Singularity. For every task, these systems will be better than humans. When computers outperform humans, some people argue that they can then continue to become better and better. In other words, if we make them as smart as us, there is no reason to believe that they cannot make themselves better, in a spiral of ever-improving machines, resulting in superintelligence.
Some predict that the Singularity will come as soon as 2045. Nick Bostrom and Vincent C. Müller conducted a survey of hundreds of AI experts at a series of conferences and asked by what year the Singularity (or human-level machine intelligence) will happen with a 10% chance, 50% chance, and 90% chance. The responses were the following:
Median optimistic year (10% likelihood): 2022
Median realistic year (50% likelihood): 2040
Median pessimistic year (90% likelihood): 2075*
So, that means that AI experts believe there is a good chance that machines will be as smart as humans in around 20 years.
This is a controversial topic, as there are experts, including John Carmack, who believe that we will start to see signs of AGI within a decade.* But others, such as Kevin Kelly, argue that “Artificial General Intelligence” is a myth.* Either way, if the pessimistic timetable for achieving it is any indication, we will know by the end of the century whether it is starting to materialize.
If the Singularity is as near as many predict and it results in artificial general intelligence that surpasses human intelligence, the consequences are unthinkable to society as we now know it. Imagine that dogs created humans. Would dogs understand the result of creating such creatures in their lives? I doubt it. In the same way, humans are unlikely to understand this level of intelligence, even if we initially created it.
Optimists argue that because of the surge of the Singularity, solutions to problems previously deemed impossible will soon be obvious, and this superintelligence will solve many societal problems, such as mortality. Pessimists, however, say that as soon as we achieve superintelligence, then human society as we know it will become extinct. There would be no reason for humans to exist. The truth is that it is hard to predict what will come after the creation of such technology, though many agree that it is near.
This chapter reflects recent developments and was last updated in October of 2022.
This landscape of the top artificial intelligence teams aims to track the most prominent teams developing products and tools in each of several areas. Tracking these teams gives a good starting point of the activity of where future development will be.
2022 has seen remarkable tools being developed by top teams, including some, especially DALL-E 2, that are so impressive they have gone viral. This builds on high-profile tools released to the public in the past two years, including GPT-3 in 2020 and GitHub Copilot (based on GPT-3) in June 2021, which now enjoys widespread use by almost 2 million developers.
In 2022, we continue to see the growing size of neural networks, even though there hasn’t been a new development in neural networks as game-changing as Transformers in 2017. In 2021, Microsoft released a 135 billion parameter neural network model, and at the end of 2021, Nvidia together with Microsoft released an even larger model, called Megatron-Turing NLG, with 530 billion parameters. There is no reason to believe that the growth will stop any time soon. We haven’t seen a model of headline-grabbing size as of July 2022, but that could change by the end of the year.
Credit: Generated with DALL-E 2.
In April 2022, DALL-E 2, an AI system that can create realistic images and art from a description in natural language, was released. It took the world by storm.
Moreover, these tools have been adopted faster and faster. It took around 2 years for GPT-3 to gather 1 million signups. GitHub Copilot took around 6 months, and DALL-E 2 only 2.5 months to hit the same milestone.
The capabilities of AI systems are improving steadily and predictably. For example, it is not a surprise to anyone following the industry to see DALL-E 2 come to life—it was a natural evolution of generative capabilities for images, the pieces had been built, and the quality of image generation has been improving steadily.
A rapidly expanding number of companies have formed to help machine learning engineers. In the landscape of top artificial intelligence companies, there are more than 20 companies serving developers. That reflects growth in each area. As these tools mature, it’s likely we will see more consolidation in machine learning developer tooling.
One of the most important companies in the latest round of innovation is Hugging Face. Originally started in 2016 by two French entrepreneurs, Clément Delangue and Julien Chaumond, Hugging Face provides the tools that engineers need to create new natural language processing (NLP) services. It serves research labs such as Meta AI and Google AI and companies like Grammarly and Microsoft. Just as we are seeing Hugging Face transform NLP, we can expect to see companies emerge on top of newer image generation tools. Four years ago, NLP was in a similar state to image generation tools right now.
Hugging Face is also leading the BigScience Research Workshop, a one-year-long research workshop on large multilingual models and datasets. Over one year, 900 researchers from 60 countries and more than 250 institutions created a very large multilingual neural network language model and text dataset on the 28 petaflop Jean Zay (IDRIS) supercomputer located near Paris, France. It is all open source. The model finished training in June 2022, and it will be available to the public. This is the first massive project of its kind where the model is openly available to the public.
Some of the latest models are being offered as a service to other companies. An important example of what is coming is what Replit is doing by using OpenAI’s APIs: rolling out a tool that explains code with natural language and a tool to help fix buggy code before it is deployed. Similarly, Hugging Face hosts the state-of-the-art models that other companies and research labs can use.
This reflects a transition from developer APIs that require developers to build their own models to ready-to-use models, which will unleash capabilities rapidly, as powerful models with APIs are integrated into many products—“develop once, deploy everywhere.” The faster adoption will be both by end users and by companies. As these teams sell their tools to companies, they will have a bigger base to sell newer tools. If the past cycle of ML tools product development was defined by software-as-a-service, this cycle is seeing the emergence of models-as-a-service.
There has been an explosion of generative media companies addressing a range of applications, from writing text to creating personalized videos. One of the most popular new tools is Lex, which helps writers be more productive. Tools that help consumers write better marketing copy are especially numerous. There are at least nine companies tackling marketing copy: Anyword, Copysmith, Writesonic, Hypotenuse AI, Jasper, Copy.ai, Peppertype, Regie.ai, and Contenda. Generative media is currently making rapid progress on text and images, but as the algorithms improve, video will become a bigger focus. There are a few players in the ecosystem helping create generative media, including big companies, startups, artists, and chip makers. It is not clear which group will capture most of the value of these models. Generative media also enables a new kind of artist to emerge. “Prompt engineering,” the technique these artists use to manipulate the prompts that produce images, audio, and videos, allows artists to become more productive and unlock their creative minds. However, there is now an emerging legal debate around the copyright for AI-generated media.
Image editing is used everywhere in the creator economy, and better editing capabilities are in constant demand. Facet.ai exemplifies this trend: users segment their images with a one-click tool, apply style-transfer from other images, and also easily apply the same style to other images.
Lightricks is another company, with a mobile-first focus. One of its apps, Facetune, helps users edit and modify their selfies with simple gestures. Another company building editing tools is Topaz Labs, which helps upscale or denoise images or videos.
Audio processing and creation has seen massive improvements in the past two years or so. AI assistants like Siri, Alexa, and Google Assistant have made the average consumer aware of, and used to, the fact that you can interact with AI through speech.
We are now seeing the creation of new tools that help people communicate and edit audio almost as easily as text. The leader in the field is Descript. It helps users remove unwanted audio through its text editing tool. It also allows users to create new clips of audio with the user’s own voice with its text-to-speech models. Some companies like Krisp are helping users remove unwanted background noise.
Intelligent video editing requires significantly greater compute capabilities compared to image and audio editing, so it is not as advanced yet.
Tools like RunwayML help editors modify videos by adding new styles or segmenting them automatically. RunwayML can also remove subjects with a simple brush stroke. Going further, Synthesia can create fully synthetic tutorial videos out of text. We are seeing only the very beginning of these capabilities. We could well begin to see AI-generated or AI-enhanced influencers in a few years. (Do any old TV fans remember Max Headroom?)
Self-driving companies were heavily hyped about five years ago, when it seemed like every other week another company was raising yet more venture funding. Then there was significant consolidation in the market as some failed and others were sold. Now, 2022 is the year that a lot of the successful efforts are deploying their systems to the general public. For example, Cruise is now being paid for its rides in San Francisco. Comma.ai is profitable and sells a device that connects to cars to help them navigate on highways.
But the most prominent success in self-driving technology is Tesla. It is selling almost one million cars per year and is increasing its production 50% year-over-year. Its cars come with self-driving software, and Tesla releases its safety numbers. The software has consistently increased safety for drivers, though with broader use, both accidents and regulatory scrutiny are growing: 873,000 vehicles now have Tesla Autopilot and these were involved in 273 accidents last year.
In the past year Tesla displayed a move towards usage of neural networks in every area of its stack: from perception to prediction. This year, Elon Musk stated in a TED talk that Tesla self-driving cars will be better than humans by the end of the year. His predictions about self-driving technology have been over-optimistic in the past, but the progress is undeniable.
The leader in self-flying drones is Skydio, which first released its drone with object avoidance in 2018. Now all major drone companies offer it. Self-flying drones are now the standard. Companies are now starting to offer smaller drones. Snapchat has recently launched a new small drone called Pixy to help users take selfies. They are much more lightweight and pack in fewer features compared to full-fledged drones.
The ever-increasing computational demands of bigger and bigger neural network models mean that more companies are specializing in creating chips for these workloads. The most prominent one is Cerebras, which announced its wafer-sized chip in 2019. The problem is that new hardware also needs software to translate neural network code into operations the chip can run. That software layer is a big advantage that Nvidia has over its competitors.
More companies, including Meta and Tesla, are announcing big clusters of GPUs or AI chips to train their models, and the size of these clusters has been increasing at an exponential rate. This year, Meta AI announced the AI Research SuperCluster, planned to reach 16,000 GPUs. Meta expects it to be the fastest AI supercomputer in the world once fully built. In 2021, Tesla also announced its supercomputer called Dojo, which when announced had around 5,000 GPUs, and it is increasing the number of GPUs over time. Tesla is also developing its own chips for training neural networks.
The amount of compute for machine learning and AI tools will only increase over time. ARK Invest predicts that AI-relative compute unit production costs could decline at a 39% annual rate and that software improvements could contribute an additional 37% in cost declines during the next eight years. That means that the GPT-3 model, which cost around 12 million dollars when created, would cost a few hundred dollars in 2030. In short, there is a lot of room for new entrants and competition for hardware to power AI applications.
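The arithmetic behind a projection like this can be sketched in a few lines of Python. This is only an illustration, not ARK Invest's model: it assumes the hardware and software declines compound multiplicatively every year, and the function name `projected_cost` and the horizons tried are hypothetical choices made for the example.

```python
def projected_cost(initial_cost, hardware_decline, software_decline, years):
    """Return the cost after `years`, assuming (as an illustration) that the
    hardware and software declines compound multiplicatively each year."""
    yearly_factor = (1 - hardware_decline) * (1 - software_decline)
    return initial_cost * yearly_factor ** years


GPT3_COST = 12_000_000  # dollars, the training cost quoted in the text

# Try two horizons: the eight-year forecast window and a ten-year span.
for years in (8, 10):
    cost = projected_cost(GPT3_COST, 0.39, 0.37, years)
    print(f"after {years} years: ${cost:,.0f}")
```

Under these assumptions the result is very sensitive to the horizon, because each additional year cuts the cost by a bit more than 60%. Eight years of compounding leaves a cost in the thousands of dollars, while a figure in the hundreds of dollars corresponds to roughly ten years.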
Predictive coding has stalled in its current form. Yann LeCun has been on a quest to create a model that is better at predictions, beginning with a specific proposal: to predict the next frames of a video. However, this work hasn’t had significant breakthroughs yet.
LeCun is now working on a new architecture with six separate modules to try to break through this wall. The hope is to step up deep learning algorithms in ways that go beyond simply increasing the size of neural networks. The six modules are: the configurator, the perception module, the world model module, the cost module, the actor module, and the short-term memory module. These modules were created based on how the human brain works.
This is a novel idea and it will take researchers some time to figure out if this new approach with multiple modules works.
Beyond the continuous movement toward faster computation and bigger neural nets, there seems to be another significant trend: more tools and models are becoming reusable in open source and as APIs, and more money is flowing into key groups that fund these flexible and reusable tools and models. So more combinations of tools are possible more quickly than ever before—a sort of “network effect” for ML tooling and models. This means we can realistically expect unusually rapid adoption of neural nets in more and more software.
Given the increase in computation and the exciting new models released in the past few years, including GPT-3, DALL-E 2, GitHub Copilot, and LaMDA (Google’s powerful new language model for conversational applications), we continue to see rapid advancements in software for text, image, video, and audio understanding and generation. In spite of the viral, intelligent-sounding chat shared by one Google employee, LaMDA is only synthesizing human-like responses to questions, and it does not make sense to say that LaMDA, or any other neural network model, is sentient.
Coupled with the trend toward models-as-a-service, we are likely to see these capabilities embedded in many more products quickly. While specific concerns like deep fakes and “sentient” chatbots will continue to grab headlines, it seems much more likely that machine learning will increasingly appear simply as highly useful features embedded in the products we use every day.
The resources here are a small subset of the full set of resources available on the web, selected for their breadth, notability, and depth on specific issues.
François Chollet, Deep Learning with Python, 2017
Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, 2020
Jeremy Howard and Sylvain Gugger, Deep Learning for Coders with fastai and PyTorch, 2020
John Brockman, Possible Minds: Twenty-Five Ways of Looking at AI, 2019
Pedro Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World, 2015
Nick Bostrom, Superintelligence: Paths, Dangers, Strategies, 2014
John Brockman, What to Think About Machines That Think: Today’s Leading Thinkers on the Age of Machine Intelligence, 2015
Stanford’s Machine Learning Course by Andrew Ng
MIT’s Machine Learning with Python: from Linear Models to Deep Learning
Udacity’s Self-Driving Car Engineer Nanodegree
Google AI’s Machine Learning Crash Course
Stanford’s AI Index Report, 2022
State of AI Report, 2021
Shivon Zilis’s The Current State of Machine Intelligence, 2016
This list is far from complete but offers information about some of the people mentioned in this book.
Greg Brockman is the co-founder and CTO of OpenAI. He was formerly the CTO of Stripe.
Rodney Brooks is the co-founder of iRobot and Rethink Robotics, and former director of the MIT Computer Science and Artificial Intelligence Laboratory.
Adam Cheyer is an AI researcher and the developer of Siri. He has co-founded several startups, including Siri, Inc.
Douglas Engelbart was a researcher and inventor. He is the creator of the field of human-computer interaction (HCI). He is best known for the creation of the computer mouse, but he is also known for demonstrating the power and the potential of the computer in the information age.
David Ferrucci was the lead scientist of IBM Watson from 2007 until 2011.
Ian Goodfellow is a director of machine learning in the Special Projects Group at Apple. He is also one of the inventors of generative adversarial networks.
Demis Hassabis is a researcher, neuroscientist and the CEO and co-founder of DeepMind.
Geoffrey Hinton is a British-Canadian computer scientist, most famous for his work on artificial intelligence and most specifically for ushering in the era of deep learning.
George Hotz is the founder of Comma.ai. He is well-known for developing iOS jailbreaks and reverse engineering the PlayStation 3.
Garry Kasparov is a former World Chess Champion who played against IBM’s supercomputer Deep Blue.
Yann LeCun is the VP and Chief AI Scientist at Meta and a Professor of Mathematical Sciences at New York University. He is considered the founder of convolutional neural networks.
Fei-Fei Li is the Co-Director of the Stanford Institute for Human-Centered Artificial Intelligence, and a Co-Director of the Stanford Vision and Learning Lab. She is the creator of ImageNet.
John McCarthy was a computer scientist. He was one of the founders of artificial intelligence and designed the Lisp programming language.
Donald Michie was a researcher in artificial intelligence. He worked with Alan Turing at Bletchley Park and sketched designs for developing a system to play a chess game.
Marvin Minsky was an American computer scientist and one of the first pioneers in artificial intelligence. He co-founded the Massachusetts Institute of Technology’s Artificial Intelligence Laboratory.
Andrew Ng is a computer scientist and technology entrepreneur focusing on machine learning. He was head of Google Brain, and built Baidu’s Artificial Intelligence Group.
Judea Pearl is a computer scientist. He is best known for his work in probabilistic reasoning, and more specifically in the development of Bayesian networks.
Nathaniel Rochester was a computer scientist and worked at IBM, producing some of the first AI programs.
Frank Rosenblatt was a psychologist notable in the field of artificial intelligence, who implemented an early demonstration of a neural network.
Arthur Samuel was one of the leading figures in artificial intelligence. He popularized the term “machine learning” in 1959.
Roger Schank is an artificial intelligence theorist, most known for the creation of SAM, an acronym for Script Applier Mechanism.
John Searle is a philosophy professor at the University of California, Berkeley. He is most known for the creation of the Chinese room argument.
James Slagle is a computer scientist. He is most known for his achievements in artificial intelligence, including the development of the Symbolic Automatic Integrator or SAINT.
Stefanie Tellex is an Associate Professor of Computer Science at Brown University and creator of the Million Object Challenge.
Sebastian Thrun is a computer scientist and entrepreneur. He is the CEO of Kitty Hawk Corporation, and co-founder of Udacity.
Alan Turing was a mathematician, logician, philosopher, and computer scientist. He is widely considered to have founded the fields of theoretical computer science and artificial intelligence.
Joseph Weizenbaum was a computer scientist and a professor at MIT, and the creator of the first chatbot, ELIZA.
Amazon’s Alexa is a virtual assistant created by Amazon.
AlphaFold is a machine learning algorithm developed by DeepMind that predicts 3D protein structures from a protein’s amino acid sequence alone.
DARPA Grand Challenge was a competition for autonomous vehicles that was funded by the Defense Advanced Research Projects Agency.
Google X is a research and development organization inside Google that is focused on what it calls moonshots.
Lisp is the first programming language optimized for artificial intelligence. It was created by John McCarthy in 1958.
Prolog is a logic programming language first developed in 1972 that is associated with artificial intelligence and computational linguistics.
ELIZA was the first chatbot, created at the MIT Artificial Intelligence Laboratory.
Deep Thought was a computer developed to play chess at Carnegie Mellon University and later at IBM.
Deep Blue was a chess-playing system built by IBM. It was the first computer to win a match against a reigning world chess champion.
Watson was the first computer program that beat the best Jeopardy! players, and was developed by IBM.
DARPA (or the Defense Advanced Research Projects Agency) is a research and development agency in the United States responsible for developing technologies for the military.
A graphics processing unit (GPU) is a chip or electronic circuit that is specialized for rendering graphics on an electronic device.
A tensor processing unit (TPU) is a chip developed by Google as an AI accelerator specifically for neural network machine learning.
ImageNet is an image database organized according to the WordNet hierarchy, in which each node of the hierarchy is depicted by hundreds and thousands of images.
TensorFlow is an open-source software library for machine learning. It has a particular focus on training and inference of deep neural networks.
Mechanical Turk is a crowdsourcing website developed by Amazon for businesses to hire remotely located workers to perform discrete on-demand tasks that computers are currently unable to do.