A Brief History of AI



Updated November 2, 2022

You’re reading an excerpt of Making Things Think: How AI and Deep Learning Power the Products We Use, by Giuliano Giacaglia.

The advancement of artificial intelligence (AI) has not been a straight path—there have been periods of booms and busts. This first section discusses each of these eras in detail, starting with Alan Turing and the initial development of artificial intelligence at Bletchley Park in England, and continuing to the rise of deep learning.

The 1930s to the early 1950s saw the development of the Turing machine and the Turing test, which were fundamental in the early history of AI. The official birth of artificial intelligence was in the mid-1950s with the onset of the field of computer science and the creation of machine learning. The year 1956 ushered in the golden years of AI with Marvin Minsky’s Micro-Worlds.

For 18 years, AI experienced a boom in funding and growth in university labs. Unfortunately, the government, as well as the public, became disenchanted with the lack of progress. While producing solid work, those in the field had overpromised and underdelivered. From 1974 to 1980, funding almost completely dried up, especially from the government. There was much criticism during this period, and some of the negative press came from AI researchers themselves.

In the 1980s, computer hardware was transitioning from mainframes to personal computers, and with this change, companies around the world adopted expert systems. Money flooded back into AI. The downside to expert systems was that they required a lot of data, and in the 1980s, storage was expensive. As a result, most corporations could not afford the cost of AI systems, so the field experienced its second bust. The rise of probabilistic reasoning ends the first section of Making Things Think at around the year 2001.

The Early Beginnings of AI (1932–1952)

I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted. (Alan Turing*)

This chapter covers Alan Turing, the initial developments of artificial intelligence at Bletchley Park in England, and how they helped break Germany’s codes and win World War II. I also describe the development of the Turing machine and the Turing test, which became the gold standard for evaluating artificial intelligence systems for decades. We’ll also meet Arthur Samuel and Donald Michie, who started early developments in artificial intelligence systems and created engines for systems to play games.

Alan Turing

During the Second World War, the British and the Allies had the help of thousands of codebreakers located at Bletchley Park in the UK. In 1939, one of these sleuths, Alan Turing, a young mathematician and computer scientist, was responsible for the design of the electromechanical machine named the Bombe. The British used this device to break the German Enigma Cipher.* At the same location in 1943, Tommy Flowers, with contributions from Turing, designed the Colossus, a set of computers built with vacuum tubes, to help the Allies crack the Lorenz Cipher.* These two devices helped break the German codes and predict Germany’s strategy. According to US General Eisenhower, cracking the enemy codes was decisive for the Allies winning the war.

These events marked the initial development of artificial intelligence. In their free time during the war, Turing and Donald Michie, a cryptographer recruited to Bletchley Park, had a weekly chess game. While playing, they talked about how to write a computer program that would play against human opponents and beat them. They sketched their designs with pen and paper. Unfortunately, they never went ahead and coded their program. At the time, the state-of-the-art computer was the Atanasoff-Berry computer, designed to only solve linear equations. It would have been very hard for the pair to code a program that could beat humans using such computers. However, these meetings contributed to the early beginnings of the artificial intelligence field. Because of his work during and after the war, Turing became known as the father of theoretical computer science and artificial intelligence.


But when the war ended, the group that once worked together at Bletchley Park parted ways. Turing, however, did not stop his research; he continued in the computer field. He had already made a name for himself before the war with his seminal 1936 paper* on computing, explaining how machines like computers worked. This mathematical structure became the basis for modeling computers and was later named the Turing machine. Between 1945 and 1947, Turing designed the Automatic Computing Engine (ACE), an early electronic stored-program computer, at the National Physical Laboratory. He continued pursuing the idea of writing a chess program and worked on the theoretical framework for doing so. By 1948, he, working with David Champernowne, a former undergraduate colleague, began coding the program even though no computer at the time could run it. By 1950, he had finished Turochamp.

In 1952, he tried to implement Turochamp on a Ferranti Mark 1, the first commercially available general-purpose electronic computer. But the machine lacked enough computing power to execute Turochamp. Instead, Turing ran the computer program by hand, flipping through the pages of the algorithm. This exercise marked the first demonstration of a working artificial intelligence system. It would take 45 more years for a computer program to win against a chess world champion. The humble beginnings of AI started with Turing’s work.

With the rapid development in computing and AI, Turing wrote about the future of the field in his 1950 seminal paper “Computing Machinery and Intelligence.”* He predicted that by the 2000s, society’s opinion regarding artificial intelligence would shift completely due to technological advances. His prediction, in some ways, turned out to be correct.

Neural Networks’ Early Days

Neural networks are computer systems that are modeled (more or less loosely) on how neurons in the human brain function and interact.

By 1945, Turing was already thinking about how to simulate the human brain with a computer. His Automatic Computing Engine created models of how the brain worked. In a letter to a coworker, he wrote, “I am more interested in the possibility of producing models of the action of the brain than in the practical applications to computing … although the brain may in fact operate by changing its neuron circuits by the growth of axons and dendrites, we could nevertheless make a model, within the ACE, in which this possibility was allowed for, but in which the actual construction of the ACE did not alter, but only the remembered data …”*

In 1948, Turing defined two types of unorganized machines, which would be the first computer models of brains and become the basis of neural networks. He based one on how transistors work and the other on how neural networks would eventually be modeled. Around the same time, he also defined genetic search to configure his unorganized machines by searching for the best model of a neural network for a given task.

The Imitation Game

Figure: Alan Turing, who founded the fields of theoretical computer science and artificial intelligence.

The Turing test is a game in which evaluators try to guess which of two participants is a computer. The evaluators know only that one of the two participants is a computer. The conversation is restricted to a text-only channel, such as a keyboard and screen. If the judges cannot reliably tell the machine from the human, then the computer passes the test and can be said to exhibit human-level intelligence.

Alan Turing defined the Imitation Game in 1950; it later became more commonly known as the Turing test and became the gold standard for judging whether a computer exhibits the same intelligence as a human. In the party game that inspired the Imitation Game, a man and a woman occupy different rooms, and the onlookers try to guess who is in which room by reading their typewritten responses to questions. The contestants answer in a way that tries to convince the judges that they are the other person. In the Turing test, instead of a man and a woman, the interaction happens between a human and a computer.

Figure: The Turing test. During the Turing test, the human questioner asks a series of questions to both respondents. After the specified time, the questioner tries to decide which terminal is operated by the human respondent and which terminal is operated by the computer.

The First Game-Playing Computer Program

After the war, Michie, Turing’s friend and fellow codebreaker, became a senior lecturer in surgical science at the University of Edinburgh. Even though his day job was not related to AI, he continued working on the development of artificial intelligence systems, especially games.

Michie did not have access to a digital computer because they were too costly at the time. While many hurdles existed, he developed a program to play a perfect tic-tac-toe game with 304 matchboxes, each representing a unique board state.* Michie’s machine not only played tic-tac-toe but was also able to improve on its own over time—learning how to better play the game.

The Birth of Machine Learning

In 1949, a coder named Arthur Samuel, an expert on vacuum tubes and transistors at IBM, brought IBM’s first commercial general-purpose digital computers to the market. On the side, he worked to implement a program that, by 1952, could play checkers against a human opponent. It was the first artificial intelligence program to be written and run in the United States. He worked tirelessly, and in 1956, he demonstrated it to the public. Samuel improved the underlying software by hand, and when he had access to a computer, he made changes there.

As time passed, he started wondering if the machine could make the same improvements by itself, instead of him having to write the rules for the program by hand. He pondered whether the device could do all the fine-tuning itself. With this idea in mind, he published a paper titled “Some Studies in Machine Learning Using the Game of Checkers.”*

Machine learning is the process in which a machine learns the variables of a problem and fine-tunes them on its own instead of humans hard-coding the rules for reaching the solution.

Samuel’s publication marked the birth of machine learning. One of the two learning techniques that Samuel described in his paper was called rote learning. Today, this technique is known as memoization, a computer science strategy used to speed up computer programs. The other method involved measuring how good or bad a specific board position was for the computer or its human opponent. By improving the measurement of a board state, the program could become better at playing the game. In 1961, Samuel’s program beat the Connecticut state checkers champion. It was the first time that a machine trumped a player in a state competition, a pattern that would repeat in the years to come.
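
Rote learning, in the sense described above, can be sketched as caching the score of each board position so that it is never computed twice. The following is a minimal Python illustration, not Samuel's actual program: the board encoding, the evaluate function, and its piece-count scoring are all assumptions made for the example.

```python
from functools import lru_cache

# Hypothetical board encoding: a tuple of characters, where
# "c" is a computer piece, "h" is a human piece, and "." is empty.
calls = 0  # counts how often the evaluation really runs


@lru_cache(maxsize=None)
def evaluate(board):
    """Score a position: positive favors the computer, negative the human."""
    global calls
    calls += 1
    return board.count("c") - board.count("h")


position = ("c", "c", ".", "h")
first = evaluate(position)   # computed and stored
second = evaluate(position)  # looked up from the cache, not recomputed
```

After both calls, the counter shows that the scoring code ran only once; the second answer came from memory. A real checkers program would score positions with a far richer measure than a piece count.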

The Birth of Artificial Intelligence (1952–1956)

If after I die, people want to write my biography, there is nothing simpler. They only need two dates: the date of my birth and the date of my death. Between one and another, every day is mine. (Fernando Pessoa*)

The birth of artificial intelligence came with the initial development of neural networks, including Frank Rosenblatt’s creation of the perceptron model and the first demonstration of supervised learning. This period also saw the Georgetown-IBM experiment, an early language translation system. Finally, the end of the beginning was marked by the Dartmouth Conference, at which artificial intelligence was officially launched as a field in computer science, leading to the first government funding of AI.

Neural Networks

In 1943, Warren S. McCulloch, a neurophysiologist, and Walter Pitts, a mathematical prodigy, created the concept of artificial neural networks. They designed their system based on how our brains work and patterned it after the biological model of how neurons—brain cells—work with each other. Neurons interact with their extremities, firing signals via their axon across a synapse to neighboring neurons’ dendrites. Depending on the voltage of this electrical charge, the receiving neuron proceeds to either fire a new charge of electrical pulse to the next set of neurons, or not.

Figure: Artificial neural networks are based on the simple principle of electrical charges and how they are passed in the brain.

The hard part of modeling the correct artificial neural network, that is, one that achieves the task that you are trying to solve, is that you need to figure out what voltage one neuron should pass to another as well as what it takes for a neuron to fire.

Both the voltages and the firing criteria become variables that need to be determined for the model. In an artificial neural network, the voltage that is passed from neuron to neuron is called a weight. These weights need to be trained so that the artificial neural network performs the task at hand. One of the earliest ways to do this is called Hebbian learning, which we’ll talk about next.
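
To make the roles of weights and firing criteria concrete, here is a minimal Python sketch of a single artificial neuron. The function name neuron_fires, the weights, and the threshold are illustrative assumptions, not values taken from McCulloch and Pitts.

```python
def neuron_fires(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0


# With these hand-picked values the neuron behaves like a logical AND:
# it fires only when both incoming neurons fire.
weights = [1.0, 1.0]
threshold = 2.0
outputs = [
    neuron_fires([a, b], weights, threshold)
    for a in (0, 1)
    for b in (0, 1)
]
```

Here the weights and threshold were chosen by hand. Training a network means finding such values automatically.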

Hebbian Learning

In 1947, around the same time that Arthur Samuel was working on the program that would later beat a state checkers champion, Donald Hebb, a Canadian psychologist with a PhD from Harvard University, became a Professor of Psychology at McGill University. Hebb would go on to propose one of the first theories of how networks of neurons learn.

In 1949, Hebb developed a theory known as Hebbian learning, which proposes an explanation for how our neurons fire and change when we learn something new. It states that when one neuron fires to another, the connection between them develops or enlarges. That means that whenever two neurons are active together, because of some sensory input or other reason, these neurons tend to become associated.

Therefore, the connections among neurons become stronger or grow when the neurons fire together, making the link between the two neurons harder to break. Hebb explained how that is the way humans learn. Hebbian learning, the process of making connections stronger between neurons that fire together, was the way to create artificial neural networks early on, but later, other techniques became more predominant.
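
The idea that neurons that fire together become more strongly connected can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the hebbian_update function, the learning rate of 0.1, and the use of 0 and 1 for neuron activity are choices made for the example, not details from Hebb's theory.

```python
def hebbian_update(weights, activity, rate=0.1):
    """Strengthen the connection between every pair of co-active neurons."""
    n = len(activity)
    return [
        [
            weights[i][j] + (rate * activity[i] * activity[j] if i != j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]


# Three neurons with no connections yet.
weights = [[0.0] * 3 for _ in range(3)]

# Neurons 0 and 1 fire together five times; neuron 2 stays silent.
for _ in range(5):
    weights = hebbian_update(weights, [1, 1, 0])
```

After the loop, the connection between neurons 0 and 1 has grown, while every connection involving the silent neuron 2 is still zero.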

A network of neurons that has become associated with a memory, or with some pattern that causes all of these neurons to fire together, became known as an engram. Gordon Allport defines engrams as follows: “If the inputs to a system cause the same pattern of activity to occur repeatedly, the set of active elements constituting that pattern will become increasingly strongly inter-associated. That is, each element will tend to turn on every other element and (with negative weights) to turn off the elements that do not form part of the pattern. To put it another way, the pattern as a whole will become ‘auto-associated.’ We may call a learned (auto-associated) pattern an engram.”*

Early Demonstrations

With these models in mind, in the summer of 1951, Marvin Minsky, together with two other scientists, developed the Stochastic Neural Analog Reinforcement Calculator (SNARC), a machine with a randomly connected neural network of approximately 40 artificial neurons.* The SNARC was built to try to find the exit from a maze in which the machine played the part of the rat.

Minsky, with the help of an American psychologist from Harvard, George Miller, developed the neural network out of vacuum tubes and motors. The machine first proceeded randomly, then the correct choices were reinforced by making it easier for the machine to make those choices again, thus increasing their probability compared to other paths. The device worked and made the imaginary rat find a path to the exit. It turned out that, by an electronic accident, they could simulate two or three rats in the maze at the same time. And, they all found the exit.
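
The reinforcement idea described above, random choices at first and then a growing preference for choices that worked, can be sketched in Python. This is a toy model with a single junction, not a reconstruction of the SNARC hardware: the preference table, the reward of 1.0, and the assumption that the exit lies to the right are all invented for the example.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Preference for each turn at a single junction; equal at first,
# so the simulated rat starts out choosing at random.
preference = {"left": 1.0, "right": 1.0}
correct_turn = "right"  # assumed: the exit lies to the right

for _ in range(200):
    options = list(preference)
    turn = rng.choices(options, weights=[preference[o] for o in options])[0]
    if turn == correct_turn:
        # Reinforce the successful choice so it becomes more likely next time.
        preference[turn] += 1.0

chance_of_right = preference["right"] / sum(preference.values())
```

Each success makes the next success more probable, so the chance of turning right climbs well above the initial 50 percent.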

Minsky thought that if he “could build a big enough network, with enough memory loops, it might get lucky and acquire the ability to envision things in its head.”* In 1954, Minsky published his PhD thesis, presenting a mathematical model of neural networks and its application to the brain-model problem.*

This work inspired young students to pursue a similar idea. They sent him letters asking why he did not build a nervous system based on neurons to simulate human intelligence. Minsky figured that this was either a bad idea or would take thousands or millions of neurons to make work.* And at the time, he could not afford to attempt building a machine like that.


In 1956, Frank Rosenblatt implemented an early demonstration of a neural network that could learn how to sort simple images into categories, like triangles and squares.*

Figure: Frank Rosenblatt* and an image with 20x20 pixels.

He built a computer with eight simulated neurons, made from motors and dials, connected to 400 light detectors. Each of the neurons received a set of signals from the light detectors and spat out either a 0 or 1 depending on what those signals added up to.

Rosenblatt used a method called supervised learning, which is a way of saying that the data that the software looks at also has information identifying what type of data it is. For example, if you want to classify images of apples, the software would be shown photos of apples together with the tag “apple.” This approach is much like how toddlers learn basic images.

Figure: The Mark I Perceptron.

The perceptron is a supervised learning algorithm for binary classifiers. Binary classifiers are functions that determine if an input, which can be a vector of numbers, is part of a class.
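
A minimal Python sketch of a perceptron-style learning rule follows. It is an illustration rather than Rosenblatt's implementation: the function names, the learning rate, the number of passes over the data, and the tiny two-input dataset (logical OR) are assumptions chosen to keep the example short.

```python
def predict(weights, bias, x):
    """Binary classifier: 1 if the input is judged part of the class, else 0."""
    total = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=20, rate=1.0):
    """Nudge the weights whenever a labeled example is misclassified."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, label in samples:
            error = label - predict(weights, bias, x)  # 0 when correct
            weights = [w + rate * error * xi for w, xi in zip(weights, x)]
            bias += rate * error
    return weights, bias


# Labeled training data (the "supervision"): logical OR of two inputs.
samples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train_perceptron(samples)
predictions = [predict(weights, bias, x) for x, _ in samples]
```

The labels play the role of the tags in supervised learning: the weights change only when the prediction disagrees with the label, and they stop changing once every example is classified correctly.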

The perceptron algorithm was first implemented on the Mark I Perceptron. It was connected to a camera that used a 20x20 grid of cadmium sulfide* photocells* producing a 400-pixel image. Different combinations of input features could be experimented with using a patchboard. The array of potentiometers on the right* implemented the adaptive weights.*

Rosenblatt’s perceptrons classified images into different categories: triangles, squares, or circles. The New York Times featured his work with the headline “Electronic ‘Brain’ Teaches Itself.”* His work established the principles of neural networks. Rosenblatt predicted that perceptrons would soon be capable of feats like greeting people by name. The problem is, however, that his algorithm did not work with multiple layers of neurons due to the exponential nature of the learning algorithm: it required too much time for perceptrons to converge to what engineers wanted them to learn. This was eventually solved, years later, by a new algorithm called backpropagation, which we’ll cover in the section on deep learning.

A multilayer neural network consists of three or more layers of artificial neurons—an input layer, an output layer, and at least one hidden layer—arranged so that the output of one layer becomes the input of the next layer.
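
The definition above can be made concrete with a small Python sketch in which the output of one layer is passed as the input to the next. The weights here are hand-picked assumptions, not learned values, chosen so that the network computes exclusive-or (XOR) with one hidden layer of two neurons.

```python
def step(total):
    """Fire (1) when the summed input is non-negative, otherwise stay silent (0)."""
    return 1 if total >= 0 else 0


def layer(inputs, weights, biases):
    """Compute the outputs of one layer of neurons."""
    return [
        step(b + sum(w * x for w, x in zip(ws, inputs)))
        for ws, b in zip(weights, biases)
    ]


# Hidden layer: the first neuron acts like OR, the second like AND.
hidden_weights = [[1, 1], [1, 1]]
hidden_biases = [-1, -2]
# Output layer: fires when OR is on and AND is off.
output_weights = [[1, -1]]
output_biases = [-1]


def forward(x):
    hidden = layer(x, hidden_weights, hidden_biases)          # input -> hidden
    return layer(hidden, output_weights, output_biases)[0]    # hidden -> output


results = [forward([a, b]) for a in (0, 1) for b in (0, 1)]
```

The network outputs 1 only when exactly one of the two inputs is 1.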

Figure: A multilayer neural network.

The Georgetown-IBM Experiment

The Georgetown-IBM experiment translated Russian sentences into English. This demonstration of machine translation took place in 1954 and was meant to attract not only public interest but also funding.* The system specialized in organic chemistry and was quite limited, with only six grammar rules. An IBM 701 mainframe computer, designed by Nathaniel Rochester and launched in April 1953, ran the experiment.*

A feature article in the New York Times read, “A public demonstration of what is believed to be the first successful use of a machine to translate meaningful texts from one language to another took place here yesterday afternoon. This may be the cumulation of centuries of search by scholars for a mechanical translator.”

Figure: The Georgetown-IBM experiment translated sentences from Russian to English using a vocabulary of 250 words.

The demo worked in some cases, but it failed for most of the sentences. A way of verifying if the machine translated a phrase correctly was to translate it from English to Russian and then back into English. If the sentence had the same meaning or was similar to the original, then the translation worked. But in the experiment, many sentences ended up different from the original and with an entirely new meaning. For example, given the original sentence “The spirit is willing, but the flesh is weak,” the result was “The whiskey is strong, but the meat is rotten.”

The system simply could not understand the meaning, or semantics, of the sentence, making mistakes in translation as a result. The errors mounted, completely losing the original message.

The Dartmouth Conference

AI was defined as a field of research in computer science at a conference at Dartmouth College in the summer of 1956. Marvin Minsky, John McCarthy, Claude Shannon, and Nathaniel Rochester organized the conference. They would become known as the “founding fathers” of artificial intelligence.

At the conference, these researchers wrote a proposal to the US government for funding. They divided the field into six subfields of interest: computers, natural language processing, neural networks, theory of computation, abstraction, and creativity.

Figure: From left to right: Trenchard More, John McCarthy, Marvin Minsky, Oliver Selfridge, and Ray Solomonoff.

At the conference, many predicted that a machine as intelligent as a human being would exist in no more than a generation, about 25 years. As you know, that was an overestimation of how quickly development of artificial intelligence would proceed. The workshop lasted six weeks and started the funding boom into AI, which continued for 18 years until what would be called the First AI Winter.

The Defense Advanced Research Projects Agency (DARPA) provided most of the money that went into the field during the period known as the Golden Years of artificial intelligence.

During this “golden” period, the early AI pioneers set out to teach computers to do the same complicated mental tasks that humans do, breaking them into five subfields: reasoning, knowledge representation, planning, natural language processing (NLP), and perception.

These general-sounding terms do have specific technical meanings, still in use today:

  • Reasoning. When humans are presented with a problem, we can work through a solution using reasoning. This area covers all the tasks that make up that process. Examples include playing chess, solving algebra problems, proving geometry theorems, and diagnosing diseases.

  • Knowledge representation. In order to solve problems, hold conversations, and understand people, computers must have knowledge about the real world, and that knowledge must be represented in the computer somehow. What are objects, what are people? What is speech? Specific computer languages were invented for the purpose of programming these things into the computer, with Lisp being the most famous. The engineers building Siri had to solve this problem for it to respond to requests.

  • Planning. Robots must be able to navigate in the world we live in, and that takes planning. Computers must figure out, for example, how to move from point A to point B, how to understand what a door is, and where it is safe to go. This problem is critical for self-driving cars so they can drive around roads.

  • Natural language processing. Speaking and understanding a language, and forming and understanding sentences are skills needed for machines to communicate with humans. The Georgetown-IBM experiment was an early demonstration of work in this area.

  • Perception. To interact with the world, computers must be able to perceive it, that is, they need to be able to see, hear, and feel things. Sight was one of the first tasks that computer scientists tackled. The Rosenblatt perceptron was the first system to address such a problem.

The Golden Years of AI (1956–1974)

The question of whether a computer can think is no more interesting than the question of whether a submarine can swim. (Edsger Dijkstra)

The Golden Years of AI started with the development of Micro-Worlds by Marvin Minsky as well as John McCarthy’s development of Lisp, the first programming language optimized for artificial intelligence. This era was marked by the creation of the first chatbot, ELIZA, and Shakey, the first robot to move around on its own.

The years after the Dartmouth Conference were an era of discovery. The programs developed during this time were, to most people, simply astonishing. The next 18 years, from 1956 to 1974, were known as the Golden Years.* Most of the work developed in this era was done inside laboratories in universities across the United States. These years marked the development of the important AI labs at the Massachusetts Institute of Technology (MIT), Stanford, Carnegie Mellon University, and Yale. DARPA funded most of this research.*

MIT and Project MAC

MIT housed not a laboratory per se but what would be called Project MAC.* MAC was an acronym for Mathematics and Computation. The choice of creating a project instead of a lab stemmed from internal politics. Started by Robert Fano in July 1963, Project MAC would eventually turn into the Computer Science and Artificial Intelligence Lab (CSAIL) inside MIT. This project was responsible for research in the areas of artificial intelligence, operating systems, and theory of computation. DARPA provided a $2M grant for MIT’s Project MAC.

Marvin Minsky directed the AI Group inside Project MAC. John McCarthy was also a member of the group, and while there he created the high-level language Lisp in 1958, which became the dominant AI programming language for the next 30 years. At the time, credentialed computer scientists did not exist because universities did not have computer science programs yet. So, everyone involved in the project was either a mathematician, physicist, electrical engineer, or a dropout.

Figure: John McCarthy, Lisp language inventor.*

Project MAC was responsible for many inventions,* including the creation of the first computer-controlled robotic arm by Marvin Minsky and the first chess-playing* program. The program, developed by McCarthy’s students, beat beginner chess players and used the same main techniques as Deep Blue, the computer-chess program that would beat Grandmaster Garry Kasparov years later.


The world is composed of many environments, each with different rules and knowledge. Russian grammar rules differ from those of English, which are entirely different from the standards for geometry. In 1970, Minsky and Seymour Papert suggested constraining their research into isolated areas; that is, they would focus on Micro-Worlds.* They concentrated on specific domains to see if programs could understand language in an artificially limited context. Most of the computer programs developed during the Golden Years focused on these Micro-Worlds.

One such program was SHRDLU, which was written by Terry Winograd at the MIT AI Lab to understand natural language.* In this experiment, the computer worked with colored blocks using a robotic arm and a video camera. SHRDLU responded to commands typed in English, such as “Grasp the pyramid.” The goal of this process was to build one or more vertical stacks of blocks. Some blocks could not be placed on top of others, making the problem more complex.

But the tasks involved more than merely following commands. SHRDLU performed actions in order to answer questions correctly. For example, when the person typed, “Can a pyramid be supported by a pyramid?”, SHRDLU tried to stack two pyramids and failed. It then responded, “I can’t.” While many thought the SHRDLU program was a breakthrough, and it was considered a wildly successful demonstration of AI, Winograd realized that expanding outside the Micro-World for broader applications was impossible.

Figure: Marvin Minsky and his SHRDLU-controlled robotic arm.


After McCarthy left MIT in 1962,* he became a professor at Stanford, where he started a lab called the Artificial Intelligence Center.* The laboratory focused most of its energy on speech recognition, and some of their work became the foundation for Siri, Apple’s virtual assistant.* The laboratory also worked on robotics and created one of the first robots, Shakey. Developed from 1966 to 1972, it was the first robot to break down large tasks into smaller ones and execute them without a human directing the smaller jobs.

Shakey’s actions included traveling from one location to another, opening and closing doors, turning light switches on and off, and pushing movable objects.* The robot occupied a custom-built Micro-World consisting of walls, doors, and a few simple wooden blocks. The team painted the baseboards on each wall so that Shakey could “see” where the walls met the floor.

Lisp was the language used for the planning system, and STRIPS, the computer program responsible for planning Shakey’s actions, would become the basis for most automated planners. The robot included a radio antenna, television camera, processors, and collision-detection sensors. The robot’s tall structure and its tendency to shake resulted in its name. Shakey worked in an extremely limited environment, something critics pointed out, but even with these simplifications, Shakey still operated disturbingly slowly.

Figure: Shakey, the first self-driving robot.

Carnegie Mellon University

Another prominent laboratory working on artificial intelligence was inside Carnegie Mellon University. At CMU, Bruce T. Lowerre developed Harpy, a speech recognition system.* This work started around 1971, and DARPA funded five years of the research. Harpy was a breakthrough at the time because it recognized complete sentences. One difficulty in speech is knowing when one word ends and another begins. For example, “euthanasia” could be misheard as “youth in Asia.” By 1976, Harpy could understand speech for 1,011 words from different speakers and translate it into text with a 90% accuracy rate.*

The Automatic Language Processing Advisory Committee (ALPAC) was created in 1964 by the US government “to evaluate the progress in computational linguistics in machine translation.”* By 1966, the committee reported it was “very skeptical of research done in machine translation so far, and emphasiz[ed] the need for basic research in computational linguistics” instead of AI systems. Because of this negative view, the government greatly reduced its funding.


At Yale, Roger Schank and his team used Micro-Worlds to explore language processing. In 1975, the group began a program called SAM, an acronym for Script Applier Mechanism, that was developed to answer questions about simple stories concerning stereotypical matters such as dining in a restaurant and traveling on the subway.

The program could infer information that was implicit in the story. For example, when asked, “What did John order?” SAM replied, “John ordered lasagna,” even though the story stated only that John went to a restaurant and ate lasagna.* Schank’s team worked on a few different projects, and in 1977, their work also included another computer program called FRUMP, which summarized wire-service news reports into three different languages.

Geometry Theorem Prover

At IBM, Nathaniel Rochester and his colleagues produced some of the first AI programs. In 1959, Herbert Gelernter constructed the Geometry Theorem Prover, a program capable of proving theorems that many students of mathematics found quite tricky. His program “exploited two important ideas. One was the explicit use of subgoals (sometimes called ‘reasoning backward’ or ‘divide and conquer’), and the other was the use of a diagram to close off futile search paths.”* Gelernter’s program created a list of goals, subgoals, sub-subgoals, and so on, expanding more broadly and deeply until the goals were solvable. The program then traversed this chain to prove the theorem true or false.
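
The subgoal idea in the quotation can be sketched as a small backward-reasoning search in Python. This is not Gelernter's program: the rule table, the fact names, and the depth limit are hypothetical, and the second idea, using a diagram to rule out futile paths, is not modeled here.

```python
# Hypothetical rule base: each goal maps to alternative lists of subgoals.
rules = {
    "triangles_congruent": [["side_ab_eq_de", "angle_b_eq_e", "side_bc_eq_ef"]],
    "angle_b_eq_e": [["lines_parallel"]],
}
# Statements assumed to be given in the problem.
facts = {"side_ab_eq_de", "side_bc_eq_ef", "lines_parallel"}


def prove(goal, depth=0, max_depth=10):
    """Work backward from the goal until every subgoal is a known fact."""
    if goal in facts:
        return True
    if depth >= max_depth:
        return False  # give up on chains that grow too deep
    for subgoals in rules.get(goal, []):
        if all(prove(sub, depth + 1, max_depth) for sub in subgoals):
            return True
    return False


proved = prove("triangles_congruent")
unproved = prove("angles_supplementary")  # no rule or fact supports this goal
```

The first call succeeds because its subgoals eventually reduce to given facts; the second fails because no chain of subgoals reaches one.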


Figure: SAINT.

A heuristic is a rule that helps to find a solution for a problem by making guesses about the best strategy to use given the state.*

In 1961, James Slagle wrote the program SAINT (Symbolic Automatic INTegrator), which solved symbolic integration problems of the kind found in first-year college calculus. The SAINT system performed integration through a “heuristic” processing system.

SAINT divided the problem into subproblems, searched those for possible solutions, and then tested them. As soon as these subproblems were solved, SAINT could resolve the main one as well.

SAINT was an early forerunner of symbolic mathematics tools such as Wolfram Mathematica, which is widely used today in the scientific, engineering, and computational fields. SAINT, however, was not the only program that addressed school problems. Others, such as Daniel Bobrow’s program STUDENT, solved algebra word problems described in simple sentences like, “The consumption of my car is 15 miles per gallon.”*

The First Chatbot, ELIZA

Figure: ELIZA software running on a computer.

Created by Joseph Weizenbaum in 1964, ELIZA was the first version of a chatbot.* It did not pass the Turing test, but it was an early natural language processing program that demonstrated where AI could head in the future. It talked to anyone who typed sentences into a computer terminal with it installed.

ELIZA simply followed a few rules to try to identify the most important keywords in a sentence. With that information, the program attempted to reply based on that content. ELIZA disassembled the input and then reassembled it, creating a response using data entered by the user. For example, if the user entered, “You are very helpful,” ELIZA would take the input and first create the sentence, “What makes you think I am,” then it would add the rest from the deconstructed initial input, leading to the final sentence, “What makes you think I am very helpful?” If the program could not find such keywords, ELIZA responded with a remark that lacked content, like “Please go on.” or “I see.” ELIZA and today’s Alexa would not be too different from each other.
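The disassemble-and-reassemble step described above can be sketched in a few lines of Python. This is only an illustration, not Weizenbaum’s original script: the rule table `RULES`, the `FALLBACKS` list, and the `respond` function are hypothetical names, and the single pattern covers just the “You are …” example from the text.

```python
import re

# Hypothetical rule table: (keyword pattern, reassembly template).
# Only the "You are ..." example from the text is covered here.
RULES = [
    (re.compile(r"^you are (.+?)[.!?]*$", re.IGNORECASE),
     "What makes you think I am {0}?"),
]

# Content-free remarks used when no keyword rule matches.
FALLBACKS = ["Please go on.", "I see."]


def respond(sentence, turn=0):
    """Return a reply built from the user's own words, or a fallback."""
    text = sentence.strip()
    for pattern, template in RULES:
        match = pattern.match(text)
        if match:
            # Reassemble: fixed opening plus the captured remainder.
            return template.format(match.group(1))
    # No keyword found: cycle through the content-free remarks.
    return FALLBACKS[turn % len(FALLBACKS)]


print(respond("You are very helpful."))
print(respond("The weather is nice today."))
```

The program never models what “helpful” means; it only moves the user’s words into a template, which is why a handful of rules could sustain a conversation.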

The First AI Winter (1974–1980)

It’s difficult to be rigorous about whether a machine really ‘knows’, ‘thinks’, etc., because we’re hard put to define these things. We understand human mental processes only slightly better than a fish understands swimming.

John McCarthy*

The First AI Winter started with funds drying up after many of the early promises did not pan out as expected. The most famous idea coming out of this era was the Chinese room argument, one that I personally disagree with, that states that artificial intelligence systems can never achieve human-level intelligence.

Lack of Funding

From 1974 to 1980, AI funding declined drastically, making this time known as the First AI Winter. The term AI winter was an explicit reference to nuclear winter, the prolonged period of cold and darkness predicted to follow a nuclear war, when smoke and soot block out sunlight. In the same way, AI research was in such chaos that it would not receive funding for many years.

Critiques and financial setbacks, a consequence of the many unfulfilled promises during the early boom in AI, caused this era. From the beginning, AI researchers were not shy about making predictions of their future successes. The following statement by Herbert Simon in 1957 is often quoted, “It is not my aim to surprise or shock you … but the simplest way I can summarize is to say that there are now in the world machines that think, that can learn and that can create. Moreover, their ability to do these things is going to increase rapidly until—in a visible future—the range of problems they can handle will be coextensive with the range to which the human mind has been applied.”*

Terms such as “visible future” can be interpreted in various ways, but Simon also made more concrete predictions. He said that within 10 years a computer would be a chess champion and a machine would prove a significant mathematical theorem. With Deep Blue’s victory over Kasparov in 1997 and the proof of the Four Color Theorem in 2005 using general-purpose theorem-proving AI, these predictions came true after 40 to 50 years rather than the 10 that Simon predicted. Simon’s overconfidence was due to the promising performance of early AI systems on simple examples. However, in almost every case, these early systems turned out to fail miserably when applied to broader or more difficult problems.

The first type of complication arose because most early programs knew nothing of their subject matter but rather succeeded using simple syntactic manipulations. A typical story occurred in early machine translation efforts, which were generously funded by the US National Research Council in an attempt to speed up the translation of Russian scientific papers in the wake of the Sputnik launch in 1957. It was thought initially that simple syntactic transformations, based on the grammar rules of Russian and English and word replacements from an electronic dictionary, would suffice to preserve the exact meanings of sentences. In fact, accurate translation requires background knowledge to resolve ambiguity and establish the content of the sentence. A report by ALPAC criticizing machine translation efforts caused another setback. After spending $20M, the National Academy of Sciences, Engineering, and Medicine ended support for AI research based on this report.

Much criticism also came from AI researchers themselves. In 1969, Minsky and Papert published a book-length critique of perceptrons, the basis of early neural networks.* They showed that a single-layer perceptron cannot represent some simple functions, such as exclusive or (XOR), and they doubted that networks with more layers would prove powerful enough to replicate intelligence. Ironically, multilayer neural networks, also known as deep neural networks (DNNs), would eventually cause an enormous revolution in multiple tasks, including language translation and image recognition, and become the go-to machine learning technique for researchers.

In 1973, following the same pattern of criticism of AI research, a report known as the Lighthill Report, written by James Lighthill for the British Science Research Council, gave a very pessimistic forecast of the field.* It stated, “In no part of the field have discoveries made so far produced the major impact that was then promised.” Following this report and others, DARPA withdrew its funding from the Speech Understanding Research at CMU, canceling $3M of annual grants. Another significant setback for AI funding came from the Mansfield Amendment, passed by Congress, which limited military funding for research that lacked a direct or apparent relationship to a specific military function.* It resulted in DARPA funding drying up for many AI projects.


Criticism came from everywhere, including philosophers. Hubert Dreyfus, an MIT philosophy professor, criticized what he called the biological and psychological assumptions of AI.*

The biological assumption refers to the brain being analogous to computer hardware and the mind being equivalent to computer software. The psychological assumption is that the mind performs discrete computations on discrete representations or symbols. Unfortunately, these concerns were not taken seriously by AI researchers. Dreyfus was given the cold shoulder and later claimed that AI researchers “dared not be seen having lunch with me.”

The Chinese Room Argument

One of the strongest and most well-known arguments against machines ever having real intelligence marked the end of the AI Winter. In 1980, John Searle, a philosophy professor at the University of California, Berkeley, introduced the Chinese room argument as a response to the Turing test. This argument proposed that a computer program cannot give a computer a mind, understanding, or consciousness.

Searle compared a machine’s understanding of the Chinese language to the understanding of someone who does not know Chinese but can read the dictionary and translate every word from English to Chinese and vice versa. In the same way, a machine would not have real intelligence. His argument stated that even if the device passed the Turing test, it did not mean that the computer literally had intelligence. A computer that translates English to Chinese does not necessarily understand Chinese. It could indicate that it is merely simulating the intelligence needed to understand Chinese.

Searle used the term Strong AI to refer to a machine with real intelligence, the equivalent of understanding Chinese instead of only translating word by word. But in the case when computers do not have real intelligence, such as simply translating Chinese words instead of actually understanding the meanings of the words, the machine has Weak AI.

Figure: Representation of the Chinese room argument.

In the Chinese room argument, a person inside a room who has a dictionary, takes in English sentences, and outputs Chinese sentences does not really understand Chinese.

Searle said that there would not be a theoretical difference between him using a device that translates directly from a dictionary versus using one that understands Chinese. Each simply follows a program, step by step, which demonstrates a behavior described as intelligence. But in the end, the one who translates using a dictionary would not be able to understand Chinese, in the same way that a computer can appear to have intelligence without actually having it. Some argue that Strong AI and Weak AI are the same with no real theoretical difference between the two.

The problem with this argument is that there is not a clear boundary between Weak AI and Strong AI. How can you determine whether someone really understands Chinese or they are mimicking the behavior? If the machine can translate every possible sentence, is that understanding or mimicking?

The AI Boom (1980–1987)

It’s fair to say that we have advanced further in duplicating human thought than human movement.

Garry Kasparov*

This era was marked by expert systems and increased funding in the 80s. The development of Cog, iRobot, and Roomba by Rodney Brooks and the creation of Gammonoid, the first software to win backgammon against a world champion, both took place during this period. This era ended with Deep Blue, the first computer software to win against a world-champion chess player.

After the First AI Winter, research picked up with new techniques that showed great results, heating up investment in research and development in the area and sparking the creation of new AI applications in enterprises. Simply put, the 1980s saw the rebirth of AI.

The research up until 1980 focused on general-purpose search mechanisms trying to string together elementary reasoning steps to find complete solutions. Such approaches were called weak methods because, although they applied to general problems, they did not scale up to larger or more difficult situations.

The alternative to these weak methods was to use more powerful domain-specific knowledge in expert systems that allowed more reasoning steps and could better handle typically occurring cases in narrow areas of expertise.* One might say that to solve a hard problem, you almost have to know the answer already. This philosophy was the leading principle for the AI boom from around 1980 to 1987.

Expert Systems

The basic components of an expert system are the knowledge base and the inference engine. The knowledge base consists of all the important data for the domain-specific task. For example, in chess, this knowledge includes all the game’s rules and the points that each piece represents when playing. The information in the knowledge base is typically obtained by surveying experts in the area in question.

The inference engine enables the expert system to draw conclusions through simple rules like: If something happens, then something else happens. Using the knowledge base with these simple rules, the inference engine figures out what the system should do based on what it observes. In a chess game, the system can decide which piece to move by analyzing which moves are possible and which are the best ones based on the pieces remaining on the board. The inference engine makes a decision based on this knowledge.
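To make the split between knowledge base and inference engine concrete, here is a minimal forward-chaining sketch in Python. It is an assumption-laden toy, not the design of any historical expert system: the facts and rules (for example `queen_attacked` and `move_queen`) are invented for illustration, and the `forward_chain` function is a hypothetical name.

```python
def forward_chain(facts, rules):
    """Apply if-then rules repeatedly until no new fact can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            # Fire a rule when all of its conditions are already known.
            if conclusion not in known and set(conditions) <= known:
                known.add(conclusion)
                changed = True
    return known


# Hypothetical knowledge base: each rule is (conditions, conclusion).
RULES = [
    (("queen_attacked", "queen_undefended"), "queen_in_danger"),
    (("queen_in_danger", "safe_square_available"), "move_queen"),
]

# Observations about the current board (also invented for this sketch).
observed = {"queen_attacked", "queen_undefended", "safe_square_available"}

derived = forward_chain(observed, RULES)
print("move_queen" in derived)
```

The engine itself knows nothing about chess. All domain knowledge lives in `RULES`, so the same loop could run over a knowledge base for a different domain.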

To understand the upsurge in AI, we must look at the state of computer hardware. Personal computers made huge strides from the early to mid-80s. While the Tandem/16 was available, the Apple II, TRS-80 Model I, and Commodore PET were marketed to single users for a much lower price and better sound and graphics, although with less power regarding memory. The target use for these computers was for word processing, video games, and school work. As the 1980s progressed, so did computers with the release of Lotus 1-2-3 in 1983 and the introduction of the Apple Macintosh and the modern graphical user interface (GUI) in 1984. Personal computing hardware exploded by leaps and bounds during this time. And, the same was true for AI.

The domain-specific systems of expert systems also proliferated. Large corporations around the world began adopting these systems because they leveraged desktop computers rather than expensive mainframes. For example, expert systems helped Wall Street firms automate and simplify decision making in their electronic trading systems. Suddenly, some people started assuming that computers could be intelligent again.

AI research and development projects received over $1B, and universities around the world with AI departments cheered. Companies developed not only new software but also specialized hardware to meet the needs of these new AI applications. Computer languages such as Prolog and Lisp, both created before the boom, were widely adopted to address the needs of these applications. Prolog, for example, was built around a rule-based system with primitives, such as a “rule,” that define and build on these if-then statements.

Many AI companies were created, including hardware-specific companies like Symbolics and Lisp Machines and software-specific companies like Intellicorp and Aion. Venture capital firms, which invest in early-stage companies, emerged to fund these new tech startups with visions of billion-dollar exits. For the first time, technology firms received a dedicated pool of money, which sped up the development of AI.

In part, the explosion of expert systems was due to the success of XCON (for eXpert CONfigurer), which was written by John P. McDermott at CMU in 1978.* An example of a rule that XCON had in its repertoire was:

If: the current context is assigning devices to unibus modules and there is an unassigned dual port disk drive and the type of controller it requires is known and there are two such controllers neither of which has any devices assigned to it and the number of devices that these controllers can support is known

Then: assign the disk drive to each of the controllers and note that the two controllers have been associated and that each supports one device*

Digital Equipment Corporation (DEC) used the system to automatically select computer components that met user requirements, even though internal experts sometimes disagreed regarding the best configuration. Before XCON, individual components, such as cables and connections, had to be manually selected, resulting in mismatched components and missing or extra parts. By 1986, XCON had achieved 95% to 98% accuracy and saved $25M annually, arguably saving DEC from bankruptcy.


Figure: Rodney Brooks, with his two robots, Sawyer and Baxter.*

Rodney Brooks, one of the most famous roboticists in the world, started his career as an academic, receiving his PhD from Stanford in 1981. Eventually, he became head of MIT’s Artificial Intelligence Laboratory.

At MIT, Rodney Brooks* defined the subsumption architecture, a reactive robotic architecture.* Robots that followed this architecture guided themselves and moved around based on reactive behaviors. That meant that robots had rules on how to act based on the state of the world and would react to the world as it was at that moment. This was different from the standard way of programming robots, in which they created a model of the world and made decisions based on that potentially stale information.

For Brooks, robots had a set of sub-behaviors that were organized in a hierarchy. Robots interacted with the world and reacted to it based on those behaviors. This theory and the demonstration of his work made him famous worldwide and, consequently, slowed down robotics research in Japan. At the time, Japanese research focused on a different software architecture for robotics, and with the success of this new theory, investment in other types of robotics research dwindled, especially in Japan.
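A hierarchy of reactive behaviors can be sketched as a priority list, where each behavior looks only at the current sensor readings and the first one with something to say takes control. This is a simplified arbitration scheme in the spirit of the subsumption architecture, not Brooks’s actual wiring of suppression and inhibition between layers; the behavior names and sensor keys below are hypothetical.

```python
def avoid_obstacle(sensors):
    """Reflex layer: steer away when something is directly ahead."""
    if sensors.get("obstacle_ahead"):
        return "turn_left"
    return None


def seek_light(sensors):
    """Goal layer: head for a light source when one is visible."""
    if sensors.get("light_visible"):
        return "move_toward_light"
    return None


def wander(sensors):
    """Default layer: always has a suggestion."""
    return "move_forward"


# Behaviors in priority order (hypothetical ordering for this sketch).
LAYERS = [avoid_obstacle, seek_light, wander]


def act(sensors):
    """Pick an action from the current sensor readings only."""
    for behavior in LAYERS:
        action = behavior(sensors)
        if action is not None:
            return action


print(act({"obstacle_ahead": True, "light_visible": True}))
print(act({"light_visible": True}))
print(act({}))
```

Note that `act` keeps no map or memory between calls. Every decision comes from what the sensors report at that moment, which is the contrast with model-based robot programming drawn in the text.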

Based on the subsumption architecture, Brooks developed a robot that could grab a coke bottle, something that was unimaginable before. He believed that the key to achieving intelligence was to build a machine that experienced the world in the same way a human does. Brooks was considered a maverick in his field. He used to say, “Most of my colleagues here in the lab do very different things and have only contempt for my work,”* saying that most of the other researchers were pursuing GOFAI, or Good Old-Fashioned Artificial Intelligence, or “brain in the box.” Instead, he stated that his robots had full knowledge about the real world and interacted with it. He used to say that GOFAI is like intelligence that only has access to the Korean dictionary: it can have self-consistent definitions, but they are unconnected to the world in any way.

In his 1990 paper, “Elephants Don’t Play Chess,”* he took direct aim at the physical symbol system hypothesis,* which states that a system based on rules for managing symbols has the necessary and sufficient means for general intelligent action. It implies that human thinking can be reduced to symbol processing and that machines can achieve artificial general intelligence with a symbolic system.

For example, a chess game can be reduced to a symbolic mathematical system where pieces become symbols that are manipulated in each turn. The physical symbol system hypothesis stated that all types of thinking boiled down to mathematical formulas that can be manipulated. Brooks disagreed with the theory, arguing that symbols are not always necessary since “the world is its own best model. It is always exactly up to date. It always has every detail there is to be known. The trick is to sense it appropriately and often enough.” Therefore, you could create intelligent machines without having to create a model of the world.

Brooks started his research working on robotic insects that could move around their environment. At the time, researchers could not build robotic insects that moved quickly, even though that seems trivial. Brooks argued that the reason why robots from other researchers were slow is that they were GOFAI robots that relied on a central computer with a three-dimensional map of the terrain. He argued that such a system was not necessary and was even cumbersome for achieving the task of making robots move around their environment.

Brooks’s robots were different. Instead of having a central processor, each of his insects’ legs contained a circuit with a few simple rules, such as telling the leg to swing if it was in the air or move backward if it was on the ground. With these rules tied together, plus the interaction of the body with the circuits, the robots would walk similarly to how actual insects walk. The computation that the robots performed was always coupled physically with their body.

Figure: Cog with Rodney Brooks.*

Brooks’s initial plan was to move up the biological ladder of the animal kingdom. First, he would work on robots that looked like insects, then an iguana, a simple mammal, a cat, a monkey, and eventually a human. But he realized that his plan would take too long, so he jumped directly from insects to building a humanoid robot named Cog.*

Cog was composed of black and chrome metal beams, motors, wires, and a six-foot-tall rack of boards containing many processing chips, each as powerful as a Mac II. It had two eyes, each consisting of a pair of cameras: one fisheye lens to give a broad view of what was going on, and one normal lens that gave this humanoid a higher-resolution representation of what was directly in front of it.

Cog had only the basics from the software standpoint: some primitive vision, a little comprehensive hearing, some sound generation, and rough motor control. Instead of having the behavior programmed into it, Cog developed its behavior on its own by reacting to the environment. But it could do little besides wiggle its body and wave its arm. Ultimately, it ended up in a history museum in Boston.

Figure: Roomba robot cleaner.*

In 1990, Professor Brooks cofounded iRobot. The name is a tribute to Isaac Asimov’s science fiction book I, Robot. In the years ahead, the company built robots for the military to perform tasks like disarming bombs. Eventually, iRobot became well-known for creating home robots. One of its most famous robots is a vacuum cleaner named Roomba.

The way ants and bees search for food inspired the software behind Roomba.* When the robot starts, it moves in a spiral pattern. It spirals outward over a larger and larger area until it hits an object. When it encounters something, it follows the edge of that object for a period of time. Then, it crisscrosses, trying to determine the largest distance it can go without hitting something else. This process helps Roomba work out how large the space is. But if it goes too long without hitting a wall, the vacuum cleaner starts spiraling again because it figures it is in an open space. It constantly calculates how wide the area is.

Figure: A visualization of how the Roomba algorithm works.

Roomba combines this strategy with another one based on its underneath dirt sensors. When it detects dirt, it changes its behavior to cover the immediate area. It then searches for another dirty area on a straight path. According to iRobot, these different combined patterns create the most effective way to traverse a room. At least 10 million Roombas have been sold worldwide.
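The behavior switching described above can be read as a small state machine. The sketch below is a loose interpretation of the text, not iRobot’s software: the mode names, the `next_mode` function, and the step thresholds are all assumptions made for illustration.

```python
def next_mode(mode, bumped, steps_since_bump, dirt_detected,
              edge_follow_steps=20, open_space_steps=50):
    """Return the next cleaning mode given the current mode and sensor events.

    The thresholds are made-up defaults, not values from a real robot.
    """
    if dirt_detected:
        # Dirt overrides everything: cover the immediate area.
        return "spot_clean"
    if mode == "spot_clean":
        # Dirt is gone: head off on straight paths again.
        return "crisscross"
    if mode == "spiral":
        # Keep widening the spiral until something is hit.
        return "follow_edge" if bumped else "spiral"
    if mode == "follow_edge":
        # Hug the obstacle for a while, then cut across the room.
        if steps_since_bump >= edge_follow_steps:
            return "crisscross"
        return "follow_edge"
    if mode == "crisscross":
        # A long run without a bump suggests open space: spiral again.
        if not bumped and steps_since_bump >= open_space_steps:
            return "spiral"
        return "crisscross"
    raise ValueError(f"unknown mode: {mode}")


print(next_mode("spiral", bumped=True, steps_since_bump=0, dirt_detected=False))
print(next_mode("crisscross", bumped=False, steps_since_bump=50, dirt_detected=False))
```

Each call looks only at the latest sensor events, so the robot needs no floor plan to end up covering most of a room.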

Brooks left iRobot to start a new company, Rethink Robotics, which specializes in making robots for manufacturing. In 2012, they developed their first robot, named Baxter, to help small manufacturers pack objects. Rethink Robotics introduced Sawyer in 2015 to perform more detailed tasks.


Professor Hans Berliner from CMU created a program called BKG 9.9 that in July 1979 played the world backgammon champion in Monte Carlo.* The program controlled a robot called Gammonoid, whose software was one of the largest examples of an expert system. Gammonoid’s software was running on a giant computer at CMU in Pittsburgh, Pennsylvania, 4,000 miles away, and gave the robot instructions via satellite communication. The winner of the human-versus-robot games would take home $5K. Not much was expected of the programmed robot because the players knew that the existing microprocessors could not play backgammon well. Why would a robot be any different?

The opening ceremony reinforced the view that the robot would not win. When appearing on the stage in front of everyone, the robot entangled itself in the curtains, delaying its appearance. Despite this, Gammonoid became the first computer program to win the world championship of a board or card game.* It won seven games to one. Luigi Villa, the human opponent, could hardly believe it and thought the program was lucky to have won two of the games, the third and the final one. But Professor Berliner and some of his researchers had been working on this backgammon machine for years.

The former world champion, Paul Magriel, commented on the game, “Look at this play. I didn’t even consider this play … This is certainly not a human play. This machine is relentless. Oh, it’s aggressive. It’s a really courageous machine.” Artificial intelligence systems were starting to show their power.

Deep Blue

With this new power came the birth of systems like HiTech and Deep Thought, which would eventually defeat chess masters. These were the precursors of Deep Blue, the system that would become the world-champion chess software, and they were all developed in laboratories at Carnegie Mellon University.

Deep Thought, initially developed by researcher Feng-hsiung Hsu in 1985, was sponsored and eventually bought by IBM. Deep Thought went on to win against all other computer programs in the 1988 North American Computer Chess Championship. The AI then competed against Bent Larsen, a Danish chess Grandmaster, and became the first computer program to win a chess match against a Grandmaster. The results impressed IBM, so they decided to bring development in-house, acquiring the program in 1995 and renaming it Deep Blue.

Figure: Garry Kasparov playing against IBM’s Deep Blue.

In the following year, 1996, it competed against the world chess champion, Garry Kasparov. Deep Blue became the first machine to win a chess game against a reigning world champion under regular time controls, although it ultimately lost the match. They played six games; Deep Blue won one, drew two, and lost three.

The next year in New York City, Deep Blue, now unofficially called Deeper Blue, improved and won the rematch against Garry Kasparov, winning two games, drawing three, and losing one. It was the first time a computer program defeated a reigning world chess champion in a match.

Deep Blue’s Brain

The software behind Deep Blue, which won against Kasparov, was running an algorithm called min-max search with a few additional tricks. It searched through millions or billions of possibilities to find the best move. Deep Blue was able to look at an average of 100 million chess positions per second while computing a move. It analyzed the current state and figured out the potential next moves based on the game rules. The program also calculated the possible “value” of the next play based on each player’s best moves after that move and what those would mean to the game. This process used an inference system on a knowledge base—an expert system.

Deep Blue’s algorithm minimized the possible loss for a worst-case scenario while maximizing the minimum gain for a potential win. It maximized the points that the software would get in the next moves and minimized the points that the opponent could get given a certain play. Hence the name min-max search. In a chess match, one way to estimate the outcome of a position is to look at how many chess pieces remain on the board for each player. Each piece has a value attached to it. For example, a knight is worth three points, a rook five, and a queen ten. So, in a state where a player loses a rook and a knight, that position is worth eight points less for that player.

Figure: An example of the min-max algorithm. We can calculate the “value” of each board by looking at the value at the bottom positions in the tree. For example, when there is one white knight and one black knight and a black rook, the total value of the board for the white player is 3-(3+5)=-5. So, we can calculate the value of each board at the bottom level. The value of the board at the next level up is the minimum value of all the ones below it, giving -8 and -5 as the board values for the second row. And at the top row, we use the maximum value, which is -5.

In another example, pawns could be worth one point, bishops and knights three, and a queen nine. The figure above depicts a simplified version of a game and demonstrates how the min-max algorithm works. On the first play, the white bishop can take a black bishop or knight, and either move would mean that the opponent would lose three points. So, both moves would be the same if they were the only factors involved in the search for the best move. But the software also analyzes the moves that the opponent could make in response and determines what the best move is for the opponent. If the bishop takes the opponent’s knight on the first move (the board on the left), the best countermove for the opponent is to take the bishop with its rook. That is not so good for the white player, the computer, because black would take its bishop, removing the three points it gained. The other possible first move for the white player is to take the bishop (the right board). If it does that, the other player cannot take any white piece, which is very good for the white player, because it would be up three points. So, if the software analyzed only the next two moves, that would be the best move. Therefore, looking at only the possible next two moves, taking the bishop is the better of the two options.
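The two-move lookahead in this example can be written as a short recursive function. The sketch below is a plain min-max search over a hand-built tree, not Deep Blue’s actual code: the `minimax` and `best_move` names and the tree layout are assumptions, and the leaf numbers are White’s material gain in points as described in the text (three for a bishop or knight).

```python
def minimax(node, maximizing):
    """Return the value of a game tree.

    Leaves are numbers (material balance for White); inner nodes are
    lists of child nodes. White maximizes, Black minimizes.
    """
    if isinstance(node, (int, float)):
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)


# White's two candidate first moves and the outcomes of Black's replies.
# Values are White's net material gain after both moves (illustrative).
TREE = {
    # Black's rook retakes the bishop (net 0), or Black plays elsewhere (+3).
    "bishop takes knight": [0, 3],
    # Black has no capture available, so White stays up three points.
    "bishop takes bishop": [3, 3],
}


def best_move(tree):
    """Choose White's move assuming Black answers with its best reply."""
    return max(tree, key=lambda move: minimax(tree[move], maximizing=False))


print(best_move(TREE))
```

Black’s best reply turns “bishop takes knight” into a net gain of zero, so the search prefers “bishop takes bishop,” matching the reasoning in the paragraph above. Searching deeper only means nesting more lists inside the tree.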

Deep Blue, however, examined more than two moves ahead. It looked at all the possible future moves that it could, based on its time limit and the information it had, and chose the best move. This is why it needed to analyze so many moves per second. Deep Blue was the first such system to analyze that many possible future scenarios. This system and many of the future developments in artificial intelligence required a lot of computing power. That is no surprise because the human brain also has a very powerful computing capability. When playing chess, human players also examine future moves, but humans rely on their memory of past games and the best moves in a certain scenario. Programs for more complex games like Go use memory like this, but for chess, it was not necessary since Deep Blue could analyze most of the possible future scenarios while playing the game. For more complex games, software simply cannot look at all possible scenarios in a timely manner.

The Second AI Winter (1987–1993)

I am inclined to doubt that anything very resembling formal logic could be a good model for human reasoning.

Marvin Minsky*

Beginning in 1987, funds once again dried up for several years. The era was marked by the qualification problem, which many AI systems encountered at the time.

An expert system requires a lot of data to create the knowledge base used by its inference engine, and unfortunately, storage was expensive in the 1980s. Even though personal computers grew in use during the decade, they had at most 44MB of storage in 1986. For comparison, three minutes of uncompressed CD-quality audio takes up around 30MB. So, you couldn’t store much on these PCs.

Not only that, but the cost to develop these systems for each company was difficult to justify. Many corporations simply could not afford the costs of AI systems. Added to that were problems with limited computing power. Some AI startups, such as Lisp Machines and Symbolics, developed specialized computing hardware that could process specialized AI languages like Lisp, but the cost of the AI-specific equipment outweighed the promised business returns. Companies realized that they could use far cheaper hardware with less-intelligent systems but still obtain similar business outcomes.

A warning sign for the new wave of interest in AI was that expert systems were unable to solve specific, computationally hard logic problems, like predicting customer demand or determining the impact of resources from multiple, highly variable inputs. Newly introduced enterprise resource planning (ERP) applications started replacing expert systems. ERP systems dealt with problems like customer relationship management and supplier relationship management, and they proved very valuable to large enterprises.

The Qualification Problem

The qualification problem states that there is no way to predict all the possible outcomes and circumstances that could prevent the successful performance of an action, but the system still must recover from these unexpected failures. Reasoning agents in real-world environments rely on a solution to the qualification problem to make useful predictions.

For example, imagine that a program needs to drive a car with only if and then rules. A multitude of unexpected cases makes it impossible to handwrite all the rules for the application. Identifying cars and pedestrians is already extremely hard. A self-driving car not only needs to identify objects but also needs to drive around things (and people) based on its detection of such objects. Most people do not think of all the possible cases before they start writing the program. For example, if the program detects a human, is that human a pedestrian, a reflection, or someone riding in the bed of a pickup truck? The program also needs to be able to tell when vehicles tow other vehicles. These examples only scratch the surface of possible exceptions to the rules.

Unmet Promises

Expert systems fell prey to the qualification problem, and that caused a collapse in AI funding because the systems could not achieve much of what they promised. The Second AI Winter began with the sudden collapse of the market for specialized AI hardware in 1987.* Desktop computers from IBM and Apple were steadily gaining market share. But 1987 became the turning point for these AI manufacturers when Apple’s and IBM’s computers became more powerful and cheaper than the specialized Lisp machines. Not only that, but Reagan’s Star Wars missile defense program experienced a huge slowdown because DARPA had invested heavily in AI solutions. This event, in turn, severely damaged Symbolics, one of the main Lisp machine makers, creating a cascading effect.

Figure: One of the computers developed by the Fifth Generation Computer Systems program.

In addition to that, Fifth Generation Computer Systems, which was an initiative by the Japanese government to create a computer using massively parallel computing, was shut down during this period. The name of the project came from the fact that up until this time, there had been four generations of computing hardware:

  1. Vacuum tubes,

  2. Transistors and diodes,

  3. Integrated circuits, and

  4. Microprocessors.

The Japanese initiative represented a new generation of computers. Previously, computers focused on increasing the number of logic components in a single central processing unit (CPU), but Japan’s project, and others of its time, focused on boosting the number of CPUs for better performance. This enormous computer was intended to be a platform for future development in artificial intelligence. Its goal was to respond to natural language input and be capable of learning. But general-purpose Intel x86 machines and Sun workstations had begun surpassing specialized computer hardware. Because of that and the high cost of the project—around $500M in total—the Japanese cut the initiative after a decade. The project’s end marked a failure of the massively parallel processing approach to AI.

In the United States, most of the projects of this era were also not working as expected. Eventually, the first successful expert system, XCON, proved too expensive to maintain. The system was complicated to update, could not learn, and suffered from the qualification problem.

The Strategic Computing Initiative (SCI),* another large program developed by the US government from 1983 to 1993, was inspired by Japan’s Fifth Generation Computer Systems project. It focused on chip design, manufacturing, and computer architecture for AI systems. The integrated program included projects at companies and universities that were designed to eventually come together. Funded by DARPA, the effort “was supposed to develop a machine that would run ten billion instructions per second to see, hear, speak, and think like a human.”* By the late 1980s, however, it was apparent that the initiative would not succeed in its AI goals, leading DARPA to cut funding “deeply and brutally.” This event, in addition to the numerous companies that had gone out of business, led to the Second AI Winter. The beginning of probabilistic reasoning marked the end of this winter and provided an altogether new approach to AI.

Probabilistic Reasoning (1993–2011)

Probability is orderly opinion and inference from data is nothing other than the revision of such opinion in the light of relevant new information.*

Probabilistic reasoning was a fundamental shift from the way that problems were addressed previously. Instead of adding facts, researchers started using probabilities for the occurrence of facts and events, building networks of how the probability of each event occurring affects the probability of others. Each event has a probability associated with it, as does each sequence of events. These probabilities plus observations of the world are used to determine, for example, what is the state of the world and what actions are appropriate to take.

Probabilistic reasoning involves techniques that leverage the probability that events will occur. Judea Pearl’s influential work, in particular with Bayesian networks, gave new life to AI research and was central to this period. Maximum likelihood estimation was another important technique used in probabilistic reasoning. IBM Watson, one of the last successful systems to use probabilistic reasoning, built on these foundations to beat the best humans at Jeopardy!

Judea Pearl

The work pioneered by Judea Pearl marked the end of the Second AI Winter.* His efforts ushered in a new era, arguably creating a fundamental shift in how AI was applied to everyday situations. One could even go so far as to say that his work laid much of the groundwork for artificial intelligence systems up to the end of the 1990s and the rise of deep learning. In 1985, Pearl, a professor at the University of California, Los Angeles, introduced the concept of Bayesian networks.* His new approach made it possible for computers to calculate probable outcomes based on the information they had. He had not only a conceptual insight but also a technical framework to make it practical. Pearl’s 1988 book, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, became the bible of AI at the time.

The techniques developed in this area were much more useful than just logic alone because probabilities represent more information than a conclusion. For example, stating “there is a 75% chance of rain in the next hour” conveys more information than “it is going to rain in the next hour,” especially because events in the future are not certain.

Bayesian networks, named after the 18th-century mathematician Thomas Bayes, provided a basic calculus for reasoning with uncertain information, which is everywhere in the real world. In particular, Pearl’s technique was the foundational framework for reasoning with imperfect data that changed how people approached real-world problem solving. Pearl’s research was instrumental in moving machine-based reasoning from the rules-bound expert systems of the 1980s to a calculus that incorporated uncertainty and probabilistic models. In other words, he figured out methods for trying to draw the best conclusion even when there is a degree of unpredictability.

Bayesian networks were applied when trying to answer questions from a vast amount of unstructured information or when trying to figure out what someone said in languages like Chinese that have many similar-sounding words. Pearl’s work applied to an extensive range of applications, from medicine and genetics to information retrieval and spam filtering, in which only partial information is available.*

Bayesian networks provided a compact way of representing probability distributions. The Bayesian network formalism was invented to allow efficient representation and rigorous reasoning with uncertain knowledge. This approach largely overcame many problems of the systems of the 1960s and 1970s, and, simply put, dominated AI research on uncertain reasoning. In the 1960s, to overcome the problems of time and space complexity, simplifying assumptions had to be made, and the systems were small in scale and computationally expensive.* The 1970s shifted to using probability theory, but unfortunately, this theory could not be applied straightforwardly. Even with modifications, it could not solve the problems of uncertainty.*

Bayesian Networks

Bayesian networks are a way of representing the dependencies between events and how the occurrences or probabilities of events affect the probabilities of other events. They are based on Bayes’s theorem, which states how the probability of an event should be updated depending on whether other relevant events have happened or on the probability that they will happen.

For example, the likelihood of a person having cancer increases as the age of that person goes up. Therefore, a person’s age can be used to more accurately assess the probability that they have cancer. Not only that, but Bayes’s rule also applies in the other direction. For example, if you find out that someone has cancer, then the probability that the person is older is higher.
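The two directions of the age and cancer example can be sketched with Bayes’s rule. The numbers below are hypothetical, chosen only to make the arithmetic easy to follow, and the function name `posterior` is an assumption of this sketch rather than anything from the book.

```python
def posterior(prior, likelihood, likelihood_if_not):
    """Bayes's rule: P(H | E) from P(H), P(E | H), and P(E | not H)."""
    evidence = prior * likelihood + (1 - prior) * likelihood_if_not
    return prior * likelihood / evidence


# Hypothetical probabilities, for illustration only.
p_older = 0.30                  # P(person is older)
p_cancer_given_older = 0.05     # P(cancer | older)
p_cancer_given_younger = 0.01   # P(cancer | not older)

# Reverse direction: how likely is it that a person with cancer is older?
p_older_given_cancer = posterior(
    p_older, p_cancer_given_older, p_cancer_given_younger
)

print(f"P(older)          = {p_older:.2f}")
print(f"P(older | cancer) = {p_older_given_cancer:.2f}")
```

With these made-up inputs, learning that someone has cancer raises the probability that the person is older from 0.30 to roughly 0.68.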

Figure: A Bayesian Network.

This figure shows an example of a simple Bayesian network. Each node has a table with its probability conditioned on the nodes that lead into it. You can calculate the probability that the grass is wet given that it is raining and the sprinkler is not on. Or you could, for example, determine the chances that a sprinkler is running if it is not raining and the grass is wet. The diagram simply shows the probability of one outcome based on the previous ones. Bayesian networks help determine the probability of something happening given the observation of other states.
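A rain, sprinkler, and wet-grass network like the one described can be queried by brute-force enumeration. The sketch below is a minimal illustration: the probability tables are assumed values, not the ones in the figure, and the function names `joint` and `query` are hypothetical.

```python
from itertools import product

# Assumed conditional probability tables (not taken from the figure).
P_RAIN = 0.2
P_SPRINKLER_GIVEN_RAIN = {True: 0.01, False: 0.4}
# P(grass wet | sprinkler, rain), keyed by (sprinkler, rain).
P_WET_GIVEN = {
    (True, True): 0.99,
    (True, False): 0.90,
    (False, True): 0.80,
    (False, False): 0.05,
}


def joint(rain, sprinkler, wet):
    """Probability of one complete assignment, using the chain rule."""
    p = P_RAIN if rain else 1 - P_RAIN
    p_s = P_SPRINKLER_GIVEN_RAIN[rain]
    p *= p_s if sprinkler else 1 - p_s
    p_w = P_WET_GIVEN[(sprinkler, rain)]
    p *= p_w if wet else 1 - p_w
    return p


def query(target, evidence):
    """P(target is True | evidence), summing over every possible world."""
    names = ("rain", "sprinkler", "wet")
    numerator = denominator = 0.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        if any(world[name] != value for name, value in evidence.items()):
            continue  # this world contradicts what was observed
        p = joint(**world)
        denominator += p
        if world[target]:
            numerator += p
    return numerator / denominator


# The two questions from the text.
p_wet = query("wet", {"rain": True, "sprinkler": False})
p_sprinkler = query("sprinkler", {"rain": False, "wet": True})
print(f"P(wet | rain, no sprinkler) = {p_wet:.3f}")
print(f"P(sprinkler | no rain, wet) = {p_sprinkler:.3f}")
```

Enumeration is only practical for tiny networks like this one, since the number of worlds doubles with every added variable. It is used here because it makes the definition of a conditional probability explicit.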

In this case, the rules are simple to assess, but when you have a lot of dependent events, Bayesian networks are a way of representing them and their dependencies. For example, stock prices may depend on many factors, including public sentiment, the central bank interest rates and bond prices, and the trading volume at the moment. A Bayesian network represents all these dependencies.

Bayesian networks addressed these problems by adding a framework that researchers could use when dealing with them. Even though Bayesian networks were useful in the 1990s, probabilistic reasoning does not address all possible cases due to the qualification problem described in the previous chapter. That is why, I believe, probabilistic reasoning fell out of favor in the early 2000s and deep learning has taken over the field since then. A few probabilities cannot describe how complex the world is.

Maximum Likelihood Estimation

Another technique used frequently during these years was maximum likelihood estimation (MLE). Based on a model of how the world should work and the observation of what is in the world, maximum likelihood estimation tries to determine the value of certain variables that would maximize the probability of such observation happening.

The idea behind it is that if you observe enough events occurring in the world, you would have enough samples to estimate the real probability distribution of these events. MLE finds the parameters of such a model that would best fit the observed data.

Figure: A normal distribution of heights of a given population.

For example, let’s say that you know that a normal distribution best describes the height of individuals for a certain country, like in the figure above. The y-axis represents the number of people with a certain height, and the x-axis represents the heights of the individuals. So in the center of this curve, we know that there are many people that have average height, and then as we move farther from the center on either side, there are fewer taller or shorter people.

With this technique, you can poll a lot of people to find out their height, and based on the data, you can determine the real distribution across the entire population. As shown in the figure below, after receiving the responses, you can then assume that the distribution of the heights of the entire population will be the one that maximizes the likelihood of those responses, the curved line from the first figure. The information on the distribution of heights of people inside a country can be useful for many applications. And with MLE, you can determine the most likely scenario for the heights of the population by surveying only a portion of the population.

Figure: This image shows the corresponding responses from a set of people. The curves on top of it are the assumed models based on the data given by the answers and the assumed normal distribution, using MLE. The blue line represents the height of men in the study’s population and the pink, the height of women.
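The height survey can be sketched in a few lines. For a normal distribution, the maximum likelihood estimates have a closed form: the sample mean, and the standard deviation computed with a divisor of n. The survey responses below are made-up values, and the function names are assumptions of this sketch.

```python
import math
from statistics import fmean


def normal_mle(samples):
    """Closed-form MLE for a normal model: (mean, standard deviation)."""
    mu = fmean(samples)
    variance = fmean([(x - mu) ** 2 for x in samples])  # divides by n, not n - 1
    return mu, math.sqrt(variance)


def log_likelihood(samples, mu, sigma):
    """Log of the probability density of the samples under N(mu, sigma^2)."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in samples
    )


# Hypothetical survey responses, heights in centimeters.
heights = [160.0, 165.0, 170.0, 175.0, 180.0]

mu_hat, sigma_hat = normal_mle(heights)
print(f"estimated mean = {mu_hat:.1f} cm, estimated std = {sigma_hat:.2f} cm")

# Any other parameter choice explains the same responses less well.
print(log_likelihood(heights, mu_hat, sigma_hat))
print(log_likelihood(heights, mu_hat + 1.0, sigma_hat))
```

Comparing the two log-likelihood values shows the defining property of the estimate: shifting the mean away from the fitted value lowers the likelihood of the observed responses.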

In many ways, the use of probability for inference marked this period. It preceded the revolution that multilayer neural networks, also known as deep learning, would cause in the field. Probabilistic reasoning was successful in many applications and reached its peak with the development of IBM Watson. While Watson did not use Bayesian networks or maximum likelihood estimation for its calculations, it used probabilistic reasoning to determine the most likely answer.

IBM Watson

Watson was a project developed from 2004 to 2011 by IBM to beat the best humans at the television game show Jeopardy! The project was one of the last successful systems to use probabilistic reasoning before deep learning became the go-to solution for most machine learning problems.

Since Deep Blue’s victory over Garry Kasparov in 1997, IBM had been searching for a new challenge. In 2004, Charles Lickel, an IBM Research Manager at the time, identified the project after a dinner with co-workers. Lickel noticed that most people in the restaurant were staring at the bar’s television. Jeopardy! was airing. As it turned out, Ken Jennings was playing his 74th match, the last game he won.

Figure: The computer that IBM used for IBM Watson’s Jeopardy! competition.

Intrigued by the show as a possible challenge for IBM, Lickel proposed the idea of IBM competing against the best Jeopardy! players. The first time he presented the idea, he was immediately shut down, but that would change. The next year, Paul Horn, an IBM executive, backed Lickel’s idea. In the beginning, Horn found it challenging to find someone in the department to lead the project, but eventually, David Ferrucci, one of IBM’s senior researchers, took the lead. They named the project Watson after the father and son team who led IBM from 1914 to 1971, Thomas J. Watson Sr. and Jr.

In the Deep Blue project, the chess rules were entirely logical and could be easily reduced to math. The rules for Jeopardy!, however, involved complex behaviors, such as language, and were much harder to solve. When the project started, the best question-answering (QA) systems could only answer questions in very simple language, like, “What is the capital of Brazil?” Jeopardy! is a quiz competition where contestants are presented with a clue in the form of an answer, and they must phrase their response as a question. For example, a clue could be: “Terms used in this craft include batting, binding, and block of the month.” The correct response would be “What is quilting?”

IBM had already been working on a QA system called Practical Intelligent Question Answering Technology (Piquant)* for six years before Ferrucci started the Watson project. In a US government competition, Piquant correctly answered only 35% of the questions and took minutes to do so. This performance was not even close to what was necessary to win Jeopardy!, and attempts to adapt Piquant failed. So, a new approach to QA was required. Watson was the next attempt.

In 2006, Ferrucci ran initial tests of Watson and compared the results against the current competition. Watson was far below what was needed for live competition. Not only did it respond correctly just 15% of the time, compared to 95% for other programs, but it was also slower. Watson had to be much better than the best software system at the time to have even the slightest chance to win against the best humans. The next year, IBM staffed a team of 15 and gave a timeframe of three to five years. Ferrucci and his team had much work to do.* And they succeeded. By 2010, Watson was winning against Jeopardy! contestants.

Figure: Comparison of precision and percentage of questions answered by the best system before IBM Watson and the top human Jeopardy! players.

What made the game so hard for Watson was that language was a very difficult problem for computers at the time. Language is full of intended and implied meaning. An example of such a sentence is “The name of this hat is elementary, my dear contestant.” People can easily detect the wordplay that evokes “elementary, my dear Watson,” a catchphrase used by Sherlock Holmes, and then remember that the Hollywood version of Sherlock Holmes wears a deerstalker hat. Programming a computer to infer this for a wide range of questions is hard.

To provide a physical presence in the televised games, Watson was represented by a “glowing blue globe criss-crossed by threads of ‘thought,’—42 threads, to be precise,”* referencing the significance of the number 42 in the book The Hitchhiker’s Guide to the Galaxy. Let’s go over how Watson worked.

Watson’s Brain

Watson’s main difference from other systems was its speed and memory. Stored in its memory were millions of documents including books, dictionaries, encyclopedias, and news articles. The data was collected either online from sources like Wikipedia or offline. The algorithm employed different techniques that together allowed Watson to win the competition. The following are a few of these techniques.

Learning from Reading

First, Watson “read” vast amounts of text. It looked at the text semantically and syntactically, meaning that it tried to tear sentences apart to understand them. For example, it identified the location of sentences’ subjects, verbs, and objects and produced a graph of the sentences, known as syntactic frames. Again, AI used learning techniques much like humans. In this case, Watson learned the basics of grammar similar to how an elementary student does.

Then, Watson correlated and calculated confidence scores for each sentence based on how many times and in what source the information was found. For example, in the sentence “Inventors invent patents,” Watson identified “Inventors” as the subject of the sentence, “invent” as the verb, and “patents” as the object. The entire sentence has a confidence score of 0.8 because Watson found it in a few of the relevant sources. Another example is the sentence “People earn degrees at schools,” which has a confidence score of 0.9. A semantic frame contains a sentence, a score, and information about what each word is syntactically.

Figure: How learning from reading works.

This figure shows the process of learning from reading. First, the text is parsed and turned into syntactic frames. Then, through generalization and statistical aggregation, they are turned into semantic frames.

Searching for the Answer

Most of the algorithms in Watson were not novel techniques. Take the search step: for the clue “He was presidentially pardoned on September 8, 1974,” the algorithm found that the clue was asking for the subject of the sentence. It then searched for possible subjects in semantic frames with similar words in them. Based on the syntactical breakdown done in the first step, it generated a set of possible answers. If one of the possible answers it found was “Nixon,” that would be considered a candidate answer. Next, Watson played a clever trick, replacing the word “He” with “Nixon” to form the new sentence “Nixon was presidentially pardoned on September 8, 1974.”

Then, it ran a new search on the generated semantic frame, checking to see if it was the correct answer. The search found a very similar semantic frame “Ford pardoned Nixon on September 8, 1974” with a high confidence score, so the candidate answer was also given a high score. But searching and getting a confidence score was not the only technique applied by Watson.
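The substitute-and-search step can be sketched as follows. This is a toy illustration and not Watson’s actual algorithm: the `SemanticFrame` class, the word-overlap (Jaccard) similarity, and the choice to multiply overlap by the frame’s confidence are all assumptions made for the example. The frames and their scores reuse the sentences quoted above, except the 0.9 for the Ford frame, which is a stand-in for “a high confidence score.”

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticFrame:
    text: str
    confidence: float


# A tiny stand-in for the knowledge base built by reading.
FRAMES = [
    SemanticFrame("Ford pardoned Nixon on September 8, 1974", 0.9),
    SemanticFrame("Inventors invent patents", 0.8),
    SemanticFrame("People earn degrees at schools", 0.9),
]


def tokens(text):
    """Lowercased words with surrounding punctuation stripped."""
    return {word.strip(".,").lower() for word in text.split()}


def score_candidate(clue, placeholder, candidate, frames):
    """Substitute the candidate into the clue and score the best frame match."""
    hypothesis = clue.replace(placeholder, candidate, 1)
    hypothesis_tokens = tokens(hypothesis)
    best = 0.0
    for frame in frames:
        frame_tokens = tokens(frame.text)
        overlap = len(hypothesis_tokens & frame_tokens) / len(
            hypothesis_tokens | frame_tokens
        )
        best = max(best, overlap * frame.confidence)
    return hypothesis, best


clue = "He was presidentially pardoned on September 8, 1974"
for candidate in ("Nixon", "Carter"):
    hypothesis, score = score_candidate(clue, "He", candidate, FRAMES)
    print(f"{score:.2f}  {hypothesis}")
```

The hypothesis containing “Nixon” shares more words with the high-confidence Ford frame than the one containing “Carter,” so it receives the higher score.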

Evaluating Hypotheses

Evaluating hypotheses was another clever technique that Watson employed to help evaluate its answers. With the clue: “In cell division, mitosis splits the nucleus and cytokinesis splits this liquid cushioning the nucleus,” Watson searched for possible answers in the knowledge base that it acquired through reading. In this case, it found many candidate answers:

  • Organelle

  • Vacuole

  • Cytoplasm

  • Plasm

  • Mitochondria

Systematically, it tested the possible answers by creating an intermediate hypothesis, checking if the solutions fit the criterion of being liquid. It calculated the confidence of each one of the solutions being liquid using its semantic frames and the same search mechanism described above. The results had the following confidence scores:

  • is (“Cytoplasm”, “liquid”) = 0.2

  • is (“Organelle”, “liquid”) = 0.1

  • is (“Vacuole”, “liquid”) = 0.1

  • is (“Plasm”, “liquid”) = 0.1

  • is (“Mitochondria”, “liquid”) = 0.1

To generate these confidence scores, it searched through its knowledge base and, for example, found the semantic frame:

Cytoplasm is a fluid surrounding the nucleus.

It then checked to see if fluid was a type of liquid. To answer that, it looked at different resources, including WordNet, a lexical database of semantic relations between words, but did not find evidence showing that fluid is a liquid. Through its knowledge base, it learned that sometimes people consider fluid a liquid. With all that information, it created a possible answer set, with each answer having its own probability—a confidence score—assigned to it.
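One way to picture the final answer set is to turn the “is liquid” confidence scores listed above into probabilities that sum to one. Normalizing by the total is an assumption of this sketch; the text does not say how Watson combined its scores.

```python
# Confidence that each candidate is a liquid, as listed in the text.
is_liquid = {
    "Cytoplasm": 0.2,
    "Organelle": 0.1,
    "Vacuole": 0.1,
    "Plasm": 0.1,
    "Mitochondria": 0.1,
}


def normalize(scores):
    """Rescale scores so they sum to 1 (an assumed combination rule)."""
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}


answer_set = normalize(is_liquid)
best_answer = max(answer_set, key=answer_set.get)

for name, probability in sorted(answer_set.items(), key=lambda kv: -kv[1]):
    print(f"{probability:.2f}  {name}")
print("best answer:", best_answer)
```

Even though no single score is high, “Cytoplasm” stands out relative to the alternatives, which is what matters when choosing among candidates.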

Cross-Checking Space and Time

Another technique Watson employed was to cross-check whether candidate answers made sense historically or geographically, checking to see which answers could be eliminated or changing the probability of a response being correct.

For example, for the clue “In 1594, he took the job as a tax collector in Andalusia,” the two top answers generated by the first pass of the algorithm were “Thoreau” and “Cervantes.” When Watson analyzed “Thoreau” as a possible answer, it discovered that Thoreau was born in 1817, and at that point, Watson ruled that answer out because he was not alive in 1594.
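The temporal cross-check amounts to a filter over the candidate answers. The sketch below assumes a small lookup table of birth and death years; the table and the function names are hypothetical, and Watson would have drawn such facts from its own knowledge base.

```python
# Birth and death years for the two candidates (a hand-built lookup table).
LIFESPANS = {
    "Thoreau": (1817, 1862),
    "Cervantes": (1547, 1616),
}


def alive_in(name, year):
    """True if the person was alive during the given year."""
    born, died = LIFESPANS[name]
    return born <= year <= died


def cross_check_time(candidates, year):
    """Keep only the candidates who were alive in the clue's year."""
    return [name for name in candidates if alive_in(name, year)]


print(cross_check_time(["Thoreau", "Cervantes"], 1594))
```

A softer variant could lower a candidate’s confidence instead of removing it, which matches the text’s mention of changing the probability of a response being correct.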

Learning Through Experience

Jeopardy!’s questions are grouped into categories, limiting the scope of knowledge needed for each answer. Watson used that information to adjust its answer confidence. For example, in the category “Celebrations of the Month,” the first clue was “National Philanthropy Day and All Souls’ Day.” Based on its algorithm, Watson’s answer would be “Day of the Dead” because it classified this category as being of the type “Day,” but the correct response was November. Because of that, Watson updated the category type to be a mix of “Day” and “Month,” which boosted answers that are of type “Month.” With time, Watson could update the type of response for a certain category.

Figure: IBM Watson updates the category type when its responses do not reflect the type of response for the correct answer. Then, it updates the possible category type based on the correct answers.
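The category-type update can be sketched as a running mix of answer types that shifts toward the type of each revealed correct response. The update rule and the `learning_rate` parameter are assumptions of this sketch, not Watson’s documented method.

```python
def update_category_types(weights, correct_type, learning_rate=0.5):
    """Move the category's type weights toward the revealed answer type."""
    types = sorted(set(weights) | {correct_type})
    return {
        t: (1 - learning_rate) * weights.get(t, 0.0)
        + learning_rate * (1.0 if t == correct_type else 0.0)
        for t in types
    }


# Watson starts out believing the category asks for a "Day".
weights = {"Day": 1.0}

# The first correct response, "November", is of type "Month".
weights = update_category_types(weights, "Month")
print(weights)

# A second "Month" answer shifts the mix further.
weights = update_category_types(weights, "Month")
print(weights)
```

After one correction the category is an even mix of “Day” and “Month,” and further answers of type “Month” keep boosting that type, so later candidate answers of the matching type can be ranked higher.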

Practice Match

Figure: This image shows the evolution of IBM Watson across its different versions and upgrades.

These techniques were all employed together to make Watson perform at the highest level. In the beginning of 2011, IBM scientists decided that Watson was good enough to play against the best human opponent. They played a practice match before the press on January 13, 2011, and Watson won against Ken Jennings and Brad Rutter, two of the best Jeopardy! players. Watson ended the game with a score of $4,400, Ken Jennings with $3,400, and Brad Rutter with $1,200. Watson and Jennings were tied until the final question, worth $1,000—Watson won the game on that question. After the practice match, Watson was ready to play against the best humans in front of a huge audience on national television.

First Match

The first broadcasted match happened a month later on February 14, 2011, and the second match the next day. Watson won the first match but made a huge mistake. In the final round, Watson’s response in the US Cities category to the prompt “Its largest airport is named for a World War II hero; its second largest, for a World War II battle” was “What is Toronto??????” Alex Trebek, the host of Jeopardy! and a Canadian native, made fun of Watson, jokingly saying that he learned that Toronto was an American city.

David Ferrucci, the lead scientist, explained that Watson did not deal with structured databases, so it used US City as a clue to what the possible answer could include and that many American cities are named Toronto. Also, the Canadian baseball team, the Toronto Blue Jays, plays in the American League. That could be the reason why Watson considered Toronto to be one of the possible answers. Ferrucci also said that very often answers in Jeopardy! are not the types of things that are named in that category. Watson knew that, and so possibly considered that the category “US Cities” might be a clue to the answer. Watson used other elements to contribute to its response as well. The engineers also stated that its confidence was very low, which was indicated by the number of question marks after Watson’s answer. Watson had a 14% confidence percentage for “What is Toronto??????”. The correct answer, “What is Chicago?”, was a close second with an 11% confidence percentage. At the end of the first match, however, Watson had more than triple the money of the second-best competitor. Watson finished with $35,734, Rutter with $10,400, and Jennings with $4,800.

Figure: David Ferrucci, the man behind Watson.

Second Match

To support Watson on the second day of the competition, one of the engineers wore a Toronto Blue Jays jacket. The game started, and Jennings chose the Daily Double clue. Watson responded incorrectly to the Daily Double clue for the first time in the two days of play. After the first round, Watson placed second for the first time in the competition. But in the end, Watson won the second match with $77,147; Jennings finished in second place with $24,000. IBM Watson made history as the first machine to win Jeopardy! against the best humans.

A Brief Overview of Deep Learning

2 hours, 79 links

The fundamental shift in solving problems that probabilistic reasoning brought to AI from 1993 to 2011 was a big step forward, but probability and statistics only took developers so far. Geoffrey Hinton helped develop and popularize a breakthrough technique called backpropagation to usher in the next era of artificial intelligence: deep learning. His work with multilayer neural networks is the basis of modern-day AI development.

Deep learning is a class of machine learning methods that uses multilayer neural networks that are trained through techniques such as supervised, unsupervised, and reinforcement learning.

In 2012, Geoffrey Hinton and the students at his lab showed that deep neural networks, trained using backpropagation, beat the best algorithms in image recognition by a wide margin.
