IBM Watson



Updated November 2, 2022

You’re reading an excerpt of Making Things Think: How AI and Deep Learning Power the Products We Use, by Giuliano Giacaglia. Purchase the book to support the author and the ad-free Holloway reading experience. You get instant digital access, plus future updates.

Watson was a project developed from 2004 to 2011 by IBM to beat the best humans at the television game show Jeopardy! The project was one of the last successful systems to use probabilistic reasoning before deep learning became the go-to solution for most machine learning problems.

Since Deep Blue’s victory over Garry Kasparov in 1997, IBM had been searching for a new challenge. In 2004, Charles Lickel, an IBM research manager at the time, found one during a dinner with co-workers. Lickel noticed that most people in the restaurant were staring at the bar’s television, where Jeopardy! was airing: Ken Jennings was playing his 74th match, the last game he won.

Figure: The computer that IBM used for IBM Watson’s Jeopardy! competition.

Intrigued by the show as a possible challenge for IBM, Lickel proposed the idea of IBM competing against the best Jeopardy! players. The first time he presented the idea, he was immediately shut down, but that would change. The next year, Paul Horn, an IBM executive, backed Lickel’s idea. In the beginning, Horn found it challenging to find someone in the department to lead the project, but eventually, David Ferrucci, one of IBM’s senior researchers, took the lead. They named the project Watson after the father and son team who led IBM from 1914 to 1971, Thomas J. Watson Sr. and Jr.

In the Deep Blue project, the chess rules were entirely logical and could be easily reduced to math. The rules for Jeopardy!, however, involved complex behaviors, such as language, and were much harder to solve. When the project started, the best question-answering (QA) systems could only answer questions in very simple language, like, “What is the capital of Brazil?” Jeopardy! is a quiz competition where contestants are presented with a clue in the form of an answer, and they must phrase their response as a question. For example, a clue could be: “Terms used in this craft include batting, binding, and block of the month.” The correct response would be “What is quilting?”

IBM had already been working on a QA system called Practical Intelligent Question Answering Technology (Piquant)* for six years before Ferrucci started the Watson project. In a US government competition, Piquant correctly answered only 35% of the questions and took minutes to do so. This performance was not even close to what was necessary to win Jeopardy!, and attempts to adapt Piquant failed. So, a new approach to QA was required. Watson was the next attempt.

In 2006, Ferrucci ran initial tests of Watson and compared the results against the existing competition. Watson was far below what was needed for live play: not only did it respond correctly only 15% of the time, compared to 95% for other programs, it was also slower. Watson had to be much better than the best software system at the time to have even the slightest chance of winning against the best humans. The next year, IBM staffed a team of 15 and set a timeframe of three to five years. Ferrucci and his team had much work to do.* And they succeeded: by 2010, Watson was consistently winning against former Jeopardy! contestants.

Figure: Comparison of precision and percentage of questions answered by the best system before IBM Watson and the top human Jeopardy! players.


What made the game so hard for Watson was that language was a very difficult problem for computers at the time. Language is full of intended and implied meaning. An example of such a sentence is “The name of this hat is elementary, my dear contestant.” People can easily detect the wordplay that evokes “elementary, my dear Watson,” a catchphrase used by Sherlock Holmes, and then remember that the Hollywood version of Sherlock Holmes wears a deerstalker hat. Programming a computer to infer this for a wide range of questions is hard.

To provide a physical presence in the televised games, Watson was represented by a “glowing blue globe criss-crossed by threads of ‘thought,’—42 threads, to be precise,”* referencing the significance of the number 42 in the book The Hitchhiker’s Guide to the Galaxy. Let’s go over how Watson worked.

Watson’s Brain

Watson’s main difference from other systems was its speed and memory. Stored in its memory were millions of documents including books, dictionaries, encyclopedias, and news articles. The data was collected either online from sources like Wikipedia or offline. The algorithm employed different techniques that together allowed Watson to win the competition. The following are a few of these techniques.

Learning from Reading

First, Watson “read” vast amounts of text. It analyzed the text semantically and syntactically, meaning that it took sentences apart to understand them. For example, it identified the location of each sentence’s subject, verb, and object and produced a graph of the sentence, known as a syntactic frame. Here again, the AI used learning techniques much like humans do: Watson learned the basics of grammar similar to the way an elementary school student does.

Then, Watson correlated sentences and calculated a confidence score for each one based on how many times, and in which sources, the information was found. For example, in the sentence “Inventors invent patents,” Watson identified “Inventors” as the subject, “invent” as the verb, and “patents” as the object. The entire sentence has a confidence score of 0.8 because Watson found it in a few of the relevant sources. Another example is the sentence “People earn degrees at schools,” which has a confidence score of 0.9. A semantic frame contains a sentence, a score, and information about the syntactic role of each word.

Figure: How learning from reading works.

This figure shows the process of learning from reading. First, the text is parsed and turned into syntactic frames. Then, through generalization and statistical aggregation, they are turned into semantic frames.
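The frame-building step described above can be sketched roughly as follows. This is a minimal illustration, not IBM’s actual implementation: the class, the corpus, and the scoring rule (more supporting sources means higher confidence) are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class SemanticFrame:
    # Illustrative stand-in for Watson's semantic frames: a parsed
    # triple plus a score reflecting corpus support.
    subject: str
    verb: str
    obj: str
    confidence: float

def build_frame(subject, verb, obj, corpus):
    """Score a (subject, verb, object) triple by how often it appears."""
    sentence = f"{subject} {verb} {obj}".lower()
    hits = sum(1 for doc in corpus if sentence in doc.lower())
    # Crude aggregation: more supporting sources -> higher confidence.
    confidence = min(0.99, hits / (hits + 1))
    return SemanticFrame(subject, verb, obj, confidence)

corpus = [
    "People earn degrees at schools.",
    "Most people earn degrees at schools or universities.",
]
frame = build_frame("People", "earn", "degrees at schools", corpus)
```

A real system would aggregate over millions of documents and weight sources by reliability; the shape of the output, a scored frame, is the point here.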

Searching for the Answer

Most of the algorithms in Watson were not novel techniques. For example, for the clue “He was presidentially pardoned on September 8, 1974,” the algorithm determined that the sentence was looking for its subject. It then searched for possible subjects in semantic frames containing similar words. Based on the syntactic breakdown done in the first step, it generated a set of possible answers. If one of the possible answers it found was “Nixon,” that became a candidate answer. Next, Watson played a clever trick: it replaced the word “He” with “Nixon,” forming the new sentence “Nixon was presidentially pardoned on September 8, 1974.”

Then, it ran a new search on the generated semantic frame, checking whether it was the correct answer. The search found a very similar semantic frame, “Ford pardoned Nixon on September 8, 1974,” with a high confidence score, so the candidate answer was also given a high score. But searching and scoring confidence was not the only technique Watson applied.
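The substitution trick can be sketched as follows. The knowledge base, the word-overlap scoring, and the function names are illustrative stand-ins for Watson’s semantic-frame matching, assumed for the example.

```python
def score_candidate(clue, pronoun, candidate, knowledge_base):
    """Substitute a candidate answer into the clue, then score the
    resulting hypothesis against known (sentence, confidence) pairs."""
    hypothesis = clue.replace(pronoun, candidate)
    hyp_words = set(hypothesis.lower().split())
    best = 0.0
    for sentence, confidence in knowledge_base:
        # Crude proxy for semantic-frame matching: word overlap,
        # weighted by the source sentence's confidence score.
        overlap = len(hyp_words & set(sentence.lower().split())) / len(hyp_words)
        best = max(best, overlap * confidence)
    return best

kb = [("Ford pardoned Nixon on September 8, 1974", 0.9)]
clue = "He was presidentially pardoned on September 8, 1974"
nixon_score = score_candidate(clue, "He", "Nixon", kb)
```

Under this toy scoring, “Nixon” outranks an unrelated candidate like “Thoreau” because the substituted sentence shares more words with the stored frame.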

Evaluating Hypotheses

Evaluating hypotheses was another clever technique that Watson employed to help evaluate its answers. With the clue: “In cell division, mitosis splits the nucleus and cytokinesis splits this liquid cushioning the nucleus,” Watson searched for possible answers in the knowledge base that it acquired through reading. In this case, it found many candidate answers:

  • Organelle

  • Vacuole

  • Cytoplasm

  • Plasm

  • Mitochondria

Systematically, it tested the possible answers by creating an intermediate hypothesis: checking whether each solution fit the criterion of being a liquid. It calculated the confidence of each solution being a liquid using its semantic frames and the same search mechanism described above. The results were the following confidence scores:

  • is (“Cytoplasm”, “liquid”) = 0.2

  • is (“Organelle”, “liquid”) = 0.1

  • is (“Vacuole”, “liquid”) = 0.1

  • is (“Plasm”, “liquid”) = 0.1

  • is (“Mitochondria”, “liquid”) = 0.1

To generate these confidence scores, it searched through its knowledge base and, for example, found the semantic frame:

Cytoplasm is a fluid surrounding the nucleus.

It then checked whether fluid is a type of liquid. To answer that, it consulted different resources, including WordNet, a lexical database of semantic relations between words, but did not find direct evidence that fluid is a liquid. Through its knowledge base, however, it learned that people sometimes consider a fluid a liquid. With all that information, it created a set of possible answers, each assigned its own probability: a confidence score.
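The hypothesis-evaluation step can be sketched as combining each candidate’s search score with its score on the intermediate hypothesis. The numbers for the “is a liquid” check are taken from the list above; the search scores and the multiplicative combination rule are assumptions made for the illustration.

```python
def pick_answer(candidates, property_scores):
    """Combine each candidate's search score with how well it fits the
    intermediate hypothesis (here: 'is a liquid')."""
    scored = {c: s * property_scores.get(c, 0.0)
              for c, s in candidates.items()}
    return max(scored, key=scored.get)

# Hypothetical search scores for the candidates found while reading.
candidates = {"Cytoplasm": 0.6, "Organelle": 0.7, "Vacuole": 0.5,
              "Plasm": 0.4, "Mitochondria": 0.5}
# Confidence that each candidate is a liquid, from the text above.
is_liquid = {"Cytoplasm": 0.2, "Organelle": 0.1, "Vacuole": 0.1,
             "Plasm": 0.1, "Mitochondria": 0.1}
answer = pick_answer(candidates, is_liquid)
```

Even though “Organelle” has the highest raw search score here, the liquid check tips the combined score toward “Cytoplasm.”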

Cross-Checking Space and Time

Another technique Watson employed was to cross-check whether candidate answers made sense historically or geographically, checking to see which answers could be eliminated or changing the probability of a response being correct.

For example, take the clue “In 1594, he took the job as a tax collector in Andalusia.” The two top answers generated by the first pass of the algorithm were “Thoreau” and “Cervantes.” When Watson analyzed “Thoreau” as a possible answer, it found that Thoreau was born in 1817, and it ruled that answer out because Thoreau was not alive in 1594.
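The temporal cross-check amounts to filtering candidates whose lifespans do not overlap the year in the clue. The dates below are historical; the structure of the check is an illustrative assumption.

```python
# Birth and death years of the two candidate answers.
LIFESPANS = {
    "Cervantes": (1547, 1616),
    "Thoreau": (1817, 1862),
}

def alive_in(candidate, year):
    born, died = LIFESPANS[candidate]
    return born <= year <= died

def filter_by_year(candidates, year):
    """Keep only candidates who were alive in the given year."""
    return [c for c in candidates if alive_in(c, year)]

survivors = filter_by_year(["Thoreau", "Cervantes"], 1594)
```

In the real system such a check would adjust confidence scores rather than hard-eliminate candidates in every case, but for a date that falls entirely outside a lifespan, elimination is safe.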

Learning Through Experience

Jeopardy!’s questions are grouped into categories, which limits the scope of knowledge needed for each answer. Watson used that information to adjust its answer confidence. For example, in the category “Celebrations of the Month,” the first clue was “National Philanthropy Day and All Souls’ Day.” Based on its algorithm, Watson’s answer would have been “Day of the Dead” because it initially classified the category as type “Day,” but the correct response was November. Because of that, Watson updated the category type to a mix of “Day” and “Month,” which boosted answers of type “Month.” Over time, Watson could update the expected response type for a given category.
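This in-game learning can be sketched as maintaining a running count of which answer types a category has rewarded, and boosting candidates of those types. The class and the boost formula are assumptions made for the illustration.

```python
from collections import Counter

class CategoryModel:
    """Tracks which answer types a category's correct answers have had."""

    def __init__(self):
        self.type_counts = Counter()

    def observe(self, answer_type):
        # Record the type of a revealed correct answer.
        self.type_counts[answer_type] += 1

    def boost(self, candidate_type):
        # Weight for candidates of a given type, based on what the
        # category has rewarded so far; 1.0 means no boost.
        total = sum(self.type_counts.values())
        if total == 0:
            return 1.0
        return 1.0 + self.type_counts[candidate_type] / total

model = CategoryModel()
model.observe("Month")  # the first correct response was "November"
```

After observing one “Month” answer, candidates of type “Month” receive a higher weight than “Day” candidates on the category’s remaining clues.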

Figure: IBM Watson updates the category type when its responses do not reflect the type of response for the correct answer. Then, it updates the possible category type based on the correct answers.

Practice Match

Figure: This image shows the evolution of IBM Watson’s performance across its different versions and upgrades.

These techniques were all employed together to make Watson perform at the highest level. At the beginning of 2011, IBM scientists decided that Watson was good enough to play against the best human opponents. On January 13, 2011, it played a practice match before the press and won against Ken Jennings and Brad Rutter, two of the best Jeopardy! players. Watson ended the game with a score of $4,400, Ken Jennings with $3,400, and Brad Rutter with $1,200. Watson and Jennings were tied until the final question, worth $1,000, and Watson won the game on that question. After the practice match, Watson was ready to play against the best humans in front of a huge audience on national television.

First Match

The first broadcast match happened a month later, on February 14, 2011, and the second match the next day. Watson won the first match but made a huge mistake. In the final round, Watson’s response in the “US Cities” category to the clue “Its largest airport is named for a World War II hero; its second largest, for a World War II battle” was “What is Toronto??????” Alex Trebek, the host of Jeopardy! and a Canadian native, made fun of Watson, joking that he had learned that Toronto was an American city.

David Ferrucci, the project’s lead scientist, explained that Watson did not rely on structured databases, so it used “US Cities” only as a hint about what the answer could include, and that several American cities are named Toronto. In addition, the Canadian baseball team, the Toronto Blue Jays, plays in baseball’s American League. These factors may explain why Watson considered Toronto one of the possible answers. Ferrucci also said that answers in Jeopardy! very often are not of the type named in the category; Watson knew that, and so treated “US Cities” as a clue rather than a strict constraint. Other elements contributed to its response as well. The engineers also pointed out that Watson’s confidence was very low, as indicated by the string of question marks after its answer: Watson had only 14% confidence in “What is Toronto??????”, and the correct answer, “What is Chicago?”, was a close second at 11%. At the end of the first match, however, Watson had more than triple the money of the second-best competitor: Watson won with $35,734, Rutter had $10,400, and Jennings had $4,800.

Figure: David Ferrucci, the man behind Watson.

Second Match

To support Watson on the second day of the competition, one of the engineers wore a Toronto Blue Jays jacket. The game started, and Jennings chose the Daily Double clue. Watson responded incorrectly to the Daily Double clue for the first time in the two days of play. After the first round, Watson placed second for the first time in the competition. But in the end, Watson won the second match with $77,147; Jennings finished in second place with $24,000. IBM Watson made history as the first machine to win Jeopardy! against the best humans.
