Samantha: You know what’s interesting? I used to be so worried about not having a body, but now I truly love it. I’m growing in a way I couldn’t if I had a physical form. I mean, I’m not limited; I can be anywhere and everywhere simultaneously. I’m not tethered to time and space in a way that I would be if I was stuck in a body that’s inevitably going to die.
Her (2013)
Voice assistants are increasingly ubiquitous. Smart speakers became popular after Amazon introduced the Echo, a speaker with Alexa as its voice assistant, in November 2014. By 2017, tens of millions of smart speakers were in people’s homes, and every one of them used voice as its main interface. Voice assistants are present not only in smart speakers but also in every smartphone. The best known, Siri, runs on the iPhone.
Apple’s Siri, the first voice assistant deployed to the mass market, made its public debut at a media event on October 4, 2011. Phil Schiller, Apple’s Senior Vice President of Marketing, introduced Siri by demonstrating capabilities such as looking up the weather forecast, setting an alarm, and checking the stock market. That event was actually Siri’s second introduction. Siri first launched as a standalone app created by Siri, Inc., and Apple bought the technology for $200M in April 2010.*
Siri was an offshoot of a project at SRI International’s Artificial Intelligence Center. In 2003, DARPA led a 5-year, 500-person effort to build a virtual assistant, investing a total of $150M. At that time, this program, CALO (Cognitive Assistant that Learns and Organizes), was the largest AI program in history. Adam Cheyer, a researcher at SRI on the CALO project, assembled the pieces produced by the different research labs into a single assistant. The version Cheyer helped build, also called CALO at the time, was still a prototype and was not ready for installation on people’s devices. That role put Cheyer in a privileged position to understand how CALO worked from end to end.
Cheyer split his time between working at SRI as a researcher and helping SRI’s Vanguard program. Vanguard helped companies like Motorola and Deutsche Telekom test the future of a new gadget called the smartphone. Cheyer developed his own prototype of a virtual assistant, more limited than CALO but better suited to Vanguard’s needs. The prototype impressed Motorola’s general manager, Dag Kittlaus, who tried unsuccessfully to persuade Motorola to use Vanguard’s technology. Kittlaus quit and joined SRI as an entrepreneur-in-residence. Soon after, Cheyer, Kittlaus, and Tom Gruber started Siri, Inc. Their company had the advantage of being able to use CALO’s technology. Under a law passed by Congress in 1980, the non-profit SRI could give Siri, Inc. those rights in return for a share of its profits. So, SRI licensed the technology in exchange for a stake in the new company.
Broadly, Siri’s technology had four parts. Speech recognition captured what you said when you talked to Siri. The natural language component worked out what you meant. The third part executed the request. The final part delivered Siri’s response.*
To interpret a request, Siri took an entirely different approach from other technology at the time. The traditional method, as used in IBM Watson, identified the linguistic concepts in a sentence, like the subject, verb, and object, and based on those, tried to understand what these pieces meant together.
Instead, the Siri team modeled real-world objects. When told, “I want to see a thriller,” Siri recognized the word “thriller” as a film genre and pulled up movies rather than analyzing how the subject connected to the verb or object. Siri mapped each question to a domain of potential actions and then chose the one that seemed most probable based on the relationship between real-world concepts. For example, if I said, “What time does the closest McDonald’s close?” Siri mapped this question to the domain of local businesses, found the McDonald’s closest to the current location, and looked up its closing time. Siri then responded with the answer.
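The idea of mapping a question to a domain can be sketched in a few lines of Python. This is only an illustration, not Siri’s actual implementation: the concept vocabulary, the domain names, and the simple counting rule are all assumptions made for the example.

```python
# Hypothetical concept vocabulary: each real-world concept points to a domain.
CONCEPTS = {
    "thriller": "movies",
    "comedy": "movies",
    "movie": "movies",
    "mcdonald's": "local_business",
    "closest": "local_business",
    "close": "local_business",
    "restaurant": "local_business",
    "weather": "weather",
    "forecast": "weather",
}


def normalize(text):
    """Lowercase, unify apostrophes, and strip punctuation around each word."""
    words = text.lower().replace("’", "'").split()
    return [w.strip("?.,!\"“”") for w in words]


def score_domains(text):
    """Count how many known concepts in the text point to each domain."""
    scores = {}
    for word in normalize(text):
        domain = CONCEPTS.get(word)
        if domain:
            scores[domain] = scores.get(domain, 0) + 1
    return scores


def best_domain(text):
    """Return the domain with the most matching concepts, or None."""
    scores = score_domains(text)
    if not scores:
        return None
    return max(scores, key=scores.get)


print(best_domain("I want to see a thriller"))
print(best_domain("What time does the closest McDonald's close?"))
```

Note that the sketch never looks at grammar. It only asks which real-world concepts appear and which domain they suggest, which is the contrast with subject-verb-object parsing drawn above.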
Siri also employed some additional tricks. In a noisy lobby, a request for the “closest coffee shop” might sound like “closest call Felicia,” but Siri knew that “closest” characterizes a place rather than a person. So it inferred that the question was probably about a place and tried to get the gist of the sentence without understanding every word. Early on, the Siri creators saw virtually no limits on the routine tasks that the assistant could automate, but they also knew that their assistant would only succeed if it was both smart and fun to interact with. So, they programmed funny answers to offbeat questions. For example, if you ask Siri, “Tell me a joke,” one of the responses is, “The past, present, and future walked into a bar. It was tense.”
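One way to picture the noisy-lobby trick is as a choice between candidate transcriptions. The sketch below is a toy under stated assumptions: the word lists and the scoring rule are hypothetical, and a real assistant would rely on far richer knowledge than this.

```python
# Hypothetical word lists, chosen only to illustrate the idea.
PLACE_WORDS = {"coffee", "shop", "restaurant", "pharmacy", "station"}
PLACE_CUES = {"closest", "nearest", "nearby"}


def plausibility(hypothesis):
    """Count place words, but only when a place cue such as "closest" appears."""
    words = hypothesis.lower().split()
    if not PLACE_CUES.intersection(words):
        return 0
    return sum(1 for w in words if w in PLACE_WORDS)


def pick_hypothesis(candidates):
    """Return the candidate transcription most consistent with the place cue."""
    return max(candidates, key=plausibility)


print(pick_hypothesis(["closest call Felicia", "closest coffee shop"]))
```

Because “closest” signals a place, the candidate that contains place words wins over the one that names a person, even though the two might sound alike.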
Three weeks after Siri launched on the App Store, Kittlaus received a personal call from Steve Jobs, then CEO of Apple, who wanted to buy the company and integrate Siri directly into the iPhone. Creating a voice interface was an area of interest for Jobs, and Kittlaus’s team had cracked the code. Siri, Inc. and Apple joined forces and launched Siri exclusively on the iPhone. As a result, almost every consumer device connected to the internet today integrates a voice assistant or can interface with one.
Although Apple was the first major tech company to integrate a smart assistant into its phone operating system, other systems quickly caught up and surpassed Siri’s capabilities.* Amazon’s Alexa first appeared in 2014, and the Google Assistant followed in 2016. These newcomers offer more features and better voice recognition software. For example, the new Google Home speakers can recognize different people from the sound of their voices. If a person says, “Ok Google, call my dad,” the device knows to fetch the contacts of the person speaking. Google and Amazon have also done more to let outside developers build on their platforms. Developers have built more than 25,000 Alexa skills, and the Amazon assistant is being integrated into cars, televisions, and home appliances.
More recently, Apple has been catching up with its competitors. It transitioned the model behind its voice recognition system to a neural network in 2014.* Siri also now interprets commands more flexibly. For example, if I say to Siri, “Send Jane $20 with Square Cash,” the screen displays text reflecting this request. If someone instead says, “Shoot 20 bucks to my wife,” the same thing happens. In 2017, Apple introduced a way for Siri to learn from its mistakes by adding a layer of reinforcement learning.* And in 2018, it created a platform for users to define shortcuts, allowing a customized set of commands.* For example, a user can create the command, “Turn the romantic mood on,” and configure Siri to turn smart lights on in a certain color and play romantic music. There are still gaps, but Siri’s capabilities continue to increase.
The Brain of a Voice Assistant
At a high level, a voice assistant’s brain is divided into a few main tasks:*
[Optional] Trigger command detection to recognize phrases like “Hey Siri” or “Hey Google” so that the device listens to the speech following it;
Automatic speech recognition to transcribe human speech into text;
Natural language processing to parse the text using part-of-speech tagging and noun-phrase chunking;
Question-and-intent analysis to analyze the parsed text, detecting user commands and actions such as “schedule a meeting” or “set my alarm”;
Data mashup technologies to interface with third-party web services, like OpenTable or Wolfram|Alpha, to perform actions, execute searches, and answer questions;
Data transformations to convert the output of third-party web services back into natural language text, like “today’s weather report” into “The weather will be sunny today”; and
Finally, text-to-speech techniques to convert the text into synthesized speech that the voice assistant speaks back to the user.
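The tasks above chain together into a pipeline, which the following Python sketch illustrates. Every stage is a deliberately simple stand-in, and all function names are hypothetical: the “audio” is already a string, intent detection uses keyword rules in place of real parsing, and the web service returns canned data.

```python
def transcribe(audio):
    # Stand-in for speech recognition: the "audio" here is already a string.
    return audio.lower().strip()


def detect_intent(text):
    # Stand-in for parsing plus intent analysis: simple keyword rules.
    if "weather" in text:
        return {"intent": "get_weather"}
    if "alarm" in text:
        return {"intent": "set_alarm"}
    return {"intent": "unknown"}


def call_service(request):
    # Stand-in for a third-party web service; returns canned data.
    if request["intent"] == "get_weather":
        return {"condition": "sunny"}
    if request["intent"] == "set_alarm":
        return {"status": "set"}
    return {}


def to_sentence(request, result):
    # Data transformation: turn the structured result into natural language.
    if request["intent"] == "get_weather":
        return f"The weather will be {result['condition']} today."
    if request["intent"] == "set_alarm":
        return "Your alarm is set."
    return "Sorry, I did not understand that."


def speak(sentence):
    # Stand-in for text-to-speech: return the text that would be synthesized.
    return sentence


def handle(audio, trigger="hey siri"):
    text = transcribe(audio)
    if not text.startswith(trigger):
        return None  # no trigger phrase, so the assistant stays silent
    request = detect_intent(text[len(trigger):])
    result = call_service(request)
    return speak(to_sentence(request, result))


print(handle("Hey Siri, what's the weather today?"))
```

Each function could be swapped for a real component without changing the others, which is the main reason to think of the assistant as a chain of separate tasks.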
The first step on the iPhone uses a neural network that detects the phrase “Hey Siri.”* This step is a two-pass process. In the first pass, a small, low-power auxiliary processor in the phone or speaker runs a simple neural network that tries to identify whether the sound is in fact “Hey Siri.” If that pass finds a likely match, the audio goes to the main processor, which runs a more complex neural network to confirm it.
The second step translates the speech to text. Speech is a waveform encoded as a sequence of numbers. To translate it to text, Apple trained a neural network on data that has speech as input and the text corresponding to that speech as output.
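The two-pass trigger detection described above follows a common cascade pattern: a cheap check runs all the time, and a costly check runs only when the cheap one fires. Here is a minimal Python sketch of that pattern. The two “models” are toy string checks standing in for neural networks, and the threshold values are assumptions, not Apple’s.

```python
def small_model(audio):
    # Toy stand-in for the small network: a crude check on the first word.
    return 0.9 if audio.lower().startswith("hey") else 0.1


def large_model(audio):
    # Toy stand-in for the larger network: a stricter check on the full phrase.
    return 0.95 if audio.lower().startswith("hey siri") else 0.2


def two_pass_trigger(audio, first_pass, second_pass,
                     first_threshold=0.5, second_threshold=0.8):
    """Run the cheap detector first; run the costly one only if it fires."""
    if first_pass(audio) < first_threshold:
        return False  # rejected cheaply; the larger model never runs
    return second_pass(audio) >= second_threshold


print(two_pass_trigger("Hey Siri, set an alarm", small_model, large_model))
print(two_pass_trigger("Hey Sarah", small_model, large_model))
print(two_pass_trigger("Good morning", small_model, large_model))
```

The design choice is about power: most sounds are rejected by the first pass, so the main processor can stay idle most of the time.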
With the exception of third-party services, all the steps in the process use a neural network. The rules for interacting with these external applications, however, require handwritten code because each service provides a specific interface and specific information. For example, Foursquare provides data about businesses like restaurants, bars, and coffee shops, and it can only return information about those businesses. If the voice assistant needs something else, like the weather for today or tomorrow, it must fetch that information from weather.com or a similar service. By combining these steps, Siri and other voice assistants help people every day with tasks like setting an alarm for the next day and getting weather forecasts.
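The handwritten code for external services can be pictured as one small adapter per service, each knowing that service’s interface and data format. The Python sketch below assumes this structure for illustration. The two service functions return canned data and do not call Foursquare, weather.com, or any real API, and every name here is hypothetical.

```python
def fake_business_service(query):
    # Canned data standing in for a business-listing service.
    return {"name": "Corner Cafe", "category": "coffee shop"}


def fake_weather_service(query):
    # Canned data standing in for a weather service.
    return {"condition": "sunny", "high_f": 75}


def business_adapter(query):
    # Handwritten rule: knows the field names this service returns.
    data = fake_business_service(query)
    return f"{data['name']} is a {data['category']} nearby."


def weather_adapter(query):
    # A different service needs different handwritten code.
    data = fake_weather_service(query)
    return f"It will be {data['condition']} with a high of {data['high_f']} degrees."


# Each intent is routed to the one adapter that can answer it.
ADAPTERS = {
    "find_business": business_adapter,
    "get_weather": weather_adapter,
}


def fulfill(intent, query):
    adapter = ADAPTERS.get(intent)
    if adapter is None:
        return "Sorry, I cannot help with that yet."
    return adapter(query)


print(fulfill("find_business", "closest coffee shop"))
print(fulfill("get_weather", "today"))
```

Supporting a new kind of request means writing another adapter and adding it to the table, which is why this part of the assistant grows by hand rather than by training.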