Arguing that you don’t care about the right to privacy because you have nothing to hide is no different than saying you don’t care about free speech because you have nothing to say.Edward Snowden*
In 2014, Tim received a request on his Facebook app to take a personality quiz called “This Is Your Digital Life.” He was offered a small amount of money and had to answer just a few questions about his personality. Tim was very excited to get money for this seemingly easy and harmless task, so he quickly accepted the invitation. Within five minutes of receiving the request on his phone, Tim logged in to the app, giving the company in charge of the quiz access to his public profile and all his friends’ public profiles. He completed the quiz within 10 minutes. A UK research facility collected the data, and Tim continued with his mundane day as a law clerk in one of the biggest law firms in Pennsylvania.
What Tim did not know was that he had just shared his and all of his friends’ data with Cambridge Analytica. This company used Tim’s data and data from 50 million other people to target political ads based on their psychographic profiles. Unlike demographic information such as age, income, and gender, psychographic profiles explain why people make purchases. The use of personal data on such a scale made this scheme, which Tim passively participated in, one of the biggest political scandals to date.
Data has become an essential part of deep learning algorithms.* Large corporations now store a lot of data from their users because that has become such a central part of building better models for their algorithms and, in turn, improving their products. For Google, it is essential to have users’ data in order to develop the best search algorithms. But as companies gather and keep all this data, it becomes a liability for them. If a person has pictures on their phone that they do not want anyone else to see, and if Apple or Google collects those pictures, their employees could have access to them and abuse the data. Even if these companies protect against their own employees having access to the data, a privacy breach could occur, allowing hackers access to people’s private data.
Hacks resulting in users’ data being released are very common. Every year, it seems the number of people affected by a given hack increases. One Yahoo hack compromised 3 billion people’s accounts.* So, all the data these companies have about their users becomes a burden. At other times, data is given to researchers, expecting the best of their intentions. But researchers are not always sensitive when handling data. That was the case with the Cambridge Analytica scandal. In that instance, Facebook provided researchers access to information about users and their friends that was mainly in their public profiles, including people’s names, birthdays, and interests.* This private company then used the data and sold it to political campaigns to target people with personalized ads based on their information.
Differential privacy is a way of obtaining statistics from a pool of data from many people without revealing what data each person provided.
Keeping sensitive data or giving data directly to researchers for creating better algorithms is dangerous. Personal data should be private and stay that way. As far back as 2006, researchers at Microsoft were concerned about users’ data privacy and created a breakthrough technique called the differential privacy, but they never used it in their products. Ten years later, Apple released products on the iPhone using this same method.
Figure: How differential privacy works.
Unlock expert knowledge.
Learn in depth. Get instant, lifetime access to the entire book. Plus online resources and future updates.
Apple implements one of the most private versions of differential privacy, called the local model. It adds noise to the data directly on the user’s device before sending it to Apple’s servers. In that way, Apple never touches the user’s true data, preventing anyone other than the user from having access to it. Researchers can analyze trends of people’s data but are never able to access the details.*
Differential privacy does not merely try to make users’ data anonymous. It allows companies to collect data from large datasets, with a mathematical proof that no one can learn about a single individual.*
Imagine that a company wanted to collect the average height of their users. Anna is 5 feet 6 inches, Bob is 5 feet 8 inches, and Clark is 5 feet 5 inches. Instead of collecting the height individually from each user, Apple collects the height plus or minus a random number. So, it would collect 5 feet 6 inches plus 1 inch for Anna, 5 feet 8 inches plus 2 inches for Bob, and 5 feet 5 inches minus 3 inches for Clark, which equals 5 feet 7 inches, 5 feet 10 inches, and 5 feet 2 inches, respectively. Apple averages these heights without the names of the users.
The average height of its users would be the same before and after adding the noise: 5 feet 6 inches. But Apple would not be collecting anyone’s actual height, and their individual information remains secret. That allows Apple and other companies to create smart models without collecting personal information from its users, thus protecting their privacy. The same technique could produce models about images on people’s phones and any other information.
Differential privacy, or keeping users’ data private, is much different from anonymization. Anonymization does not guarantee that information the user has, like a picture, is not leaked or that the individual cannot be traced back from the data. One example is to send a pseudonym of a person’s name but still transmit their height. Anonymization tends to fail. In 2007, Netflix released 10 million movie ratings from its viewers in order for researchers to create a better recommendation algorithm. They only published ratings, removing all identifying details.* Researchers, however, matched this data with public data on the Internet Movie Database (IMDb).* After matching patterns of recommendations, they added the names back to the original anonymous data. That is why differential privacy is essential. It is used to prevent user’s data from being leaked in any possible way.
Figure: Emoji usage across different languages.
This figure shows the usage percentage of each emoji over the total usage of emojis for English- and French-speaking countries. The data was collected using differential privacy. The distribution of the usage of emojis in English-speaking countries differs from that of French-speaking nations. That might reveal underlying cultural differences that translate to how each culture uses language. In this case, how frequently they use each emoji is interesting.
Apple started using differential privacy to improve its predictive keyboard,* the Spotlight search, and the Photos app. It was able to advance these products without obtaining any specific user’s data. For Apple, privacy is a core principle. Tim Cook, Apple’s CEO, has time and time again called for better data privacy regulation.* The same data and algorithms that can be used to enhance people’s lives can be used as a weapon by bad actors.
Apple predictive keyboard, with data it collected with differential privacy, helps users by offering the next word that should be in the text based on its models. Apple has also been able to create models for what is inside people’s pictures on their iPhones without having actual users’ data. It is possible for users to search for specific items like “mountains,” “chairs,” and “cars” in their pictures. And, all of that is served by models developed using differential privacy. Apple is not the only one using differential privacy in its products. In 2014, Google released a system for its Chrome web browser to figure out users’ preferences without invading their privacy.
But Google has also been working with other technologies to produce better models while continuing to keep users’ data private.
Google developed another technique called federated learning.* Instead of collecting statistics on users, Google developed an in-house model and then deployed it to each of the users’ computers, phones, and applications. Then, the model is trained based on the data generated by the user or that is already present.
For example, if Google wants to create a neural network to identify objects in pictures and has a model of how “cats” look but not how “dogs” look, then the neural network is sent to a user’s phone that contains many pictures of dogs. From that, it learns what dogs look like, updating its weights. Then, it summarizes all of the changes in the model as a small, focused update. The update is sent to the cloud, where it is averaged with other users’ updates to improve the shared model. Everyone’s data advances the model.
Federated learning* works without the need to store user data in the cloud, but Google is not stopping there. They have developed a secure aggregation protocol that uses cryptographic techniques so that they can only decrypt the average update if hundreds or thousands of users have participated.* That guarantees that no individual phone’s update can be inspected before averaging it with other users’ data, thus guarding people’s privacy. Google already uses this technique in some of its products, including the Google keyboard that predicts what users will type. The product is well-suited for this method since users type a lot of sensitive information into their phone. The technique keeps that data private.
This field is relatively new, but it is clear that these companies do not need to keep users’ data to create better and more refined deep learning algorithms. In the years to come, more hacks will happen, and users’ data that has been stored to improve these models will be shared with hackers and other parties. But that does not need to be the norm. Privacy does not necessarily need to be traded to get better machine learning models. Both can co-exist.