Data Is the New Oil



Updated November 2, 2022

You’re reading an excerpt of Making Things Think: How AI and Deep Learning Power the Products We Use, by Giuliano Giacaglia.

Data is the new oil. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value.
Clive Humby*

Data is key to deep learning, and one of the most important datasets, ImageNet, created by Fei-Fei Li, marked the beginning of the field. It is used both to train neural networks and to benchmark them against one another.

Deep learning is a revolutionary field, but for it to work as intended, it requires data.* The term for these large datasets and the work around them is Big Data, which refers to the abundance of digital data. Data is as important for deep learning algorithms as the architecture of the network itself, the software. Acquiring and cleaning the data is one of the most valuable aspects of the work. Without data, neural networks cannot learn.*

Most of the time, researchers can use the data given to them directly, but there are many instances where the data is not clean. That means it cannot be used directly to train the neural network because it contains examples that are not representative of what the algorithm is meant to classify. It might contain bad data, such as black-and-white images when the goal is a neural network that locates cats in color images. Another problem is data that is incomplete or corrupted. For example, when classifying images of people as male or female, some pictures might lack the needed tag, and others might have a tag corrupted by a misspelling like “ale” instead of “male.” Even though these might seem like unlikely scenarios, they happen all the time. Handling these problems and cleaning up the data is known as data wrangling.
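The cleaning steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any particular research team: the record layout (`path`, `mode`, `label`), the `VALID_LABELS` set, and the `KNOWN_TYPOS` correction table are all hypothetical, and the mode names `RGB` and `L` simply follow a common convention for color and grayscale images.

```python
# Hypothetical records: each image has a color mode and a free-text label.
VALID_LABELS = {"male", "female"}
KNOWN_TYPOS = {"ale": "male", "femal": "female"}  # assumed correction table


def clean_records(records):
    """Return only the records that are usable for training."""
    cleaned = []
    for record in records:
        # Drop images that do not match the task, e.g. black-and-white ones.
        if record.get("mode") != "RGB":
            continue
        # Normalize the label and repair known misspellings.
        label = (record.get("label") or "").strip().lower()
        label = KNOWN_TYPOS.get(label, label)
        # Drop records whose tag is missing or still unrecognized.
        if label not in VALID_LABELS:
            continue
        cleaned.append({**record, "label": label})
    return cleaned


records = [
    {"path": "a.jpg", "mode": "RGB", "label": "ale"},       # misspelled tag
    {"path": "b.jpg", "mode": "L", "label": "female"},      # grayscale image
    {"path": "c.jpg", "mode": "RGB", "label": None},        # missing tag
    {"path": "d.jpg", "mode": "RGB", "label": " Female "},  # messy but valid
]
cleaned = clean_records(records)
```

In this toy run only `a.jpg` (with its label repaired to "male") and `d.jpg` survive. In practice, deciding which misspellings are safe to correct automatically is itself a judgment call that usually needs human review.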

Researchers also sometimes have to fix problems with how data is represented. In some places, the data might be expressed one way, and in others the same data can be described in a completely different way. For example, a disease like diabetes might be coded as 3 in one database and as 5 in another. This is one reason for the considerable effort in industries to create standards for sharing data more easily. For example, Fast Healthcare Interoperability Resources (FHIR) was created by the international standards organization Health Level Seven International to standardize the exchange of electronic health records.
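One simple way to reconcile such differences is a lookup table that translates each database's local code into a shared vocabulary. The sketch below is a toy illustration and is not FHIR: the codes 3 and 5 come from the example above, while the database names `hospital_a` and `hospital_b` and the shared term are invented.

```python
# Hypothetical mapping from (database, local code) to a shared term.
LOCAL_TO_SHARED = {
    ("hospital_a", 3): "diabetes",
    ("hospital_b", 5): "diabetes",
}


def to_shared_code(source, local_code):
    """Translate a database-specific code into the shared vocabulary."""
    try:
        return LOCAL_TO_SHARED[(source, local_code)]
    except KeyError:
        # Fail loudly rather than guess: an unmapped code needs human review.
        raise ValueError(f"no mapping for code {local_code!r} from {source!r}")


# After translation, the two databases agree on what the record means.
same = to_shared_code("hospital_a", 3) == to_shared_code("hospital_b", 5)
```

Raising an error for unmapped codes is a deliberate choice here: silently passing an unknown code through would reintroduce exactly the ambiguity the mapping is meant to remove.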

Standardizing data is essential, but selecting the correct input is also important because the algorithm is created based on the data.* Choosing that data is not easy. One of the problems that can occur when selecting data is that it can be biased in some way, creating a problem known as selection bias. That means that the data used to train the algorithm does not necessarily represent the entire space of possibilities. The saying in the industry is, “Garbage in, garbage out.” That means that if the data entered into the system is not correct, then the model will not be accurate.
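A rough first check for selection bias is to compare how often each class appears in the training data with how often it is expected to appear in the real world. The sketch below assumes the expected shares are known, which is often the hard part; the function names, the example numbers, and the 10-point `tolerance` are arbitrary choices made for illustration.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of examples belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def underrepresented(train_labels, expected_shares, tolerance=0.10):
    """List classes whose share of the training data falls short of the
    expected share by more than `tolerance` (an arbitrary threshold)."""
    actual = class_shares(train_labels)
    return sorted(
        label
        for label, expected in expected_shares.items()
        if expected - actual.get(label, 0.0) > tolerance
    )


train_labels = ["cat"] * 90 + ["dog"] * 10
expected_shares = {"cat": 0.5, "dog": 0.5}  # assumed real-world mix
flagged = underrepresented(train_labels, expected_shares)  # ["dog"]
```

A check like this only catches imbalance along dimensions that someone thought to label. Bias in unlabeled attributes, such as lighting or camera angle, would pass unnoticed.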


Fei-Fei Li, who was the director of the Stanford Artificial Intelligence Laboratory and also the Chief Scientist of AI/ML at Google Cloud, could see data was essential to the development of machine learning algorithms early on,* before many of her colleagues.

Figure: Professor Fei-Fei Li.


Li realized that to make better algorithms and more performant neural networks, more and better data was needed and that better algorithms would not come without that data. At the time, the best algorithms could perform well with the data that they were trained and tested with, which was very limited and did not represent the real world. She realized that for the algorithms to perform well, data needed to resemble actuality. “We decided we wanted to do something that was completely historically unprecedented,” Li said, referring to a small team initially working with her. “We’re going to map out the entire world of objects.”

To solve the problem, Li constructed one of the most extensive datasets for deep learning to date, ImageNet. The dataset was created, and the paper describing the work was published in 2009 at one of the key computer vision conferences, Computer Vision and Pattern Recognition (CVPR), in Miami, Florida. The dataset was very useful for researchers and because of that, it became more and more famous, providing the benchmark for one of the most important annual deep learning competitions, which tested and trained algorithms to identify objects with the lowest error rate. ImageNet became the most significant dataset in the computer vision field for a decade and also helped boost the accuracy of algorithms that classified objects in the real world. In only seven years, the winning algorithms’ accuracy in classifying objects in images increased from 72% to nearly 98%, overtaking the average human’s ability.

But ImageNet was not the overnight success many imagine. It required a lot of sweat from Li, beginning when she taught at the University of Illinois Urbana-Champaign. She was dealing with a problem that many other researchers shared. Most algorithms were overfitting to the dataset given to them, which left them unable to generalize beyond it. The datasets presented to these algorithms contained too few examples, so the models did not have enough information about all the cases they would meet in the real world. Li, however, figured out that if she generated a dataset that was as complex as reality, then the models should perform better.

It is easier to identify a dog if you see a thousand pictures of different dogs, at different camera angles and in different lighting conditions, than if you only see five dog pictures. In fact, it is a well-known rule of thumb that algorithms can extract the right features from images if there are around 1,000 images for a certain type of object.

Li started looking for other attempts to create a representation of the real world, and she came across a project, WordNet, created by Professor George Miller. WordNet was a dataset with a hierarchical structure of the English language. It resembled a dictionary, but instead of having an explanation for each word, it had a relation to other words. For example, the word “monkey” is underneath the word “primate,” which is in turn underneath the word “mammal.” In this way, the dataset contained the relation of all the words among others.
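The hierarchy described above can be pictured as a table of "is a kind of" links that can be followed upward from any word. The sketch below is a hand-made toy, not the real WordNet data or its API: it uses the monkey, primate, and mammal example from the text and adds one assumed link from "mammal" to "animal".

```python
# Toy "is a kind of" links. Real WordNet is far larger, and a word can have
# several senses; this table is a hand-made illustration.
PARENT = {
    "monkey": "primate",
    "primate": "mammal",
    "mammal": "animal",  # assumed extra link, not from the text
}


def ancestors(word):
    """Walk up the hierarchy and return every broader term, nearest first."""
    chain = []
    while word in PARENT:
        word = PARENT[word]
        chain.append(word)
    return chain


chain = ancestors("monkey")  # ["primate", "mammal", "animal"]
```

Attaching a set of images to each entry in such a table, rather than a definition, gives the basic shape of a hierarchical image dataset.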

After studying and learning about WordNet, Li met with Professor Christiane Fellbaum, who worked with Miller on WordNet. She gave Li the idea to add an image and associate it to each word, creating a new hierarchical dataset based on images instead of words. Li expanded on the idea—instead of adding one image per word, she added many images per word.

As an assistant professor at Princeton, she built a team to tackle the ImageNet project. Li’s first idea was to hire students to find images and add them to ImageNet manually. But she realized that it would become too expensive and take too much time for them to finish the project. From her estimates, it would take a century to complete the work, so she changed strategies. Instead, she decided to get the images from the internet. She could write algorithms to find the pictures, and humans would choose the correct ones. After months working on this idea, she found that the problem with this strategy was that the images chosen were limited by the algorithms that picked them. Unexpectedly, the solution came when Li was talking to one of her graduate students, who mentioned a service that allows humans anywhere in the world to complete small online tasks very cheaply. With Amazon Mechanical Turk, she found a way to scale and have thousands of people find the right images for not too much money.

Amazon Mechanical Turk was the solution, but a problem still existed. Not all the workers spoke English as their first language, so there were issues with specific images and the words associated with them. Some words were harder for these remote workers to identify. Not only that, but words like “baboon” confused workers, who did not know exactly which images represented the word. So, her team created a simple algorithm to figure out how many people had to look at each image for a given word. More complex words like “baboon” required more people to check each image, while simpler words like “cat” needed only a few.
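The text does not describe the team's algorithm in detail, so the following is only a guess at the general idea: require more independent votes for harder words before accepting or rejecting an image. The per-word vote counts, the default, and the 80% agreement threshold are all invented for illustration.

```python
# Hypothetical rule: harder words need more votes before a decision is made.
REQUIRED_VOTES = {"cat": 2, "baboon": 5}
DEFAULT_VOTES = 3
MIN_AGREEMENT = 0.8  # assumed fraction of "yes" votes needed to accept


def review_image(word, votes):
    """Decide what to do with an image given the yes/no votes so far.

    `votes` is a list of booleans, one per worker who checked the image.
    Returns "need_more_votes", "accept", or "reject".
    """
    needed = REQUIRED_VOTES.get(word, DEFAULT_VOTES)
    if len(votes) < needed:
        return "need_more_votes"
    share_yes = sum(votes) / len(votes)
    return "accept" if share_yes >= MIN_AGREEMENT else "reject"


easy = review_image("cat", [True, True])                       # "accept"
pending = review_image("baboon", [True, True])                 # "need_more_votes"
hard = review_image("baboon", [True, True, False, True, False])  # "reject"
```

The trade-off is cost against label quality: every extra vote is another paid task, so spending votes only where workers tend to disagree keeps the budget focused on the ambiguous words.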

With Mechanical Turk, creating ImageNet took less than three years, much less than the initial estimate with only undergraduates. The resulting dataset had around 3 million images separated into about 5,000 “words.” People were not impressed with her paper or dataset, however, because they did not believe that more and more refined data led to better algorithms. But most of these researchers’ opinions were about to change.

The ImageNet Challenge

To prove her point, Li had to show that her dataset led to better algorithms. To achieve that, she had the idea of creating a challenge based on the dataset to show that the algorithms using it would perform better overall. That is, she had to make others train their algorithms with her dataset to show that they could indeed perform better than models that did not use her dataset.

The same year she published the paper in CVPR, she contacted a researcher named Alex Berg and suggested that they work together to publish papers to show that algorithms using the dataset could figure out whether images contained particular objects or animals and where they were located in the picture. In 2010 and 2011, they published five papers using ImageNet.* The first became the benchmark of how algorithms would perform on these images. To make it the benchmark for other algorithms, Li reached out to the team supporting one of the most well-known image recognition datasets and benchmark standards, PASCAL VOC. They agreed to work with Li and added ImageNet as a benchmark for their competition. The competition used a dataset called PASCAL that only had 20 classes of images. By comparison, ImageNet had around 5,000 classes.

As Li predicted, the algorithms that were trained using the ImageNet dataset performed better and better as the competition continued. Researchers learned that algorithms started performing better for other datasets when the models were first trained using ImageNet and then fine-tuned for another task. A detailed discussion on how this worked for skin cancer is in a later section.

A major breakthrough occurred in 2012. Geoffrey Hinton, a pioneer of deep learning, together with Ilya Sutskever and Alex Krizhevsky, submitted a deep convolutional neural network architecture called AlexNet (still used in research to this day), “which beat the field by a whopping 10.8 percentage point margin.”* That marked the beginning of deep learning’s boom, which would not have happened without ImageNet.

ImageNet became the go-to dataset for the deep learning revolution and, more specifically, that of the convolutional neural networks (CNNs) led by Hinton. ImageNet not only led the deep learning revolution but also set a precedent for other datasets. Since its creation, dozens of new datasets have been introduced with more abundant data and more precise classification. Now, they allow researchers to create better models. Not only that, but research labs have focused on releasing and maintaining new datasets for other fields like the translation of texts and medical data.

Figure: Inception Module included in GoogLeNet.

In 2015, Google released a new convolutional neural network called Inception, or GoogLeNet.* It had far fewer parameters than other top-performing neural networks, but it performed better. Instead of applying a single filter at each layer, Google added an Inception Module, which runs a few filters in parallel. It showed once again that the architecture of neural networks is important.

Figure: ImageNet Top-5 accuracy over time. Top-5 accuracy asks whether the correct label is among the classifier’s top five predictions.
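The top-5 metric in the figure caption can be computed as follows. This is a minimal sketch in plain Python: the function name `top_k_accuracy`, the dictionary-of-scores input format, and the tiny three-class example are assumptions made for illustration (with only three classes, the example uses k=1 and k=2 rather than k=5).

```python
def top_k_accuracy(scores, true_labels, k=5):
    """Fraction of examples whose true label is among the k highest scores.

    `scores` is a list of dicts mapping label -> score, one dict per image.
    """
    hits = 0
    for image_scores, truth in zip(scores, true_labels):
        # Rank the labels from highest to lowest score and keep the first k.
        top_k = sorted(image_scores, key=image_scores.get, reverse=True)[:k]
        if truth in top_k:
            hits += 1
    return hits / len(true_labels)


scores = [
    {"cat": 0.6, "dog": 0.3, "fox": 0.1},
    {"cat": 0.5, "dog": 0.1, "fox": 0.4},
]
true_labels = ["dog", "dog"]
top1 = top_k_accuracy(scores, true_labels, k=1)  # 0.0
top2 = top_k_accuracy(scores, true_labels, k=2)  # 0.5
```

The example shows why top-k scores are more forgiving than top-1: the first image counts as correct at k=2 even though "dog" was only the classifier's second guess.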

ImageNet is considered solved, reaching an error rate lower than the average human and achieving superhuman performance for figuring out if an image contains an object and what kind of object that is. After nearly a decade, the competition to train and test models on ImageNet came to an end. Li tried to remove the dataset from the internet, but big companies like Facebook pushed back since they used it as their benchmark.

But since the end of the ImageNet competition, many other datasets have been created based on the millions of images, voice clips, and text snippets entered and shared on online platforms every day. People sometimes take for granted that these datasets, which are intensive to collect, assemble, and vet, are free. Being open and free to use was an original tenet of ImageNet that will outlive the challenge and likely even the dataset. “One thing ImageNet changed in the field of AI is suddenly people realized the thankless work of making a dataset was at the core of AI research,” Li said. “People really recognize the importance the dataset is front and center in the research as much as algorithms.”

Data Privacy

Arguing that you don’t care about the right to privacy because you have nothing to hide is no different than saying you don’t care about free speech because you have nothing to say.
Edward Snowden*

In 2014, Tim received a request on his Facebook app to take a personality quiz called “This Is Your Digital Life.” He was offered a small amount of money and had to answer just a few questions about his personality. Tim was very excited to get money for this seemingly easy and harmless task, so he quickly accepted the invitation. Within five minutes of receiving the request on his phone, Tim logged in to the app, giving the company in charge of the quiz access to his public profile and all his friends’ public profiles. He completed the quiz within 10 minutes. A UK research facility collected the data, and Tim continued with his mundane day as a law clerk in one of the biggest law firms in Pennsylvania.

What Tim did not know was that he had just shared his and all of his friends’ data with Cambridge Analytica. This company used Tim’s data and data from 50 million other people to target political ads based on their psychographic profiles. Unlike demographic information such as age, income, and gender, psychographic profiles explain why people make purchases. The use of personal data on such a scale made this scheme, which Tim passively participated in, one of the biggest political scandals to date.
