Chad Mills

The data AI learns from

I recently wrote about the “input data” humans use for learning. This week I’m turning my attention to computers, showing what the input data to artificial intelligence systems looks like.

We’ll see that for some simple but abstract tasks like spam filtering, computers have advanced input data painstakingly crafted by humans, while more modern algorithms have made it possible to achieve similar goals using much larger quantities of raw, unprocessed data.

Most AI has historically worked from advanced data

Until the last few years, the vast majority of AI systems were narrow, targeted systems. They did one simple thing, like filtering spam, recommending a movie to watch, or predicting how likely you are to click on an advertisement.

How does a spam filter work?

Well, Paul Graham famously had a plan for spam, which basically counted how often each word shows up in spam and in legitimate email, then used those statistics to predict whether a new email is spam. That works great until spammers understand how this system works and adapt the words they use.
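To make the word-counting idea concrete, here is a minimal sketch in Python. It is a simplified naive Bayes classifier with add-one smoothing, not Graham’s exact formula, and the tiny training set and the function names `train` and `spam_probability` are made up for illustration.

```python
import math
from collections import Counter

def train(messages):
    """Count words separately for spam and non-spam messages.

    `messages` is a list of (text, is_spam) pairs.
    """
    word_counts = {True: Counter(), False: Counter()}
    message_counts = {True: 0, False: 0}
    for text, is_spam in messages:
        word_counts[is_spam].update(text.lower().split())
        message_counts[is_spam] += 1
    return word_counts, message_counts

def spam_probability(text, word_counts, message_counts):
    """Combine per-word statistics into one spam probability."""
    vocabulary = set(word_counts[True]) | set(word_counts[False])
    spam_total = sum(word_counts[True].values())
    ham_total = sum(word_counts[False].values())
    # Start from how common spam is overall, then adjust word by word.
    log_odds = math.log(message_counts[True] / message_counts[False])
    for word in text.lower().split():
        # Add-one smoothing so unseen words do not produce a zero probability.
        p_word_spam = (word_counts[True][word] + 1) / (spam_total + len(vocabulary))
        p_word_ham = (word_counts[False][word] + 1) / (ham_total + len(vocabulary))
        log_odds += math.log(p_word_spam / p_word_ham)
    return 1 / (1 + math.exp(-log_odds))

# Toy training data, invented for this example.
training = [
    ("win free money now", True),
    ("free prize click now", True),
    ("meeting notes attached", False),
    ("lunch tomorrow at noon", False),
]
word_counts, message_counts = train(training)

print(spam_probability("free money", word_counts, message_counts))
print(spam_probability("lunch tomorrow", word_counts, message_counts))
```

The first message scores above 0.5 and the second below it. The weakness is visible here too: a spammer who pads a message with words the filter has only seen in legitimate mail pulls the score down.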

Similar technology motivated Bill Gates to claim spam would be solved in 2 years—back in 2004. I was on the team he created to solve the problem.

We were still working on spam (and many other problems) over a decade later. Fortunately, we did dramatically reduce the amount of spam out there. It wasn’t solved, but it was substantially mitigated.

The successful approaches involve carefully analyzing how spammers make their money and how they conduct their attacks, then collecting just the right data so that AI-based systems can easily spot problems.

For example, it involves looking at where spam comes from: which computers are sending it, where they are located, how much mail they have sent before, how often users have read their messages, how similar their messages are to one another, and so on.

Advanced data helps AI work better

In this spam filtering case, it’s easy to just look at which words appear in an email and predict whether the message is spam from that. But the spammers learn what words are good, use them, and beat the filter. And when the filter catches up, they switch to whichever new words it thinks are good.

The more advanced decision factors, like how similar messages from the same computer look, are much stronger signals of whether a message is spam or not. Spammers make a small fraction of a cent for each email they send, so they depend on sending in bulk. It’s smart to leverage this to decide if an email is spam.

These more advanced attributes are more stable over time and across spam campaigns. They generalize well. They’re fundamentally hard for the spammers to work around.

There are other factors like this. Sites spammers link to can be detected and blocked, so they frequently need to set up new sites; the age of the site is an important attribute. Most email senders have been sending for a while, so the sending history is also helpful.
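As a rough illustration of what these hand-crafted attributes might look like as data, the sketch below turns a record about one sending computer into a few numbers. The `SenderRecord` fields, the feature names, and the use of `difflib` to measure message similarity are all assumptions made for this example, not a description of any real spam filter.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher
from itertools import combinations

@dataclass
class SenderRecord:
    """Hypothetical record of what a mail service knows about one sender."""
    first_seen: date
    recent_messages: list[str]
    linked_site_registered: date

def sender_features(record, today):
    """Compute a few sender-level attributes of the kind described above."""
    pairs = list(combinations(record.recent_messages, 2))
    if pairs:
        # Average text similarity across every pair of recent messages.
        similarity = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)
    else:
        similarity = 0.0
    return {
        "message_count": len(record.recent_messages),
        "avg_message_similarity": similarity,
        "sender_age_days": (today - record.first_seen).days,
        "linked_site_age_days": (today - record.linked_site_registered).days,
    }

today = date(2024, 1, 3)
bulk = SenderRecord(
    first_seen=date(2024, 1, 1),
    recent_messages=[
        "Buy cheap pills now at site A",
        "Buy cheap pills now at site B",
        "Buy cheap pills now at site C",
    ],
    linked_site_registered=date(2023, 12, 30),
)
personal = SenderRecord(
    first_seen=date(2019, 1, 1),
    recent_messages=["Are we still on for lunch?", "Quarterly report attached."],
    linked_site_registered=date(2010, 1, 1),
)

bulk_features = sender_features(bulk, today)
personal_features = sender_features(personal, today)
print(bulk_features)
print(personal_features)
```

None of these numbers come from the words of a single email. They come from history and outside data that a person decided was worth collecting.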

How would a computer figure this out? The algorithms need a well-defined input dataset provided by humans. And most of this data isn’t in the email itself; it’s outside data you can get and link to information in the email. Every problem will require different insights, so automating this essentially means having computers understand all human knowledge, which is about as science fiction as it gets.

A human can tell at a glance whether a message is spam. Computers need humans to figure out which factors are the most important to pay attention to. AI can then infer from data how much to rely on each factor, but it needs high-quality data to get started.
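Here is a small sketch of that last step, where the algorithm infers how much to rely on each factor. It uses logistic regression trained by gradient descent, which is one common choice and not necessarily what any production filter uses. The two features (message similarity and whether the word “free” appears) and all of the data values are invented for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, learning_rate=0.5, epochs=1000):
    """Learn one weight per feature with plain batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            prediction = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = prediction - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias

def predict(x, weights, bias):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Each row: [avg_message_similarity, contains_word_free]. Label 1 means spam.
rows = [
    [0.95, 1], [0.90, 0], [0.92, 1], [0.88, 0],  # spam: near-identical bulk mail
    [0.10, 1], [0.20, 0], [0.15, 1], [0.05, 0],  # legitimate mail
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = train_logistic(rows, labels)
print("similarity weight:", weights[0])
print("'free' weight:", weights[1])
```

In this toy data the word “free” appears equally often in both classes, so the learned weight on it stays small while the weight on similarity grows large. A human chose the two factors; the algorithm only decided how much each one matters.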

Sensors are primitive

More recently, a technology called deep learning has taken over machine learning, which is itself by far the most successful subfield of artificial intelligence.

This technology is capable of amazing things, and it first rose to prominence in a widely researched contest to recognize objects in images.

These systems take images as input. This is raw, unprocessed input. Image sensors record a color for each pixel in an image. Grouping pixels together, figuring out which correspond to shapes, and then identifying the object in the image is a challenging problem for a computer.
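To show how bare this kind of input is, the sketch below builds a tiny 3-by-3 image by hand and flattens it into the list of numbers an algorithm would actually receive. The image contents and the helper names `flatten` and `raw_input_size` are made up for illustration.

```python
# Each pixel is a (red, green, blue) triple with values from 0 to 255.
WHITE = (255, 255, 255)
RED = (255, 0, 0)

# A 3x3 image of a red plus sign on a white background.
image = [
    [WHITE, RED, WHITE],
    [RED, RED, RED],
    [WHITE, RED, WHITE],
]

def flatten(image):
    """Turn rows of (red, green, blue) pixels into one flat list of numbers."""
    return [channel for row in image for pixel in row for channel in pixel]

def raw_input_size(width, height, channels=3):
    """How many separate numbers an image of this size contains."""
    return width * height * channels

pixels = flatten(image)
print(len(pixels))
print(pixels[:6])
print(raw_input_size(224, 224))
```

Nothing in that list says “plus sign.” Any notion of edges, shapes, or objects has to be worked out by the algorithm from the numbers alone.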

Deep learning has enabled systems to do well without a human choosing the most important factors, as was needed in the spam filtering case. For these systems to work, though, they need massive amounts of data.

They start learning from what amounts to a bunch of disconnected sensations: pixels in an image. It’s the algorithm’s job not just to learn from well-chosen data but also to put the little sensations together into coherent objects. This is an incredibly challenging starting point to work from.

At present, these algorithms are pretty good at recognizing objects in images, and even classifying which human concepts they fall under—like dog, horse, tree, etc. This includes some specific concepts like the monkey-bread tree.

Note that this is feasible because it doesn’t depend on bringing in outside information. These algorithms rely on shapes, but they learn those shapes from many other images in which humans have already identified the objects.

These systems generally work best when all the data needed to make the decision is self-contained, without needing outside resources. They work especially well for perception-related tasks, like object recognition or speech recognition.


There are two very different approaches to providing data to AI algorithms.

For problems where learning requires external knowledge that isn’t contained in the dataset itself, the data AI algorithms use is advanced, but only because humans painstakingly craft special datasets for each application, figuring out what signals the algorithm can use to make decisions and making it easy for the AI.

On the other hand, more modern algorithms are able to work from raw data such as images or audio recordings, but only for self-contained tasks that don’t require substantial outside knowledge of the world.

In a future post, I’ll analyze the differences between the input data humans and computers use for their learning, including some implications for how we should think about artificial intelligence and how it compares to human intelligence.

About the author


Chad currently leads applied research, ML engineering, and computational linguistics teams at Grammarly.

He's previously led ML and data science teams at companies large and small, including working on News Feed at Facebook and on Windows and Outlook at Microsoft.
