Machine learning often uses lots of fancy math, but what it’s doing is very simple.
Let’s focus on classification, a common type of machine learning where the system decides between two alternatives. This includes spam filtering, face recognition, fraud detection, and many others.
Classification is a category of machine learning algorithms; it includes many different approaches to achieving this same goal.
Each algorithm makes assumptions about the way the world works. It will take in a bunch of emails labeled as spam or not, and then attempt to generalize the patterns observed in this data. Then, when it sees a new email, it can predict whether or not it’s spam.
The assumptions are different for each algorithm. One popular algorithm assumes that emails that are very similar to one another will tend to have the same label. It also assumes that the emails seen while learning will be similar to the ones it encounters later on in the real world.
Another popular algorithm assumes that a simple combination of a handful of attributes is enough to determine whether or not an email is spam. It could conclude whether a message is spam, for example, based on the five or six most important words in the email.
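This "handful of attributes" assumption can be sketched as a toy keyword score. The words and weights below are invented for illustration, not taken from any real spam filter:

```python
# Toy sketch: score an email using a few important words.
# These words and weights are made up for the example.
SPAM_WEIGHTS = {"free": 2.0, "winner": 3.0, "urgent": 1.5,
                "meeting": -2.0, "lunch": -1.0}

def looks_like_spam(email_text, threshold=2.0):
    """Sum the weights of known words; flag the email if the total is high."""
    words = email_text.lower().split()
    score = sum(SPAM_WEIGHTS.get(w, 0.0) for w in words)
    return score >= threshold

print(looks_like_spam("free winner claim your prize"))  # True
print(looks_like_spam("lunch meeting moved to noon"))   # False
```

A real algorithm would learn which words matter and how much from labeled data, rather than having the weights written in by hand.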
The assumptions an algorithm makes are called its inductive bias. This bias is typically very reasonable, and it tends to work well.
For example, consider the task of determining whether there is water at a specific location on a map of the world.
Given many example data points scattered across the map, assuming the answer is the same for nearby points would be a good assumption. It would make some mistakes, but for points out in the middle of the ocean or within a large land mass, this assumption would usually be right.
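This nearby-points assumption is the idea behind nearest-neighbor classification. Here is a minimal sketch, with made-up coordinates standing in for locations on the map:

```python
import math
from collections import Counter

def knn_predict(examples, query, k=3):
    """Predict a label for `query` by majority vote among the k nearest
    labeled points, assuming nearby points share the same label."""
    # examples: list of ((x, y), label) pairs
    nearest = sorted(examples, key=lambda e: math.dist(e[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy "map": one cluster of water points, one cluster of land points.
data = [((0, 0), "water"), ((1, 1), "water"), ((0, 2), "water"),
        ((9, 9), "land"), ((8, 8), "land"), ((9, 7), "land")]

print(knn_predict(data, (0.5, 0.5)))  # a point deep in the water region
print(knn_predict(data, (8.5, 8.5)))  # a point deep in the land region
```

Points near a coastline would be harder, which is exactly where this assumption makes its mistakes.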
While the bias of an algorithm tends to work well, this isn’t always true.
For example, consider a particularly challenging problem where all odd numbers are blue and all even numbers are red. An algorithm trying to learn this pattern would struggle if it assumed all nearby numbers would have the same color.
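Sticking with the nearest-neighbor idea, a tiny sketch shows the assumption failing on this parity problem (the numbers and colors are the made-up example from above):

```python
def nearest_label(examples, query):
    """1-nearest-neighbor: predict the label of the closest training number."""
    return min(examples, key=lambda e: abs(e[0] - query))[1]

# Odd numbers are blue, even numbers are red.
train = [(n, "blue" if n % 2 else "red") for n in range(1, 11)]

prediction = nearest_label(train, 11)  # closest training number is 10 (red)
print(prediction)  # "red" -- but 11 is odd, so the true color is blue
```

The nearest number to any new number always has the opposite parity, so the nearby-numbers assumption is wrong essentially every time on this problem.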
A clever man named David Wolpert proved that the assumptions algorithms make pose a major challenge for machine learning researchers: it is impossible to come up with one algorithm that can be successful on every problem.
This development is called the No Free Lunch Theorem.
Framed another way, if you give me a machine learning algorithm, I can produce a dataset it performs poorly on.
This theorem poses a fundamental issue for machine learning research: if no one algorithm can do well across all problems, it becomes hard to even compare algorithms fairly.
Many people building machine learning systems simply try out a bunch of models and see which one is best on a specific problem. While machine learning practitioners develop a sense of when a particular algorithm may work well, there is no systematic solution to this problem.
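That try-everything workflow can be sketched in a few lines. This example uses scikit-learn on synthetic data purely for illustration; the particular models chosen are arbitrary:

```python
# Sketch of the common "try several models, keep the best" workflow,
# using scikit-learn on a synthetic classification dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": GaussianNB(),
    "k-nearest neighbors": KNeighborsClassifier(),
    "support vector machine": SVC(),
}
for name, model in models.items():
    accuracy = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: {accuracy:.1%}")
```

In practice you would compare the scores and keep whichever model does best on held-out data for your specific problem.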
When trying a variety of algorithms, though, the results are very different from what you might expect given this theory.
Let’s consider a particular, unremarkable example of what typically happens. In a blog post, Vijaya Beeravalli tried many different algorithms on the same dataset for predicting whether a credit card customer would fail to make a payment.
Don’t worry about the names of the particular algorithms used. What is important is the range of accuracy scores across a wide range of algorithms. Here’s what he found:
| Algorithm | Accuracy |
| --- | --- |
| Support Vector Machine | 80.9% |
| Linear Discriminant Analysis | 80.9% |
| Modified Gradient Boosting (XGBoost) | 81.8% |
Out of 10 algorithms tried, 9 have accuracy scores between 80.4% and 82.1%, a range of less than two percentage points. The one algorithm outside this—with Naïve in its name and which frequently performs poorly across many tasks—trails these by almost 5%. Still, all ten algorithms are within a 6.5% range.
With a theory indicating algorithms have inherent biases and no one algorithm can do well at all problems, it should be surprising to see that every algorithm tried on a particular problem does pretty well.
A Structured World
Imagine being a newborn infant, whose eyes don’t yet focus. The world must seem like a pretty incomprehensible place. Everything must seem new and different, with little structure to make sense of.
With eyes that focus, however, we can see an incredible amount of structure in the world. Consider a particular tree, bug, or computer—these are well-defined objects that have a particular shape, a certain set of colors and textures, and many other properties we can identify.
For an algorithm working with image data, many different assumptions could capture aspects of this same structure. The algorithm could assume nearby pixels in an image tend to be related to one another. Or it could assume that things can often be identified with a few key properties. Or that similar things will tend to have many less important properties in common. Or that lines tend to indicate the boundaries of objects.
Some of these assumptions may be better than others, and combinations of these assumptions may be better than one individually, but many different assumptions may reveal the same structure from different perspectives.
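The nearby-pixels assumption, for instance, can be checked directly on a toy image; the pixel values below are invented, representing a dark region beside a bright one:

```python
# Toy check of the "nearby pixels are related" assumption: in an image
# with a dark region next to a bright one, most horizontally adjacent
# pixel pairs have similar values.
image = [
    [0, 0, 0, 9, 9],
    [0, 0, 0, 9, 9],
    [0, 0, 0, 9, 9],
]
diffs = [abs(row[i] - row[i + 1]) for row in image for i in range(len(row) - 1)]
similar = sum(d <= 1 for d in diffs) / len(diffs)
print(f"{similar:.0%} of adjacent pixel pairs are similar")
```

Only the pairs straddling the boundary between the two regions differ sharply, which is the other assumption at work: lines of abrupt change tend to mark the edges of objects.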
Next, consider an algorithm working with text data. This seems like a completely different problem. When thinking of patterns in text, the most obvious ones that come to mind are the rules for putting words together into sentences: grammar.
Consider the following “sentence” formed by choosing seven random words: “Obtainable cushion occupy far-flung detailed exchange minister.” This doesn’t obey the rules of grammar, and is similar to imagining a completely unstructured world through the eyes of a newborn.
Grammar is not all of the structure needed for writing to make sense, though. Here’s a minimally-modified version of that same sentence which is entirely grammatical but nonsensical: “Obtainable cushions occupy the far-flung, detailed exchange minister.”
The following sentence makes much more sense: “The minister sits on a comfortable cushion.”
Why does the latter seem so much more plausible? We don’t typically think of cushions as being difficult to obtain; on the other hand, cushions are created specifically to be comfortable. Being occupied by a cushion seems odd, while sitting on one is again using it for its intended purpose.
The words in a text denote objects, properties, actions, and other aspects of our structured world. The structure in text goes far beyond simply the rules of writing.
Algorithms working with text data are thus deeply connected to the image algorithm example. They both depend on the same sorts of structure that exist in the real world.
Math in Algorithms
Machine learning algorithms search for patterns in data. The math involved ranges from very simple to advanced depending on the particular algorithm, but math is critical throughout.
Recall that algorithms make assumptions, which, as discussed previously, form the inductive bias: they determine how the model extrapolates from the data it is trained on to new data.
The assumptions each algorithm depends on are typically chosen so that math can be applied within the framework they define.
For example, an algorithm may assume that a few key attributes can be used to classify whether or not an email is spam, that the best model will have the highest accuracy, and so on. Math can then rigorously be applied to find a great model within the assumptions provided.
Nothing guarantees that the assumptions are good; but if the assumptions are reasonable, math enables rigorous work within the framework they provide.
Math can then provide meaningful guarantees about the quality of the model produced based on the process used to obtain it. For example, it can be possible to calculate the amount of data needed to generate an effective model.
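One classic example of such a calculation is the PAC-learning sample bound for a finite set of candidate models; the specific numbers plugged in below are arbitrary:

```python
import math

def pac_sample_bound(hypothesis_count, epsilon, delta):
    """Classic PAC bound for a finite hypothesis class: with at least this
    many examples, a model consistent with the training data has error
    below `epsilon` with probability at least 1 - `delta`."""
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / epsilon)

# e.g. a million candidate models, 5% error tolerance, 99% confidence
print(pac_sample_bound(1_000_000, 0.05, 0.01))
```

Guarantees like this are worst-case: they must hold for every problem the assumptions allow, which is part of why they are so conservative.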
This math almost always drastically underestimates the performance of algorithms; they typically produce effective results with much less data than such a calculation indicates.
Math is critical: it enables rigorous processes with strong theoretical guarantees that it’ll work, which in turn allows us to build machine learning algorithms.
Algorithms have to make assumptions in order to generalize from data seen when the model is built to new data the model is used on. Theories tell us that it’s impossible to build a perfect model that will work on every conceivable problem, and that we need tons of data to get decent performance.
The world has more structure to it than an algorithm’s assumptions are able to take advantage of, though. If we could model all the complexity of the world and use that for machine learning, we might get much more accurate results—but then with a full model of the world we probably wouldn’t need machine learning!
The world is very structured. In theory, machine learning can take advantage of only a little of this structure, but in practice it gains the benefit of that structure, since it picks up on patterns in the data that reflect this real-world structure.
This results in things working better, or with much less data, than the math would indicate.
So, machine learning does much more than it theoretically should. And the theory that powers it may seem inadequate from this perspective.
Despite this apparent weakness, these theories have laid the foundation for many of the advances we see in the world around us today. They enable technologies like digital assistants and automatic translators that seemed like science fiction only a decade ago.
We should be thankful for these weak theories, and the structure in the world that enables them to work so well in practice.