Hyperplanes and You: Support Vector Machines

supraspatial decision-making

A core data science task is classification: sorting data points into groups based on shared qualities.

In a sense, it’s an exercise as old as life itself: as soon as the first protozoan developed sensory organs, it (accidentally) started to act differently based on various sensory stimuli.

On a higher biological level, it’s a monkey looking at an object hanging from a branch and deciding “food” or “not food”.
On a machine level, it’s your ML model combing through credit transactions and deciding “fraud” or “not fraud”.

You’ve probably heard of clustering as a technique for grouping data; it’s easy enough to visualize on a two-dimensional graph, or even with a Z axis added in.

It’s intuitive, since we move about in three, maybe four dimensions.

But your data may be a little more complex than that (as far as axes are concerned), and the moment you have 4 columns in your table, you’re in high-dimensional space.

How do you draw balanced class distinctions in data with 70 features? One clever answer is the support vector machine, a geometric classification technique built around hyperplanes, which can be thought of as “decision boundaries”.

Multiplanar Thinking

In short, SVMs classify data points by drawing hyperplanes that separate the classes with the widest possible margin.

Hyperplanes are much simpler than they sound: a “subspace whose dimension is one less than that of its ambient space”.
In our previous 2D examples, a hyperplane is a 1D line; in a graph with a Z axis, it’s a 2D plane.

Let’s start with two dimensions: a scatter of points from two classes. There are plenty of lines you could draw to separate them.

But some lines sort of… feel better than others, don’t they?

That’s your brain performing a bunch of visual-distance estimations & subconscious calculations. A lot of neurons are firing in some very complex ways to “balance” things intuitively.

SVMs are a way of mathematically formalizing this balancing. A hyperplane (in this case, a line) that feels good to you is likely one that maximizes the margins (overall distance) between itself and the closest data points of each class.

The reason we search for balanced classifiers is that the real world doesn’t always look like our training data, so we want our model to generalize well — it should learn enough from the dataset without overfitting on the unique minutiae.

To do so, we identify the support vectors, the data points of each class that lie closest to the boundary, and position the hyperplane so that the margin between it and those support vectors is as large as possible.
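Here’s a minimal sketch of that idea in scikit-learn, on a couple of made-up Gaussian blobs (toy data, not from any particular example): fit a linear SVC, then read off the hyperplane, the support vectors, and the margin width it maximizes.

```python
# A toy linear SVM: two separable blobs, then inspect the learned hyperplane.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two clusters of 20 points each, shifted apart so a line can separate them
X = np.r_[rng.randn(20, 2) - [2, 2], rng.randn(20, 2) + [2, 2]]
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# The separating hyperplane is w . x + b = 0
w, b = clf.coef_[0], clf.intercept_[0]
print("hyperplane normal w:", w, " intercept b:", b)

# Only the points nearest the boundary matter: the support vectors
print("support vectors:\n", clf.support_vectors_)

# The margin width is 2 / ||w||, which is exactly what the SVM maximizes
print("margin width:", 2 / np.linalg.norm(w))
```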

Getting Tricky

But how can we deal with messier data that doesn’t seem to fit neatly into a linear classification?

The simple answer is “take your data to the next dimension” in a very literal sense. Welcome to the kernel trick.

Kernel methods effectively create non-linear combinations of the base features and project the data into a higher-dimensional space, where a cleaner linear hyperplane can be drawn (without ever computing the new coordinates explicitly).

Remember, your computer has no issues reasoning in high dimensions, so data that looks hopelessly tangled in its original space can often become cleanly separable after the mapping.

This process works just as well in 10 dimensions, although it’s much harder to visualize.
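As a quick sketch of the effect (synthetic data, nothing fancy): concentric circles can’t be split by any straight line in 2D, but an RBF kernel, which implicitly lifts the points into a higher-dimensional space, separates them almost perfectly.

```python
# Kernel trick in action: linear vs RBF kernel on data with a circular boundary.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X_train, y_train)

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))  # roughly chance
print("RBF kernel accuracy:   ", rbf_svm.score(X_test, y_test))     # close to 1.0
```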

The best way to start coding one of these up is the official scikit-learn examples. They’re quite smooth, really; most of the code is just plotting the results in an intuitive fashion.

This example demonstrates three types of kernels: radial basis function (RBF), linear, and polynomial. X is an ndarray of shape (40, 1) and y of shape (40,); we don’t bother too much with tuning the parameters C, gamma, or epsilon.
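For reference, the gist of that example (paraphrased here rather than copied line for line, and with the plotting left out) looks roughly like this:

```python
# Roughly the setup of scikit-learn's SVR example: 40 noisy sine samples,
# fit with RBF, linear and polynomial kernels.
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(40, 1), axis=0)   # shape (40, 1)
y = np.sin(X).ravel()                      # shape (40,)
y[::5] += 3 * (0.5 - rng.rand(8))          # add noise to every 5th target

svr_rbf = SVR(kernel="rbf", C=100, gamma=0.1, epsilon=0.1)
svr_lin = SVR(kernel="linear", C=100)
svr_poly = SVR(kernel="poly", C=100, degree=3, epsilon=0.1, coef0=1)

for name, model in [("RBF", svr_rbf), ("Linear", svr_lin), ("Polynomial", svr_poly)]:
    model.fit(X, y)
    print(f"{name} kernel, training R^2: {model.score(X, y):.3f}")
```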

Tuning is actually fairly important for SVMs, especially C, which dictates how “sensitive” the machine is when finding an optimal hyperplane.

C is essentially a ‘slack’ penalty that controls how many misclassifications are tolerated in exchange for a bigger margin. A smaller C usually helps against overfitting, since it allows more slack and leads to larger, more general margins.
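A quick sketch of that tradeoff, on overlapping blobs where some misclassification is unavoidable: as C shrinks, the margin widens and more points end up acting as support vectors.

```python
# How C trades slack for margin width on overlapping classes.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(42)
X = np.r_[rng.randn(50, 2) - [1, 1], rng.randn(50, 2) + [1, 1]]
y = np.array([0] * 50 + [1] * 50)

for C in (0.01, 1, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:<6} margin width={margin:.2f}  support vectors={len(clf.support_vectors_)}")
```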

Saber Moazami gives a neat example of these differences, plotting a grid search that minimizes mean squared error while balancing C, gamma, and epsilon against one another.
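That’s not his exact code, but the same idea is easy to sketch with GridSearchCV: search over C, gamma, and epsilon for an RBF-kernel SVR, scoring each combination by (negated) mean squared error.

```python
# A rough sketch of tuning an SVR by grid search over C, gamma and epsilon.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(40, 1), axis=0)
y = np.sin(X).ravel()

param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [0.01, 0.1, 1],
    "epsilon": [0.01, 0.1, 0.5],
}
search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid,
    scoring="neg_mean_squared_error",  # GridSearchCV maximizes, so MSE is negated
    cv=5,
)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best mean squared error:", -search.best_score_)
```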

Overall, SVMs are a powerful tool for classification. Scikit-learn’s SVC comes with an additional probability parameter, which enables probability scores for each class (calibrated from each point’s distance to the hyperplane), allowing for SVM-based prediction models.
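A minimal sketch of that last point: with probability=True, SVC fits an internal calibration (Platt scaling) so that predict_proba returns per-class probability estimates alongside the hard predictions.

```python
# SVC with probability estimates enabled.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", probability=True, random_state=0).fit(X_train, y_train)
print("predicted classes:", clf.predict(X_test[:5]))
print("class probabilities:\n", clf.predict_proba(X_test[:5]))
```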

data scientist, machine learning engineer. passionate about ecology, biotech and AI. https://www.linkedin.com/in/mark-s-cleverley/
