Support Vector Machines

We're going to be using a new classifier in this chapter: a linear Support Vector Machine (SVM). An SVM is an algorithm that attempts to linearly separate data points into classes using a maximum-margin hyperplane. That's a mouthful, so let's look at what it really means.

Suppose we have two classes of data, and we want to separate them with a line. (We'll just deal with two features, or dimensions, here.) What is the most effective way to place that line? Let's have a look at an illustration:

In the preceding diagram, line H1 does not effectively discriminate between the two classes, so we can eliminate that one. Line H2 discriminates between them cleanly, but H3 is the maximum-margin line. This means that the line is centered between the nearest points of each class, which are known as the support vectors. These points lie on the dotted lines in the following diagram:
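To make this concrete, here is a small sketch that is not part of the chapter's pipeline; the toy points are invented, and it uses SVC with a linear kernel (rather than the LinearSVC we'll use shortly) only because SVC exposes the support vectors directly:

import numpy as np
from sklearn.svm import SVC

# Two small, well-separated clusters of two-dimensional points
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [5.0, 5.0], [5.5, 6.0], [6.0, 5.5]])
y = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel='linear').fit(X, y)

# The maximum-margin line is w[0]*x1 + w[1]*x2 + b = 0
w, b = svm.coef_[0], svm.intercept_[0]
print(w, b)

# The points closest to that line -- the support vectors
print(svm.support_vectors_)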

What if the data can't be separated into classes so neatly? What if the points overlap? In that situation, there are still options. One is to use what's called a soft-margin SVM. This formulation still maximizes the margin, but with the trade-off of a penalty for points that fall on the wrong side of it. The other option is to use what's called the kernel trick. This method transforms the data into a higher-dimensional space where it can be linearly separated. An example is provided here:

The two-dimensional representation is as follows:

We have taken a one-dimensional feature space and mapped it onto a two-dimensional feature space. The mapping simply takes each x value and maps it to the pair (x, x²). Doing so allows us to separate the classes with a linear boundary.
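As a rough sketch of that idea (the numbers below are invented for illustration), these one-dimensional points can't be split by any single threshold on x, but after mapping each x to (x, x²), a linear SVM separates them perfectly:

import numpy as np
from sklearn.svm import SVC

# One-dimensional data: the 0 class sits between two groups of 1s,
# so no single cut-off on x can separate the classes
x = np.array([-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0])
y = np.array([1, 1, 1, 0, 0, 0, 1, 1, 1])

# Map each value x to the pair (x, x**2)
X_mapped = np.column_stack([x, x ** 2])

# In the new space, a straight line separates the classes
svm = SVC(kernel='linear').fit(X_mapped, y)
print(svm.score(X_mapped, y))  # 1.0 -- the mapped data is linearly separable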

With that covered, let's now feed our tf-idf matrix into our SVM:

from sklearn.svm import LinearSVC

# Train a linear SVM on the tf-idf matrix and our labels
clf = LinearSVC()
model = clf.fit(tv, df['wanted'])

tv is our matrix, and df['wanted'] is our list of labels. Remember, each label is either y or n, denoting whether or not we are interested in the article. Once that runs, our model is trained.
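If you want to sanity-check the model on a new piece of text, you would transform it with the same fitted vectorizer that produced tv and then call predict. The sketch below is an assumption, not part of the chapter's code: the vectorizer name vect and the sample text are placeholders.

# Sketch only: `vect` stands in for the fitted TfidfVectorizer from earlier
new_text = ['sample headline and body text of a new article']
new_matrix = vect.transform(new_text)   # use transform, not fit_transform
print(model.predict(new_matrix))        # e.g. ['y'] or ['n']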

One thing we aren't doing in this chapter is formally evaluating our model. You should almost always have a hold-out set to evaluate your model against, but because we are going to be continuously updating our model, and evaluating it daily, we'll skip that step for this chapter. Just remember that this is generally a terrible idea.
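For reference, a hold-out check would look something like the sketch below. It isn't part of the chapter's flow, and the split size and random seed are arbitrary choices:

from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Hold back 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    tv, df['wanted'], test_size=0.2, random_state=42)

holdout_clf = LinearSVC().fit(X_train, y_train)
print(holdout_clf.score(X_test, y_test))  # fraction of held-out articles classified correctly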

Let's now move on to setting up our daily feed of news items.
