Spam filtering

Our first problem is a modern version of the canonical binary classification problem: spam classification. In our version, however, we will classify spam and ham SMS messages rather than e-mail. We will extract TF-IDF features from the messages using techniques you learned in Chapter 3, Feature Extraction and Preprocessing, and classify the messages using logistic regression.

We will use the SMS Spam Collection Data Set from the UCI Machine Learning Repository. The dataset can be downloaded from http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection. First, let's explore the data set and calculate some basic summary statistics using pandas:

>>> import pandas as pd
>>> df = pd.read_csv('data/SMSSpamCollection', delimiter='\t', header=None)
>>> print(df.head())

      0                                                  1
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
[5 rows x 2 columns]

>>> print('Number of spam messages:', df[df[0] == 'spam'][0].count())
>>> print('Number of ham messages:', df[df[0] == 'ham'][0].count())

Number of spam messages: 747
Number of ham messages: 4825

Each row comprises a binary label and a text message. The data set contains 5,574 instances; 4,827 messages are ham and the remaining 747 messages are spam. In this data set the messages are labeled with the strings 'ham' and 'spam'. When the labels are encoded numerically, the noteworthy, or case, outcome is conventionally assigned the label one and the non-case outcome is assigned zero, but these assignments are arbitrary. Inspecting the data may reveal other attributes that should be captured in the model. The following selection of messages characterizes both of the classes:

Spam: Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
Spam: WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.
Ham: Sorry my roommates took forever, it ok if I come by now?
Ham: Finished class where are you.
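When the string labels need to be mapped to the zero/one convention described above, scikit-learn's LabelEncoder can do the encoding. The following is a minimal sketch with a toy label list standing in for the dataset's label column; note that LabelEncoder sorts the classes alphabetically, so 'ham' happens to map to zero and 'spam' to one:

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for df[0]
labels = ['ham', 'ham', 'spam', 'ham', 'spam']

encoder = LabelEncoder()
y = encoder.fit_transform(labels)  # classes are sorted alphabetically

print(list(encoder.classes_))  # ['ham', 'spam']
print(list(y))                 # [0, 0, 1, 0, 1]
```

LogisticRegression accepts string labels directly, so this step is optional here; it matters when a metric or model requires numeric targets.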

Let's make some predictions using scikit-learn's LogisticRegression class:

>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.model_selection import train_test_split, cross_val_score

First, we load the .csv file using pandas and split the data set into training and test sets. By default, train_test_split() assigns 75 percent of the samples to the training set and allocates the remaining 25 percent of the samples to the test set:

>>> df = pd.read_csv('data/SMSSpamCollection', delimiter='\t', header=None)
>>> X_train_raw, X_test_raw, y_train, y_test = train_test_split(df[1], df[0])
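The split above is random, so repeated runs produce different partitions. train_test_split also accepts a random_state argument to make the split reproducible and a stratify argument to preserve the ham/spam ratio in both sets. The following sketch, using toy data rather than the SMS corpus, illustrates both options:

```python
from sklearn.model_selection import train_test_split

# Toy corpus standing in for the SMS messages: 20% "spam"
messages = ['msg%d' % i for i in range(100)]
labels = ['spam' if i % 5 == 0 else 'ham' for i in range(100)]

X_tr, X_te, y_tr, y_te = train_test_split(
    messages, labels,
    test_size=0.25,     # the default: 75% train, 25% test
    random_state=0,     # makes the split reproducible
    stratify=labels)    # preserve the class ratio in both sets

print(len(X_tr), len(X_te))    # 75 25
print(y_te.count('spam'))      # 5, i.e. 20% of the test set
```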

Next, we create a TfidfVectorizer. Recall from Chapter 3, Feature Extraction and Preprocessing, that TfidfVectorizer combines CountVectorizer and TfidfTransformer. We fit it with the training messages, and transform both the training and test messages:

>>> vectorizer = TfidfVectorizer()
>>> X_train = vectorizer.fit_transform(X_train_raw)
>>> X_test = vectorizer.transform(X_test_raw)
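The claim that TfidfVectorizer combines CountVectorizer and TfidfTransformer can be checked directly: with default parameters, the one-step and two-step pipelines produce the same weighted term matrix. A small sketch on a toy corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

corpus = ['free entry to win cup final',
          'ok joking with you',
          'free prize call now']

# One step: tokenize, count, and re-weight in a single transformer
one_step = TfidfVectorizer().fit_transform(corpus)

# Two steps: raw term counts, then TF-IDF re-weighting
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

print(np.allclose(one_step.toarray(), two_step.toarray()))  # True
```

Note that the vectorizer is fit only on the training messages; fitting it on the test messages as well would leak information about the test set's vocabulary into the model.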

Finally, we create an instance of LogisticRegression and train our model. Like LinearRegression, LogisticRegression implements the fit() and predict() methods. As a sanity check, we print a few predictions for manual inspection:

>>> classifier = LogisticRegression()
>>> classifier.fit(X_train, y_train)
>>> predictions = classifier.predict(X_test)
>>> for i, prediction in enumerate(predictions[:5]):
...     print('Prediction: %s. Message: %s' % (prediction, X_test_raw.values[i]))

The following is the output of the script:

Prediction: ham. Message: If you don't respond imma assume you're still asleep and imma start calling n shit
Prediction: spam. Message: HOT LIVE FANTASIES call now 08707500020 Just 20p per min NTT Ltd, PO Box 1327 Croydon CR9 5WB 0870 is a national rate call
Prediction: ham. Message: Yup... I havent been there before... You want to go for the yoga? I can call up to book 
Prediction: ham. Message: Hi, can i please get a  <#>  dollar loan from you. I.ll pay you back by mid february. Pls.
Prediction: ham. Message: Where do you need to go to get it?

How well does our classifier perform? The performance metrics we used for linear regression are inappropriate for this task. We are only interested in whether the predicted class was correct, not how far it was from the decision boundary. In the next section, we will discuss some performance metrics that can be used to evaluate binary classifiers.
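As a preview of those metrics, the simplest one, accuracy, is just the fraction of predictions that match the true labels. A minimal sketch with toy labels standing in for y_test and predictions:

```python
from sklearn.metrics import accuracy_score

# Toy true labels and predictions standing in for y_test and predictions
y_true = ['ham', 'ham', 'spam', 'spam', 'ham']
y_pred = ['ham', 'spam', 'spam', 'spam', 'ham']

# Fraction of messages whose predicted class was correct: 4 of 5
print(accuracy_score(y_true, y_pred))  # 0.8
```

As the next section explains, accuracy alone can be misleading on an imbalanced data set like this one, where always predicting 'ham' would already score about 87 percent.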
