Spam, or electronic spam, refers to unsolicited messages that typically carry advertising content, infected attachments, or links to phishing or malware sites. While the most widely recognized form of spam is e-mail spam, spam appears in other media as well: website comments, instant messaging, Internet forums, blogs, online ads, and so on.
In this chapter, we will discuss how to build a naive Bayesian spam filter that uses the bag-of-words representation to identify spam e-mails. Naive Bayes spam filtering is one of the basic techniques implemented in the first commercial spam filters; for instance, the Mozilla Thunderbird mail client uses a native implementation of such filtering. While the example in this chapter uses e-mail spam, the underlying methodology can be applied to other types of text-based spam as well.
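Before turning to Mallet, the core idea can be sketched in plain Java. A message is scored against each class by summing the log-likelihoods of its words under that class (with add-one smoothing) plus the log class prior; the class with the higher score wins. The word counts below are made up purely for illustration:

```java
import java.util.Map;

public class NaiveBayesSketch {

    // Log prior + sum of log word likelihoods with add-one (Laplace) smoothing.
    static double score(String[] words, Map<String, Integer> counts, double prior) {
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        int vocab = counts.size();
        double s = Math.log(prior);
        for (String w : words) {
            int c = counts.getOrDefault(w, 0);
            s += Math.log((c + 1.0) / (total + vocab));
        }
        return s;
    }

    public static void main(String[] args) {
        // Hypothetical word counts from a tiny training corpus.
        Map<String, Integer> spamCounts = Map.of("free", 30, "money", 25, "meeting", 2);
        Map<String, Integer> hamCounts  = Map.of("free", 3, "money", 4, "meeting", 40);
        String[] message = {"free", "money"};

        double spamScore = score(message, spamCounts, 0.5);
        double hamScore  = score(message, hamCounts, 0.5);
        System.out.println(spamScore > hamScore ? "spam" : "ham"); // prints "spam"
    }
}
```

Mallet automates exactly this kind of counting and scoring; the rest of the chapter builds the same model on real data.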
Androutsopoulos et al. (2000) collected one of the first e-mail spam datasets to benchmark spam-filtering algorithms. They studied how the naive Bayes classifier can be used to detect spam, and whether additional preprocessing steps such as a stop list, a stemmer, and lemmatization contribute to better performance. The dataset was reorganized by Andrew Ng for OpenClassroom's machine learning class, available for download at http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex6/ex6.html.
Select and download the second option, ex6DataEmails.zip, as shown in the following image:
The ZIP contains four folders (Ng, 2015):
- The nonspam-train and spam-train folders contain the pre-processed e-mails that you will use for training. They have 350 e-mails each.
- The nonspam-test and spam-test folders constitute the test set, containing 130 spam and 130 nonspam e-mails. These are the documents that you will make predictions on. Notice that even though the separate folders tell you the correct labeling, you should make your predictions on all the test documents without this knowledge. After you make your predictions, you can use the correct labeling to check whether your classifications were correct.

To leverage Mallet's folder iterator, let's reorganize the folder structure as follows. Create two folders, train and test, and put the spam/nonspam folders under the corresponding folders. The initial folder structure is shown in the following image:

The final folder structure will be as shown in the following image:
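The reorganization can also be scripted. The sketch below uses the standard java.nio.file API and simulates the unzipped layout in a temporary directory; the folder names come from the dataset, while the renaming of each folder to plain spam/nonspam under train and test is an assumption that matches the labeling scheme used later:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReorganizeFolders {

    // Move base/src to base/split/label, creating the split folder if needed.
    static void move(Path base, String src, String split, String label) throws IOException {
        Path target = base.resolve(split).resolve(label);
        Files.createDirectories(target.getParent());
        Files.move(base.resolve(src), target);
    }

    public static void main(String[] args) throws IOException {
        // Simulate the unzipped ex6DataEmails layout in a temp directory.
        Path base = Files.createTempDirectory("ex6DataEmails");
        for (String d : new String[]{"nonspam-train", "spam-train", "nonspam-test", "spam-test"}) {
            Files.createDirectory(base.resolve(d));
        }

        // Move each folder under train/ or test/, renamed to its label name
        // (spam or nonspam) so folder names can later serve as class labels.
        move(base, "spam-train",    "train", "spam");
        move(base, "nonspam-train", "train", "nonspam");
        move(base, "spam-test",     "test",  "spam");
        move(base, "nonspam-test",  "test",  "nonspam");

        System.out.println(Files.isDirectory(base.resolve("train").resolve("spam"))); // prints "true"
    }
}
```

Doing this manually in a file manager works just as well; the point is only that the directory names end up encoding the class labels.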
The next step is to transform e-mail messages to feature vectors.
Create a default pipeline as described previously:
ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
pipeList.add(new Input2CharSequence("UTF-8"));
Pattern tokenPattern = Pattern.compile("[\\p{L}\\p{N}_]+");
pipeList.add(new CharSequence2TokenSequence(tokenPattern));
pipeList.add(new TokenSequenceLowercase());
pipeList.add(new TokenSequenceRemoveStopwords(new File(stopListFilePath), "utf-8", false, false, false));
pipeList.add(new TokenSequence2FeatureSequence());
pipeList.add(new FeatureSequence2FeatureVector());
pipeList.add(new Target2Label());
SerialPipes pipeline = new SerialPipes(pipeList);
Note that we added an additional FeatureSequence2FeatureVector
pipe that transforms a feature sequence into a feature vector. Once the data is represented as feature vectors, we can apply any classification algorithm, as we saw in the previous chapters. We'll continue our example in Mallet to demonstrate how to build a classification model.
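Conceptually, this last step turns an ordered token sequence into an unordered bag-of-words count vector. A minimal plain-Java sketch of the idea (the token array is a made-up example):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BagOfWords {

    // Count token occurrences, discarding word order -- the essence of a
    // bag-of-words feature vector.
    static Map<String, Integer> toFeatureVector(String[] tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] tokens = {"free", "money", "free", "offer"};
        System.out.println(toFeatureVector(tokens)); // prints "{free=2, money=1, offer=1}"
    }
}
```

Mallet stores the same information sparsely, indexing words through an alphabet rather than by string, but the content of the vector is the same.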
Next, initialize a folder iterator to load our examples from the train folder, which comprises e-mail examples in the spam and nonspam subfolders; the subfolder names will be used as the example labels:
FileIterator folderIterator = new FileIterator(
    new File[] {new File(dataFolderPath)},
    new TxtFilter(),
    FileIterator.LAST_DIRECTORY);
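The FileIterator.LAST_DIRECTORY pattern tells Mallet to use the name of the directory that directly contains each file as the instance label. The idea can be sketched with the standard Path API (file names here are hypothetical):

```java
import java.nio.file.Path;

public class LabelFromFolder {

    // Derive the class label from the file's immediate parent directory name.
    static String labelOf(Path file) {
        return file.getParent().getFileName().toString();
    }

    public static void main(String[] args) {
        System.out.println(labelOf(Path.of("train", "spam", "0001.txt")));    // prints "spam"
        System.out.println(labelOf(Path.of("train", "nonspam", "0002.txt"))); // prints "nonspam"
    }
}
```

This is why the folder reorganization above matters: the directory layout is what supplies the training labels.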
Construct a new instance list with the pipeline that we want to use to process the text:
InstanceList instances = new InstanceList(pipeline);
Finally, process each instance provided by the iterator:
instances.addThruPipe(folderIterator);
We have now loaded the data and transformed it into feature vectors. Let's train our model on the training set and predict the spam/nonspam
classification on the test
set.
Mallet implements a set of classifiers in the cc.mallet.classify
package, including decision trees, naive Bayes, AdaBoost, bagging, boosting, and many others. We'll start with a basic classifier: naive Bayes. A classifier is trained by a ClassifierTrainer class, which returns a classifier when we invoke its train(InstanceList) method:
ClassifierTrainer classifierTrainer = new NaiveBayesTrainer();
Classifier classifier = classifierTrainer.train(instances);
Now let's see how this classifier works and evaluate its performance on a separate dataset.
To evaluate the classifier on a separate dataset, let's start by importing the e-mails located in our test
folder:
InstanceList testInstances = new InstanceList(classifier.getInstancePipe());
folderIterator = new FileIterator(
    new File[] {new File(testFolderPath)},
    new TxtFilter(),
    FileIterator.LAST_DIRECTORY);
We will pass the data through the same pipeline that we initialized during training:
testInstances.addThruPipe(folderIterator);
To evaluate classifier performance, we'll use the cc.mallet.classify.Trial
class, which is initialized with a classifier and a set of test instances:
Trial trial = new Trial(classifier, testInstances);
The evaluation is performed immediately at initialization. We can then simply extract the measures that we care about. In our example, we'd like to check the precision and recall for classifying spam e-mail messages, as well as the F-measure, which is the harmonic mean of the two:
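The arithmetic behind these measures is easy to check by hand. The confusion counts below are hypothetical (chosen to be consistent with a 130-message spam test set), not values reported by Mallet:

```java
public class PrecisionRecallF1 {
    public static void main(String[] args) {
        // Hypothetical confusion counts: tp = spam correctly flagged,
        // fp = legitimate mail wrongly flagged, fn = spam missed.
        int tp = 127, fp = 4, fn = 3;

        double precision = (double) tp / (tp + fp); // how often a "spam" flag is right
        double recall    = (double) tp / (tp + fn); // how much of the spam is caught
        double f1 = 2 * precision * recall / (precision + recall); // harmonic mean

        System.out.printf("Precision: %.4f%n", precision); // prints "Precision: 0.9695"
        System.out.printf("Recall:    %.4f%n", recall);    // prints "Recall:    0.9769"
        System.out.printf("F1:        %.4f%n", f1);        // prints "F1:        0.9732"
    }
}
```

The harmonic mean punishes imbalance: a model with perfect recall but poor precision (or vice versa) still gets a low F1.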
System.out.println("F1 for class 'spam': " + trial.getF1("spam"));
System.out.println("Precision: " + trial.getPrecision(1));
System.out.println("Recall: " + trial.getRecall(1));
The evaluation object outputs the following results:
F1 for class 'spam': 0.9731800766283524
Precision: 0.9694656488549618
Recall: 0.9769230769230769
The results show that the model correctly discovers 97.69% of spam messages (recall), and when it marks an e-mail as spam, it is correct in 96.94% of cases (precision). In other words, it misses about 2 of every 100 spam messages, and about 3 of every 100 messages that it flags as spam are actually legitimate. Not perfect, but more than a good start!