Spam filtering using an ensemble of heterogeneous algorithms

We will use the SMS Spam Collection dataset from the UCI ML repository to build a spam classifier that labels each message as either spam or ham (a legitimate message). Several different classifiers can be used for this task.

In this example, we opt for algorithms such as Naive Bayes, random forest, and support vector machines to train our models.
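The text does not say here how the three models' outputs are combined, so the following is only a minimal sketch of one common option, majority (hard) voting. The function name `majority_vote` and the prediction lists are hypothetical and stand in for the outputs of trained Naive Bayes, random forest, and SVM models.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists (all the same length) by majority vote."""
    combined = []
    # zip(*predictions) groups the labels that each model gave to the same message.
    for labels in zip(*predictions):
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined


# Hypothetical predictions for four messages (illustrative, not real results).
nb_preds = ["spam", "ham", "ham", "spam"]    # Naive Bayes
rf_preds = ["spam", "ham", "spam", "ham"]    # random forest
svm_preds = ["ham", "ham", "spam", "spam"]   # support vector machine

print(majority_vote([nb_preds, rf_preds, svm_preds]))
# ['spam', 'ham', 'spam', 'spam']
```

With three models and two classes, a tie cannot occur, so every message gets a clear label.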

We prepare our data using various data-cleaning and preparation mechanisms. To preprocess our data, we will perform the following sequence:

Convert all text to lowercase
Remove punctuation
Remove stop words
Perform stemming
Tokenize the data
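The steps above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the chapter's actual code: the stop-word list is a tiny hypothetical sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's. The sketch splits the text into tokens before removing stop words, because stop-word filtering works on individual tokens.

```python
import string

# Assumption: a tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and",
              "you", "your", "for", "in", "on"}

# Assumption: crude suffix stripping stands in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip one common suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(message):
    """Lowercase, strip punctuation, tokenize, drop stop words, and stem."""
    text = message.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


print(preprocess("WINNER!! You are selected for a FREE prize, claiming is easy."))
# ['winner', 'select', 'free', 'prize', 'claim', 'easy']
```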

We also transform our data using term frequency-inverse document frequency (TF-IDF). The term frequency (TF) component tells us how often a word appears in a message or a document. TF is calculated as:

TF = No. of times a word appears in a document / Total No. of words in the document
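As a quick illustration of this formula, the following sketch computes TF for every word in a tokenized message. The function name `term_frequency` and the sample tokens are hypothetical.

```python
from collections import Counter


def term_frequency(tokens):
    """Return {word: occurrences of word / total number of tokens}."""
    counts = Counter(tokens)
    total = len(tokens)
    # An empty token list yields an empty dict, so no division by zero occurs.
    return {word: count / total for word, count in counts.items()}


tf = term_frequency(["free", "prize", "call", "free"])
print(tf["free"])   # 2 occurrences out of 4 tokens: 0.5
print(tf["prize"])  # 1 occurrence out of 4 tokens: 0.25
```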

TF-IDF numerically scores the importance of a word to a document within a collection of documents. Simply put, a word scores high when it appears often in a given document but rarely across the rest of the collection, and a word that is common to most documents scores low. The mathematical representation of TF-IDF is as follows:

tfidf(w, d, D) = tf(w, d) × idf(w, D)

where w represents the word, d represents a document and D represents the collection of documents.
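The text does not define idf, so the sketch below assumes the common unsmoothed form idf(w, D) = log(|D| / number of documents containing w). Libraries such as scikit-learn apply smoothed variants by default, so their scores will differ. The three tokenized documents are made up for illustration.

```python
import math
from collections import Counter


def tf(word, doc):
    """Term frequency: occurrences of word in doc / total words in doc."""
    return Counter(doc)[word] / len(doc)


def idf(word, docs):
    """Assumed form: log(number of documents / documents containing word)."""
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing) if containing else 0.0


def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)


# Hypothetical tokenized messages.
docs = [
    ["free", "prize", "call", "now"],
    ["call", "me", "later"],
    ["see", "you", "later"],
]

# "free" and "call" have the same TF in the first message, but "free" occurs
# in only one message of the collection, so it receives the higher score.
print(tfidf("free", docs[0], docs))
print(tfidf("call", docs[0], docs))
```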

In this example, we'll use the SMS Spam Collection dataset, which contains labelled messages gathered for cellphone spam research. The dataset is available in the UCI ML repository and is also provided in the GitHub repository.
