Spam filtering using an ensemble of heterogeneous algorithms

We will use the SMS Spam Collection dataset from the UCI ML repository to create a spam classifier. The classifier labels each message as either spam or ham (a legitimate message), and several different classifiers can be used for this task.

In this example, we opt for algorithms such as Naive Bayes, random forest, and support vector machines to train our models.
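One way to combine these three algorithms is majority voting, sketched below with scikit-learn. This is a minimal illustration under stated assumptions, not necessarily the setup used later in this example: the toy messages are invented, and the choice of hard voting, a linear SVM kernel, and 50 trees are all assumptions.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy messages invented for illustration; not taken from the real dataset.
messages = [
    "win a free prize now",
    "free cash prize claim now",
    "urgent claim your free reward",
    "win cash now call free",
    "are we meeting for lunch today",
    "see you at home tonight",
    "can you call me after lunch",
    "meeting moved to tonight see you",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# Three heterogeneous classifiers; each casts one vote per message.
ensemble = VotingClassifier(
    estimators=[
        ("nb", MultinomialNB()),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svm", SVC(kernel="linear")),
    ],
    voting="hard",  # assumption: simple majority vote
)

# Vectorize the raw text with TF-IDF, then feed the features to the ensemble.
model = make_pipeline(TfidfVectorizer(), ensemble)
model.fit(messages, labels)

predictions = model.predict(["claim your free prize", "see you at lunch"])
print(predictions)
```

Hard voting returns the label chosen by most of the three models, so one model's mistake can be outvoted by the other two.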

We prepare our data using various data-cleaning and preparation mechanisms. To preprocess our data, we will perform the following sequence:

  1. Convert all text to lowercase
  2. Remove punctuation
  3. Remove stop words
  4. Perform stemming
  5. Tokenize the data
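The steps above can be sketched in plain Python as follows. This is a simplified illustration: the stop-word list is a tiny hypothetical sample, `naive_stem` is a crude stand-in for a real stemmer such as Porter's, and the text is split on whitespace before stop-word removal for convenience, so the returned list is the tokenized result.

```python
import string

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "you", "your", "for", "and", "of", "in"}


def naive_stem(word):
    """Crude suffix stripping, a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(message):
    text = message.lower()                                             # 1. lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))   # 2. remove punctuation
    words = [w for w in text.split() if w not in STOP_WORDS]           # 3. remove stop words
    return [naive_stem(w) for w in words]                              # 4-5. stem, return tokens


tokens = preprocess("You are CALLING for free prizes, claimed today!")
print(tokens)  # ['call', 'free', 'prize', 'claim', 'today']
```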

We also process our data using term frequency-inverse document frequency (TF-IDF). The term frequency (TF) component tells us how often a word appears in a message or a document. TF is calculated as:

TF = No. of times a word appears in a document / Total No. of words in the document
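As a small worked example of this formula, the sketch below computes TF for every word in a tokenized message. The function name `term_frequency` and the sample tokens are illustrative only.

```python
from collections import Counter


def term_frequency(tokens):
    """Return {word: count / total number of words} for one document."""
    total = len(tokens)
    counts = Counter(tokens)
    return {word: count / total for word, count in counts.items()}


tf = term_frequency(["free", "prize", "free", "call"])
print(tf["free"])   # 2 of the 4 tokens are "free", so 0.5
print(tf["prize"])  # 1 of the 4 tokens is "prize", so 0.25
```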

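TF-IDF numerically scores the importance of a word to a document within a collection of documents. The score rises with how often the word appears in that document and falls with the number of documents in the collection that contain the word. Simply put, a high TF-IDF score marks a term that is frequent in one document but rare across the collection, while a low score marks a term that is common everywhere. The mathematical representation of TF-IDF is as follows: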

tfidf(w, d, D) = tf(w, d) × idf(w, D)

where w represents the word, d represents a document and D represents the collection of documents.
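The sketch below evaluates this product on a toy collection. The text does not spell out the IDF term, so the code assumes one common definition, idf(w, D) = log(N / number of documents containing w), where N is the number of documents in D. Libraries often use smoothed variants, so their scores may differ.

```python
import math


def tf(word, doc):
    """Share of the document's tokens that are `word`."""
    return doc.count(word) / len(doc)


def idf(word, docs):
    """Assumed definition: log(N / document frequency).

    Expects `word` to occur in at least one document.
    """
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)


def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)


# Toy tokenized documents, invented for illustration.
docs = [
    ["free", "prize", "call", "now"],
    ["call", "me", "now"],
    ["see", "you", "now"],
]

print(tfidf("now", docs[0], docs))    # "now" is in every document: 0.25 * log(1) = 0.0
print(tfidf("prize", docs[0], docs))  # "prize" is in one document: 0.25 * log(3)
```

A word that occurs in every document scores zero, while a word confined to a single document gets the largest weight for the same term frequency.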

In this example, we'll use the SMS Spam Collection dataset, which contains labelled messages gathered for cellphone spam research. This dataset is available in the UCI ML repository and is also provided in the GitHub repository.
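A minimal sketch of reading such labelled messages is shown below. It assumes each line holds a label (ham or spam) and the message text separated by a tab, with no header row; the two sample lines are invented, and an in-memory string stands in for the real file.

```python
import csv
import io

# Invented sample in the assumed "label<TAB>message" layout.
sample = "ham\tSee you at lunch today\nspam\tWIN a FREE prize, call now!\n"


def read_messages(handle):
    """Return a list of (label, text) pairs from a tab-separated handle."""
    # QUOTE_NONE keeps quotation marks inside messages as ordinary characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


pairs = read_messages(io.StringIO(sample))
print(pairs)
```

To read from disk instead, the same function can be passed an open file handle in place of the `io.StringIO` object.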
