Naive Bayes SMS spam classification example

A Naive Bayes classifier has been developed using the SMS spam collection data available at http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/. In this chapter, various NLP techniques have been discussed for preprocessing the data prior to building the Naive Bayes model:

>>> import csv 
 
>>> smsdata = open('SMSSpamCollection.txt','r') 
>>> csv_reader = csv.reader(smsdata,delimiter='\t') 

The following sys package lines can be used in case of any utf-8 errors encountered while reading the file with older versions of Python (Python 2.x); they are not necessary (and will not work) with recent versions such as Python 3.6:

>>> import sys 
>>> reload(sys) 
>>> sys.setdefaultencoding('utf-8') 

The regular code starts from here, as usual:

>>> smsdata_data = [] 
>>> smsdata_labels = [] 
 
>>> for line in csv_reader: 
...     smsdata_labels.append(line[0]) 
...     smsdata_data.append(line[1]) 
 
>>> smsdata.close() 

The following code prints the first five messages along with their labels:

>>> for i in range(5): 
...     print (smsdata_data[i],smsdata_labels[i])

After getting the preceding output, run the following code to count the class labels:

>>> from collections import Counter 
>>> c = Counter( smsdata_labels ) 
>>> print(c) 
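If the file has been read correctly, the printed counter should look like the following (the class counts are discussed next):

Counter({'ham': 4825, 'spam': 747})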

Out of 5,572 observations, 4,825 are ham messages (about 86.6 percent) and the remaining 747 are spam messages (about 13.4 percent).

Using NLP techniques, we preprocess the data to obtain finalized word vectors that map to the final outcomes, spam or ham. The major preprocessing stages involved are:

  • Removal of punctuation: Punctuation needs to be removed before applying any further processing. The punctuation characters from the string library are !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~, and they are removed from all the messages.
  • Word tokenization: Sentences are chunked into words based on white space for further processing.
  • Converting words into lowercase: Converting everything to lowercase removes duplicates such as Run and run, where the first occurs at the start of a sentence and the latter in the middle; these need to be unified because we are working with the bag-of-words technique.
  • Stop word removal: Stop words are words that repeat very frequently in text and yet do not add to the explanatory power of a sentence, for example I, me, you, this, that, and so on; they need to be removed before further processing.
  • Keeping words of length at least three: Here we have removed words with a length of less than three characters.
  • Stemming of words: The stemming process reduces words to their respective root forms, for example bringing running down to run, or runs to run. Stemming reduces duplicates and improves the accuracy of the model.
  • Part-of-speech (POS) tagging: This applies speech tags to words, such as noun, verb, adjective, and so on. For example, the POS tag for running is usually verb, whereas run can be a noun. In some situations running itself is a noun, and lemmatization will not bring the word down to the root word run; it keeps running as it is. Hence, POS tagging is a crucial step that needs to be performed prior to applying lemmatization so that a word is reduced to its correct root.
  • Lemmatization of words: Lemmatization is another process for reducing dimensionality. It brings a word down to its dictionary root word rather than just truncating it; for example, it brings ate down to its root word eat when we pass ate to the lemmatizer with the POS tag verb. A short illustration of the difference between stemming and lemmatization follows this list.
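To make the difference between stemming and lemmatization concrete before walking through the full function, here is a minimal sketch (not part of the original pipeline) using the same nltk classes; it assumes the wordnet corpus has already been downloaded via nltk.download('wordnet'):

>>> from nltk.stem import PorterStemmer, WordNetLemmatizer 

>>> stemmer = PorterStemmer() 
>>> lemmatizer = WordNetLemmatizer() 

# Stemming simply strips suffixes: both inflections collapse to "run" 
>>> print (stemmer.stem("running"), stemmer.stem("runs")) 
run run 

# Lemmatization maps a word to its dictionary root, guided by the POS tag 
>>> print (lemmatizer.lemmatize("ate", "v")) 
eat 
>>> print (lemmatizer.lemmatize("running", "v"), lemmatizer.lemmatize("running", "n")) 
run running 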

The nltk package has been utilized for all the preprocessing steps, as it provides all the necessary NLP functionality under one roof:

>>> import nltk 
>>> from nltk.corpus import stopwords 
>>> from nltk.stem import WordNetLemmatizer 
>>> import string 
>>> import pandas as pd 
>>> from nltk import pos_tag 
>>> from nltk.stem import PorterStemmer  

A function named preprocessing has been written that consists of all of these steps for convenience. However, we will explain each step as we go through it:

>>> def preprocessing(text): 

The following line of code checks each character of the text; if the character is a standard punctuation mark, it is replaced with a blank, otherwise it is kept as it is, and the result is rejoined into a single string:

...     text2 = " ".join("".join([" " if ch in string.punctuation else ch for ch in text]).split()) 

The following code tokenizes the sentences into words based on white space and puts them together as a list for applying further steps:

...     tokens = [word for sent in nltk.sent_tokenize(text2) for word in 
              nltk.word_tokenize(sent)] 

Converting all the cases (upper, lower, and proper) into lowercase reduces duplicates in the corpus:

...     tokens = [word.lower() for word in tokens] 

As mentioned earlier, stop words are words that do not carry much weight in understanding the sentence; they are mostly connecting words. We have removed them with the following lines of code:

...     stopwds = stopwords.words('english') 
...     tokens = [token for token in tokens if token not in stopwds]  

The following code keeps only the words with a length of at least three characters, removing short words that hardly carry any meaning:

...     tokens = [word for word in tokens if len(word)>=3] 

Stemming is applied to the words using the PorterStemmer function, which strips the extra suffixes from the words:

...     stemmer = PorterStemmer() 
...     tokens = [stemmer.stem(word) for word in tokens]  

POS tagging is a prerequisite for lemmatization: based on whether the word is a noun, a verb, and so on, the lemmatizer will reduce it to the root word:

...     tagged_corpus = pos_tag(tokens)     

The pos_tag function returns the part of speech in four formats for nouns and six formats for verbs: NN (noun, common, singular), NNP (noun, proper, singular), NNPS (noun, proper, plural), NNS (noun, common, plural), VB (verb, base form), VBD (verb, past tense), VBG (verb, present participle), VBN (verb, past participle), VBP (verb, present tense, not third person singular), and VBZ (verb, present tense, third person singular):

...     Noun_tags = ['NN','NNP','NNPS','NNS'] 
...     Verb_tags = ['VB','VBD','VBG','VBN','VBP','VBZ'] 
...     lemmatizer = WordNetLemmatizer() 

The prat_lemmatize function has been created only to deal with the mismatch between the tags returned by the pos_tag function and the input values expected by the lemmatize function. If the tag of a word falls under the respective noun or verb tag categories, 'n' or 'v' is passed to the lemmatize function accordingly:

...     def prat_lemmatize(token,tag): 
...         if tag in Noun_tags: 
...             return lemmatizer.lemmatize(token,'n') 
...         elif tag in Verb_tags: 
...             return lemmatizer.lemmatize(token,'v') 
...         else: 
...             return lemmatizer.lemmatize(token,'n') 

After performing tokenization and applying all the various operations, we need to join the tokens back together to form strings; the following line performs this:

...     pre_proc_text =  " ".join([prat_lemmatize(token,tag) for token,tag in tagged_corpus])              
...     return pre_proc_text 
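As a quick sanity check (not part of the original text), calling the function on a made-up message behaves as described in the preceding steps: punctuation, stop words, and short words are dropped, and the surviving tokens are stemmed and lemmatized, so the call should print something like the following:

>>> print (preprocessing("I was running to the stores!!!")) 
run store 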

The following step applies the preprocessing function to the data and generates a new corpus:

>>> smsdata_data_2 = [] 
>>> for i in smsdata_data: 
...     smsdata_data_2.append(preprocessing(i))  

The data will be split into train and test sets based on a 70-30 split and converted into NumPy arrays for applying machine learning algorithms:

>>> import numpy as np 
>>> trainset_size = int(round(len(smsdata_data_2)*0.70)) 
>>> print ('The training set size for this classifier is ' + str(trainset_size) + '\n') 
>>> x_train = np.array([''.join(rec) for rec in smsdata_data_2[0:trainset_size]]) 
>>> y_train = np.array([rec for rec in smsdata_labels[0:trainset_size]]) 
>>> x_test = np.array([''.join(rec) for rec in smsdata_data_2[trainset_size:len(smsdata_data_2)]]) 
>>> y_test = np.array([rec for rec in smsdata_labels[trainset_size:len(smsdata_labels)]]) 

The following code converts the words into a vectorizer format and applies term frequency-inverse document frequency (TF-IDF) weights, which is a way to increase the weight of words that occur frequently in a document while at the same time penalizing general terms that appear across many documents, such as the, him, at, and so on. In the following code, we have restricted the vocabulary to the 4,000 most frequent words; nonetheless, this parameter can also be tuned to check where better accuracies are obtained:

# building TFIDF vectorizer  
>>> from sklearn.feature_extraction.text import TfidfVectorizer 
>>> vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2),  stop_words='english',  
    max_features= 4000,strip_accents='unicode',  norm='l2') 
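For reference, with the default settings used here (smooth idf and L2 normalization), scikit-learn computes the weight of a term t in a document d roughly as tf(t, d) * (ln((1 + n) / (1 + df(t))) + 1), where n is the total number of documents and df(t) is the number of documents containing t, after which each document vector is rescaled to unit L2 length. The following toy corpus (invented purely for illustration, not part of the spam pipeline) shows that rarer terms receive larger idf values:

>>> toy_docs = ["free entry win cash", "call me now", "win cash now"] 
>>> toy_vec = TfidfVectorizer(norm='l2') 
>>> toy_tfidf = toy_vec.fit_transform(toy_docs) 
# "free" and "call" appear in only one document each, so their idf is higher 
# than that of "win", "cash", and "now", which each appear in two documents 
>>> print (dict(zip(toy_vec.get_feature_names(), toy_vec.idf_))) 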

The TF-IDF transformation is applied as follows to both train and test data. The todense function is used to convert the sparse matrix into a dense one so that its content can be inspected:

>>> x_train_2 = vectorizer.fit_transform(x_train).todense() 
>>> x_test_2 = vectorizer.transform(x_test).todense() 

The multinomial Naive Bayes classifier is suitable for classification with discrete features (for example, word counts), which normally involves a large number of features. However, in practice, fractional counts such as TF-IDF also work well. If we do not specify a Laplace estimator, it defaults to the value 1.0, which means 1.0 is added to each word count in the numerator and, correspondingly, the vocabulary size times 1.0 is added to the total count in the denominator.
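To see what this smoothing does, here is a hypothetical worked example (the counts below are invented purely for illustration); note that the default call MultinomialNB() used next is equivalent to MultinomialNB(alpha=1.0):

# Hypothetical word counts for the spam class over a 3-word vocabulary: 
#     free = 3, win = 2, call = 0   (total = 5, vocabulary size = 3, alpha = 1.0) 
# Without smoothing: P(call | spam) = 0/5 = 0, which would zero out any message containing "call" 
# With alpha = 1.0 : P(call | spam) = (0 + 1.0) / (5 + 1.0*3) = 0.125 
#                    P(free | spam) = (3 + 1.0) / (5 + 1.0*3) = 0.5 

The classifier is now fitted on the training data and used to score both the train and test sets: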

>>> from sklearn.naive_bayes import MultinomialNB 
>>> clf = MultinomialNB().fit(x_train_2, y_train) 
 
>>> ytrain_nb_predicted = clf.predict(x_train_2) 
>>> ytest_nb_predicted = clf.predict(x_test_2) 
 
>>> from sklearn.metrics import classification_report,accuracy_score 
 
>>> print ("
Naive Bayes - Train Confusion Matrix

",pd.crosstab(y_train, ytrain_nb_predicted,rownames = ["Actuall"],colnames = ["Predicted"]))       
>>> print ("
Naive Bayes- Train accuracy",round(accuracy_score(y_train, ytrain_nb_predicted),3)) 
>>> print ("
Naive Bayes  - Train Classification Report
",classification_report(y_train, ytrain_nb_predicted)) 
 
>>> print ("
Naive Bayes - Test Confusion Matrix

",pd.crosstab(y_test, ytest_nb_predicted,rownames = ["Actuall"],colnames = ["Predicted"]))       
>>> print ("
Naive Bayes- Test accuracy",round(accuracy_score(y_test, ytest_nb_predicted),3)) 
>>> print ("
Naive Bayes  - Test Classification Report
",classification_report( y_test, ytest_nb_predicted)) 

From the preceding results, it appears that Naive Bayes has produced excellent results, with a test accuracy of 96.6 percent, a significant recall of 76 percent for spam, and almost 100 percent for ham.

However, if we would like to check the top 10 features based on their coefficients from Naive Bayes, the following code comes in handy:

# printing top features  
>>> feature_names = vectorizer.get_feature_names() 
>>> coefs = clf.coef_ 
>>> intercept = clf.intercept_ 
>>> coefs_with_fns = sorted(zip(clf.coef_[0], feature_names)) 
 
>>> print ("

Top 10 features - both first & last
") 
>>> n=10 
>>> top_n_coefs = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1]) 
>>> for (coef_1, fn_1), (coef_2, fn_2) in top_n_coefs: 
...     print('	%.4f	%-15s		%.4f	%-15s' % (coef_1, fn_1, coef_2, fn_2)) 
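Note that in recent releases of scikit-learn, get_feature_names on the vectorizer and coef_ on MultinomialNB have been deprecated and later removed. If the preceding code raises an AttributeError, the following substitutions (an assumption based on the newer API, not part of the original example) should behave equivalently, since clf.coef_[0] mirrored the log probabilities of the second class (spam):

>>> feature_names = vectorizer.get_feature_names_out() 
>>> coefs_with_fns = sorted(zip(clf.feature_log_prob_[1], feature_names)) 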

Though the R language is not a popular choice for NLP processing, we present the equivalent code here. Readers are encouraged to change the code and see how the accuracies change to gain a better understanding of the concepts. The R code for the Naive Bayes classifier on the SMS spam/ham data is as follows:

# Naive Bayes 
smsdata = read.csv("SMSSpamCollection.csv",stringsAsFactors = FALSE)
# Try the following code for reading in case if you have
#issues while reading regularly with above code
#smsdata = read.csv("SMSSpamCollection.csv",
#stringsAsFactors = FALSE,fileEncoding="latin1")
str(smsdata)
smsdata$Type = as.factor(smsdata$Type)
table(smsdata$Type)

library(tm)
library(SnowballC)
# NLP Processing
sms_corpus <- Corpus(VectorSource(smsdata$SMS_Details))
corpus_clean_v1 <- tm_map(sms_corpus, removePunctuation)
corpus_clean_v2 <- tm_map(corpus_clean_v1, tolower)
corpus_clean_v3 <- tm_map(corpus_clean_v2, stripWhitespace)
corpus_clean_v4 <- tm_map(corpus_clean_v3, removeWords, stopwords())
corpus_clean_v5 <- tm_map(corpus_clean_v4, removeNumbers)
corpus_clean_v6 <- tm_map(corpus_clean_v5, stemDocument)

# Check the change in corpus
inspect(sms_corpus[1:3])
inspect(corpus_clean_v6[1:3])

sms_dtm <- DocumentTermMatrix(corpus_clean_v6)

smsdata_train <- smsdata[1:4169, ]
smsdata_test <- smsdata[4170:5572, ]

sms_dtm_train <- sms_dtm[1:4169, ]
sms_dtm_test <- sms_dtm[4170:5572, ]

sms_corpus_train <- corpus_clean_v6[1:4169]
sms_corpus_test <- corpus_clean_v6[4170:5572]

prop.table(table(smsdata_train$Type))
prop.table(table(smsdata_test$Type))
frac_trzero = (table(smsdata_train$Type)[[1]])/nrow(smsdata_train)
frac_trone = (table(smsdata_train$Type)[[2]])/nrow(smsdata_train)
frac_tszero = (table(smsdata_test$Type)[[1]])/nrow(smsdata_test)
frac_tsone = (table(smsdata_test$Type)[[2]])/nrow(smsdata_test)

Dictionary <- function(x) {
  if (is.character(x)) {
    return(x)
  }
  stop('x is not a character vector')
}
# Create the dictionary with words that appear at least 1 time
sms_dict <- Dictionary(findFreqTerms(sms_dtm_train, 1))
sms_train <- DocumentTermMatrix(sms_corpus_train,list(dictionary = sms_dict))
sms_test <- DocumentTermMatrix(sms_corpus_test,list(dictionary = sms_dict))
convert_tofactrs <- function(x) {
  x <- ifelse(x > 0, 1, 0)
  x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
  return(x)
}
sms_train <- apply(sms_train, MARGIN = 2, convert_tofactrs)
sms_test <- apply(sms_test, MARGIN = 2, convert_tofactrs)

# Application of Naïve Bayes Classifier with Laplace Estimator
library(e1071)
nb_fit <- naiveBayes(sms_train, smsdata_train$Type,laplace = 1.0)

tr_y_pred = predict(nb_fit, sms_train)
ts_y_pred = predict(nb_fit,sms_test)
tr_y_act = smsdata_train$Type;ts_y_act = smsdata_test$Type

tr_tble = table(tr_y_act,tr_y_pred)
print(paste("Train Confusion Matrix"))
print(tr_tble)

tr_acc = accrcy(tr_y_act,tr_y_pred)
trprec_zero = prec_zero(tr_y_act,tr_y_pred); trrecl_zero = recl_zero(tr_y_act,tr_y_pred)
trprec_one = prec_one(tr_y_act,tr_y_pred); trrecl_one = recl_one(tr_y_act,tr_y_pred)
trprec_ovll = trprec_zero *frac_trzero + trprec_one*frac_trone
trrecl_ovll = trrecl_zero *frac_trzero + trrecl_one*frac_trone

print(paste("Naive Bayes Train accuracy:",tr_acc))
print(paste("Naive Bayes - Train Classification Report"))
print(paste("Zero_Precision",trprec_zero,"Zero_Recall",trrecl_zero))
print(paste("One_Precision",trprec_one,"One_Recall",trrecl_one))
print(paste("Overall_Precision",round(trprec_ovll,4),"Overall_Recall",round(trrecl_ovll,4)))

ts_tble = table(ts_y_act,ts_y_pred)
print(paste("Test Confusion Matrix"))
print(ts_tble)

ts_acc = accrcy(ts_y_act,ts_y_pred)
tsprec_zero = prec_zero(ts_y_act,ts_y_pred); tsrecl_zero = recl_zero(ts_y_act,ts_y_pred)
tsprec_one = prec_one(ts_y_act,ts_y_pred); tsrecl_one = recl_one(ts_y_act,ts_y_pred)
tsprec_ovll = tsprec_zero *frac_tszero + tsprec_one*frac_tsone
tsrecl_ovll = tsrecl_zero *frac_tszero + tsrecl_one*frac_tsone

print(paste("Naive Bayes Test accuracy:",ts_acc))
print(paste("Naive Bayes - Test Classification Report"))
print(paste("Zero_Precision",tsprec_zero,"Zero_Recall",tsrecl_zero))
print(paste("One_Precision",tsprec_one,"One_Recall",tsrecl_one))
print(paste("Overall_Precision",round(tsprec_ovll,4),"Overall_Recall",round(tsrecl_ovll,4)))