How to do it...

Let's now move on to look at how to build our model.

  1. After reading the data, we use the head() function to take a look at it:
df_messages.head(3)

In the following screenshot, we can see that there are two columns: labels and message. The output is as follows:

  2. We then use the describe() function to look at a few metrics for each of the columns:
df_messages.describe()

This gives us the following metrics:

For the object datatype, describe() provides the metrics count, unique, top, and freq: top is the most common value, while freq is the frequency of that value.
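To see what these four metrics look like in practice, here is describe() run on a tiny made-up labels column (not the recipe's dataset):

```python
import pandas as pd

# A small hypothetical frame mimicking the labels column
toy = pd.DataFrame({"labels": ["ham", "ham", "spam"]})

stats = toy["labels"].describe()
print(stats)
# count: 3 non-null values, unique: 2 distinct labels,
# top: the most common value ("ham"), freq: its count (2)
```

Because the column holds strings (the object dtype), describe() reports these four metrics instead of numeric statistics such as mean and std.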
  3. We can also take a look at the metrics by message type, as follows:
df_messages.groupby('labels').describe()

With the preceding command, we see the count, number of unique values, and frequency for each class of the target variable:

  4. To analyze our dataset even further, let's take a look at the word count and the character count for each message:
df_messages['word_count'] = df_messages['message'].apply(lambda x: len(str(x).split()))
df_messages['character_count'] = df_messages['message'].str.len()

df_messages[['message','word_count', 'character_count']].head()
A lambda function creates a small, anonymous function in Python. It can take any number of arguments, but can only contain a single expression. Lambda functions are commonly passed as parameters to other functions, such as map(), apply(), reduce(), or filter().
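For instance, the word-count lambda used above is equivalent to passing a named function to apply(); a quick sketch on a throwaway Series (not the recipe's data):

```python
import pandas as pd

messages = pd.Series(["free prize now", "see you soon"])

# A named function...
def count_words(text):
    return len(str(text).split())

# ...and the equivalent anonymous lambda
named = messages.apply(count_words)
anonymous = messages.apply(lambda x: len(str(x).split()))

print(named.equals(anonymous))  # the two approaches give identical counts
```

The lambda form simply saves you from defining and naming a one-line helper.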

The output of the preceding code will look as follows:

  5. In this case, labels is our target variable, and we have two classes: spam and ham. We can see the distribution of spam and ham messages using a bar plot:
labels_count = pd.DataFrame(df_messages.groupby('labels')['message'].count())
labels_count.reset_index(inplace = True)
plt.figure(figsize=(4,4))
sns.barplot(x=labels_count['labels'], y=labels_count['message'])
plt.ylabel('Frequency', fontsize=12)
plt.xlabel('Labels', fontsize=12)
plt.show()

The following is the output of the preceding code:

  6. In the following code block, we will label spam as 1, and ham as 0:
# create a variable that holds a key-value pair for ham and spam
class_labels = {"ham":0,"spam":1}

# use the class_labels variable with map()
df_messages['labels']=df_messages['labels'].map(class_labels)
df_messages.head()

Notice that, in the following screenshot, under the labels variable, all ham and spam messages are now labelled as 0 and 1 respectively:
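One detail worth knowing about map(): any value missing from the dictionary silently becomes NaN rather than raising an error, so a typo in the labels column would quietly produce missing values. A small sketch with an invented stray capitalization:

```python
import pandas as pd

class_labels = {"ham": 0, "spam": 1}
labels = pd.Series(["ham", "spam", "Ham"])  # note the stray capital "H"

mapped = labels.map(class_labels)
print(mapped)  # the unmatched "Ham" maps to NaN
```

Checking mapped.isna().sum() after mapping is a cheap way to catch such mismatches before training.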

  7. We will now split our data into training and testing samples:
# Split your data into train & test set
X_train, X_test, Y_train, Y_test = train_test_split(df_messages['message'],
df_messages['labels'], test_size=0.2, random_state=1)
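With an imbalanced target like spam versus ham, you may also want to pass stratify so that both splits keep the original class proportions; a sketch on synthetic labels (not the recipe's data):

```python
from sklearn.model_selection import train_test_split

X = list(range(100))
y = [1 if i < 20 else 0 for i in X]  # 20% "spam", 80% "ham"

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

# Both splits preserve the 20/80 class ratio
print(sum(y_tr) / len(y_tr), sum(y_te) / len(y_te))
```

Without stratify, a random split can by chance leave the test set with noticeably more or fewer spam messages than the training set.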
  8. We need to convert the collection of messages to a matrix of token counts. This can be done using CountVectorizer():
# Creating an instance of the CountVectorizer class
# If 'english', a built-in stop word list for English is used.
# There are known issues with 'english' and you should consider an alternative
vectorizer = CountVectorizer(lowercase=True, stop_words='english', analyzer='word')

# Learn the vocabulary and build the document-term matrix with fit_transform()
vect_train = vectorizer.fit_transform(X_train)
  9. We proceed to build our model with the Naive Bayes algorithm:
# Create an instance of MultinomialNB()
model_nb = MultinomialNB()

# Fit your data to the model
model_nb.fit(vect_train,Y_train)

# Use predict() to predict target class
predict_train = model_nb.predict(vect_train)
  10. We load the required libraries for the evaluation metrics, as follows:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
  11. We now check our accuracy by evaluating the model with the training data:
# Calculate Train Accuracy
print('Accuracy score: {}'.format(accuracy_score(Y_train, predict_train)))

# Calculate other metrics on your train results
print('Precision score: {}'.format(precision_score(Y_train, predict_train)))
print('Recall score: {}'.format(recall_score(Y_train, predict_train)))
print('F1 score: {}'.format(f1_score(Y_train, predict_train)))

The output of this is as follows:
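As a reminder of what these four scores measure, here they are computed on a tiny hand-made set of labels and predictions (not the model's actual output):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]  # one missed spam, one false alarm

print(accuracy_score(y_true, y_pred))   # 4 of 6 predictions correct
print(precision_score(y_true, y_pred))  # 2 of 3 predicted spam are truly spam
print(recall_score(y_true, y_pred))     # 2 of 3 actual spam are caught
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```

For spam filtering, precision matters because a false alarm sends a legitimate message to the spam folder, while recall measures how much spam slips through.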

  12. Now we check how the model generalizes by evaluating it on the unseen test data:
# We apply the model into our test data
vect_test = vectorizer.transform(X_test)
prediction = model_nb.predict(vect_test)

# Calculate Test Accuracy
print('Accuracy score: {}'.format(accuracy_score(Y_test, prediction)))

# Calculate other metrics on your test data
print('Precision score: {}'.format(precision_score(Y_test, prediction)))
print('Recall score: {}'.format(recall_score(Y_test, prediction)))
print('F1 score: {}'.format(f1_score(Y_test, prediction)))

With the preceding code block, we print performance metrics as follows:

These results may vary with different samples and hyperparameters.
