How it works...

In Step 1, we looked at our dataset. In Steps 2 and 3, we examined the statistics for the ham and spam class labels. In Step 4, we extended our analysis by looking at the word count and the character count for each of the messages in our dataset. In Step 5, we saw the distribution of the two classes of our target variable (ham and spam), while in Step 6 we encoded these class labels as the numbers 1 and 0. In Step 7, we split our dataset into training and testing samples. In Step 8, we used CountVectorizer() from sklearn.feature_extraction.text to convert the collection of messages to a matrix of token counts.
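The following is a minimal sketch of Steps 6 through 8, assuming the messages live in a pandas DataFrame with message and label columns; the toy data, the column names, and the exact label mapping are illustrative rather than taken from the recipe:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative toy data standing in for the SMS dataset used in the recipe
df = pd.DataFrame({
    'message': ['Win a free prize now', 'Are we still meeting today?',
                'Claim your cash reward', 'Lunch at noon works for me',
                'Urgent: verify your account', 'See you at the gym later'],
    'label':   ['spam', 'ham', 'spam', 'ham', 'spam', 'ham'],
})

# Step 6: encode the class labels as 1 and 0 (the mapping here is illustrative)
df['label_num'] = df['label'].map({'spam': 1, 'ham': 0})

# Step 7: split into training and testing samples
X_train, X_test, y_train, y_test = train_test_split(
    df['message'], df['label_num'], test_size=0.2, random_state=1)

# Step 8: convert the messages to a matrix of token counts
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)  # learn the vocabulary and transform
X_test_counts = vectorizer.transform(X_test)        # reuse the same vocabulary
```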

If you do not provide a vocabulary in advance (via CountVectorizer's vocabulary parameter) and do not use an analyzer that does some kind of feature selection, then the number of features will be equal to the vocabulary size found by analyzing the data. For more information on this, see the following: https://bit.ly/1pBh3T1.
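As a quick illustration of this behavior (the documents and the fixed vocabulary here are made up for the example), fitting CountVectorizer without a vocabulary yields one feature per distinct token found in the data, whereas supplying the vocabulary parameter fixes the feature count in advance:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['free prize waiting', 'meeting at noon', 'free meeting invite']

# No vocabulary supplied: the features are learned from the data itself
cv_learned = CountVectorizer()
X_learned = cv_learned.fit_transform(docs)
print(X_learned.shape[1], len(cv_learned.vocabulary_))  # both equal the learned vocabulary size

# Vocabulary supplied in advance: the feature count is fixed by that vocabulary
cv_fixed = CountVectorizer(vocabulary=['free', 'prize', 'meeting'])
X_fixed = cv_fixed.fit_transform(docs)
print(X_fixed.shape[1])  # 3, regardless of what else appears in the data
```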

In Steps 9 and 10, we built our model and imported the required classes from sklearn.metrics to measure the various scores, respectively. In Steps 11 and 12, we checked the accuracy on our training and testing datasets.
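Continuing the variables from the first sketch above, the following shows one way Steps 9 through 12 could look; the recipe does not name the estimator in this section, so MultinomialNB is used purely as a stand-in, with accuracy_score from sklearn.metrics computing the two accuracies:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Steps 9 and 10: build the model (stand-in estimator) and import the metrics
model = MultinomialNB()
model.fit(X_train_counts, y_train)

# Step 11: accuracy on the training data
train_accuracy = accuracy_score(y_train, model.predict(X_train_counts))

# Step 12: accuracy on the held-out testing data
test_accuracy = accuracy_score(y_test, model.predict(X_test_counts))

print(f'Train accuracy: {train_accuracy:.3f}, Test accuracy: {test_accuracy:.3f}')
```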
