How it works...

In the Getting ready section, we imported all the required libraries and defined the function to plot the confusion matrix. We read our dataset using UTF-8 encoding. We checked the proportion of spam and ham messages in our dataset and used the CountVectorizer and TfidfVectorizer classes to convert the texts into count vectors and TF-IDF vectors, respectively.
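The following is a minimal sketch of that vectorization step, assuming the messages live in a DataFrame column named text and the file is named spam.csv (both names are illustrative):

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    df = pd.read_csv('spam.csv', encoding='utf-8')   # hypothetical filename

    count_vectorizer = CountVectorizer()
    X_counts = count_vectorizer.fit_transform(df['text'])  # term-count matrix

    tfidf_vectorizer = TfidfVectorizer()
    X_tfidf = tfidf_vectorizer.fit_transform(df['text'])   # TF-IDF weights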

After that, we built multiple models using various algorithms. We also applied each algorithm to both the count data and the TF-IDF data.

The models need to be built in the following order:

  1. Naive Bayes on count data
  2. Naive Bayes on TF-IDF data
  3. SVM with RBF kernel on count data
  4. SVM with RBF kernel on TF-IDF data
  5. Random forest on count data
  6. Random forest on TF-IDF data

The Naive Bayes classifier is widely used for text classification in machine learning; it is based on the conditional probability of features belonging to a class. In Step 1, we built our first model with the Naive Bayes algorithm on the count data. In Step 2, we checked the performance metrics using classification_report() to see the precision, recall, f1-score, and support. In Step 3, we called plot_confusion_matrix() to plot the confusion matrix.
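A minimal sketch of Steps 1 to 3, assuming X_train_count, X_test_count, y_train, and y_test come from an earlier train/test split, and that plot_confusion_matrix() is the helper defined in the Getting ready section (its exact signature may differ):

    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import classification_report, confusion_matrix

    nb_count = MultinomialNB()
    nb_count.fit(X_train_count, y_train)          # Step 1: train on count data
    y_pred = nb_count.predict(X_test_count)

    print(classification_report(y_test, y_pred))  # Step 2: precision/recall/f1/support
    cm = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(cm, classes=['ham', 'spam'])  # Step 3: plot the matrix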

Then, in Step 4, we built the Naive Bayes model on the TF-IDF data and evaluated its performance in Step 5. In Step 6 and Step 7, we trained a support vector machine with the RBF kernel on the count data, evaluated its performance using the output from classification_report(), and plotted the confusion matrix. We also showcased an example of using GridSearchCV to find the best hyperparameters. In Step 8 and Step 9, we repeated what we did in Step 6 and Step 7, but this time, we trained the SVM on the TF-IDF data.
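A sketch of that grid search follows; the C and gamma values shown are illustrative, not necessarily those used in the recipe, and the same assumed train/test variables apply:

    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import classification_report

    # probability=True so that predict_proba() is available for the ROC
    # curves and the probability averaging later in the recipe
    param_grid = {'C': [1, 10, 100], 'gamma': [0.01, 0.001]}
    grid = GridSearchCV(SVC(kernel='rbf', probability=True), param_grid, cv=5)
    grid.fit(X_train_count, y_train)              # Step 6: SVM on count data
    print(grid.best_params_)                      # best hyperparameters found

    y_pred = grid.predict(X_test_count)           # uses the refitted best model
    print(classification_report(y_test, y_pred))  # Step 7: evaluate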

In Step 10, we trained a random forest model using grid search on the count data. We set gini and entropy as candidate values for the criterion hyperparameter. We also set multiple values for parameters such as min_samples_split, max_depth, and min_samples_leaf. In Step 11, we evaluated the model's performance.
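A sketch of the grid search in Step 10; the value ranges shown here are illustrative:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import classification_report

    rf_grid = {
        'criterion': ['gini', 'entropy'],
        'max_depth': [10, 20, None],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 5],
    }
    rf_search = GridSearchCV(RandomForestClassifier(n_estimators=100),
                             rf_grid, cv=5)
    rf_search.fit(X_train_count, y_train)         # Step 10: fit on count data

    y_pred = rf_search.predict(X_test_count)
    print(classification_report(y_test, y_pred))  # Step 11: evaluate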

We then trained another random forest model on the TF-IDF data in Step 12. Using the predict_proba() function, we got the class probabilities on our test data. We used these probabilities in Step 13 to plot the ROC curves, with AUC scores annotated on the plots, for each of the models. This helps us compare the performance of the models.
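A sketch of Steps 12 and 13, where rf_tfidf and X_test_tfidf are assumed names for the Step 12 model and the TF-IDF test matrix; if the labels are encoded as 0/1 rather than strings, pos_label can be omitted:

    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve, auc

    # probability of the positive class for each test sample
    probs = rf_tfidf.predict_proba(X_test_tfidf)[:, 1]

    fpr, tpr, _ = roc_curve(y_test, probs, pos_label='spam')
    roc_auc = auc(fpr, tpr)

    plt.plot(fpr, tpr, label='Random forest (TF-IDF), AUC = %.3f' % roc_auc)
    plt.plot([0, 1], [0, 1], linestyle='--')      # chance line for reference
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.legend()
    plt.show()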

In Step 14, we averaged the probabilities that we got from the models for both the count and TF-IDF data. We then plotted the ROC curves for the ensemble results. From Step 15 through Step 17, we plotted the test accuracy of each model built on the count data as well as the TF-IDF data.
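A sketch of the averaging in Step 14, where probs_count and probs_tfidf are assumed to hold the positive-class probabilities from a pair of count and TF-IDF models:

    import numpy as np
    from sklearn.metrics import roc_curve, auc

    # simple (unweighted) average of the two models' probabilities
    avg_probs = np.mean([probs_count, probs_tfidf], axis=0)

    fpr, tpr, _ = roc_curve(y_test, avg_probs, pos_label='spam')
    print('Ensemble AUC:', auc(fpr, tpr))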
