Let's now apply the same model to multiclass classification, using the 20 Newsgroups dataset. This dataset is small for training a CNN, so we will simplify the problem. As discussed before, the 20 classes in this dataset overlap considerably, and an SVM achieves at most about 70% accuracy on them. Here, we will instead take the six broad categories of this dataset and build a CNN classifier for those. So, first we will map the 20 categories to the six broad categories. Following is the code to first load the dataset from scikit-learn:
from sklearn.datasets import fetch_20newsgroups

def load_20newsgroup_data(categories=None, subset='all'):
    data = fetch_20newsgroups(subset=subset,
                              shuffle=True,
                              remove=('headers', 'footers', 'quotes'),
                              categories=categories)
    return data

# In the repository, this function is exposed on the Loader class
dataset = Loader.load_20newsgroup_data(subset='train')
corpus, labels = dataset.data, dataset.target

test_dataset = Loader.load_20newsgroup_data(subset='test')
test_corpus, test_labels = test_dataset.data, test_dataset.target
Next, we will map the 20 classes to the six categories as follows:
six_groups = {
    'comp.graphics': 0, 'comp.os.ms-windows.misc': 0,
    'comp.sys.ibm.pc.hardware': 0, 'comp.sys.mac.hardware': 0,
    'comp.windows.x': 0,
    'rec.autos': 1, 'rec.motorcycles': 1, 'rec.sport.baseball': 1,
    'rec.sport.hockey': 1,
    'sci.crypt': 2, 'sci.electronics': 2, 'sci.med': 2, 'sci.space': 2,
    'misc.forsale': 3,
    'talk.politics.misc': 4, 'talk.politics.guns': 4,
    'talk.politics.mideast': 4,
    'talk.religion.misc': 5, 'alt.atheism': 5, 'soc.religion.christian': 5
}

map_20_2_6 = [six_groups[dataset.target_names[i]] for i in range(20)]
labels = [six_groups[dataset.target_names[i]] for i in labels]
test_labels = [six_groups[dataset.target_names[i]] for i in test_labels]
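To make the collapsing step concrete, here is a toy illustration of the same label-mapping pattern. The small `groups` dict, `target_names` list, and labels below are illustrative stand-ins, not the real dataset:

```python
# Toy illustration of collapsing fine-grained labels into broad groups.
groups = {'comp.graphics': 0, 'rec.autos': 1, 'sci.med': 2}
target_names = ['comp.graphics', 'rec.autos', 'sci.med']

# Integer labels index into target_names, whose entries key into groups
fine_labels = [0, 1, 1, 2, 0]
broad_labels = [groups[target_names[i]] for i in fine_labels]
print(broad_labels)  # [0, 1, 1, 2, 0]
```

With only three fine-grained classes the mapping is the identity, but the same comprehension collapses all 20 newsgroup labels into 6 broad ones.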
We will apply the same preprocessing steps, followed by model initialization. Here too, we have used GloVe embeddings to initialize the word embedding vectors. The detailed code is available in the repository, in the 20newsgrp_model module. Here are the model's hyperparameters:
{
"embedding_dim":50,
"train_embedding":false,
"embedding_regularizer_l2":0.0,
"sentence_len":30,
"num_sentences":10,
"word_kernel_size":5,
"word_filters":30,
"sent_kernel_size":5,
"sent_filters":20,
"sent_k_maxpool":3,
"input_dropout":0.2,
"doc_k_maxpool":4,
"sent_dropout":0.3,
"hidden_dims":64,
"conv_activation":"relu",
"hidden_activation":"relu",
"hidden_dropout":0,
"num_hidden_layers":2,
"hidden_gaussian_noise_sd":0.3,
"final_layer_kernel_regularizer":0.01,
"hidden_layer_kernel_regularizer":0.0,
"learn_word_conv":true,
"learn_sent_conv":true,
"num_units_final_layer":6
}
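Per the sentence_len and num_sentences settings, each document is encoded as a fixed 10 × 30 grid of word indices before it reaches the word-level convolution. A minimal sketch of that truncation-and-padding step, assuming sentences arrive as lists of word IDs (the `pad_document` helper is ours, not from the repository):

```python
def pad_document(sent_word_ids, num_sentences=10, sentence_len=30, pad_id=0):
    """Truncate/pad a document (a list of sentences, each a list of
    word ids) to a fixed (num_sentences, sentence_len) grid."""
    doc = []
    for sent in sent_word_ids[:num_sentences]:        # drop extra sentences
        sent = sent[:sentence_len]                    # drop extra words
        doc.append(sent + [pad_id] * (sentence_len - len(sent)))
    while len(doc) < num_sentences:                   # pad missing sentences
        doc.append([pad_id] * sentence_len)
    return doc

grid = pad_document([[4, 8, 15], [16, 23]])
print(len(grid), len(grid[0]))  # 10 30
```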
Here are the detailed results of the model on the test set:
precision recall f1-score support
0 0.80 0.91 0.85 1912
1 0.86 0.85 0.86 1534
2 0.75 0.79 0.77 1523
3 0.88 0.34 0.49 382
4 0.78 0.76 0.77 1027
5 0.84 0.79 0.82 940
avg / total 0.81 0.80 0.80 7318
[[1733 41 114 1 14 9]
[ 49 1302 110 11 47 15]
[ 159 63 1196 5 75 25]
[ 198 21 23 130 9 1]
[ 10 53 94 0 782 88]
[ 22 30 61 0 81 746]]
0.8047280677780815
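The final number printed above is the overall test accuracy. As a sanity check, it can be recovered from the confusion matrix as the sum of the diagonal (correct predictions) over the total support:

```python
import numpy as np

# Confusion matrix reported for the CNN model on the test set
cm = np.array([[1733,   41,  114,   1,  14,   9],
               [  49, 1302,  110,  11,  47,  15],
               [ 159,   63, 1196,   5,  75,  25],
               [ 198,   21,   23, 130,   9,   1],
               [  10,   53,   94,   0, 782,  88],
               [  22,   30,   61,   0,  81, 746]])

accuracy = np.trace(cm) / cm.sum()   # 5889 correct out of 7318
print(accuracy)  # ≈ 0.8047, matching the score above
```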
Let's try an SVM on this dataset and see the best accuracy we can get:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, \
    confusion_matrix

tv = TfidfVectorizer(use_idf=True, min_df=0.00005, max_df=1.0,
                     ngram_range=(1, 1), stop_words='english',
                     sublinear_tf=True)
tv_train_features = tv.fit_transform(corpus)
tv_test_features = tv.transform(test_corpus)

# Note: gamma has no effect with a linear kernel
clf = SVC(C=1, kernel='linear', random_state=1, gamma=0.01)
svm = clf.fit(tv_train_features, labels)
preds_test = svm.predict(tv_test_features)

print(classification_report(test_labels, preds_test))
print(confusion_matrix(test_labels, preds_test))
print(accuracy_score(test_labels, preds_test))
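Tuning C for the best cross-validation accuracy can be done with scikit-learn's GridSearchCV. Here is a minimal sketch on synthetic data; the candidate grid and dataset are illustrative, not the values used for the results that follow:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic multiclass problem as a stand-in for the TF-IDF features
X, y = make_classification(n_samples=200, n_features=20, n_classes=3,
                           n_informative=5, random_state=1)

# Cross-validated search over candidate values of C
search = GridSearchCV(SVC(kernel='linear', random_state=1),
                      param_grid={'C': [0.1, 1, 10]}, cv=3)
search.fit(X, y)
print(search.best_params_['C'])
```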
The following are the results from an SVM model. We have tuned the parameter C such that we get the best cross-validation accuracy:
precision recall f1-score support
0 0.86 0.89 0.87 1912
1 0.83 0.89 0.86 1534
2 0.75 0.78 0.76 1523
3 0.87 0.73 0.80 382
4 0.82 0.75 0.79 1027
5 0.85 0.76 0.80 940
avg / total 0.82 0.82 0.82 7318
0.82344902978956
Thus, we see that this text CNN model can give comparable results on multiclass classification as well. Again, as before, the trained model can also be used to perform text summarization.