Multiclass classification with the CNN model

Let's now apply the same model to multiclass classification, using the 20 Newsgroups dataset. This dataset is quite small for training a CNN model, but we will still attempt it on a simpler version of the problem. As discussed before, the 20 classes in this dataset overlap considerably, and with an SVM we get a maximum of about 70% accuracy. Here, we will take the six broad categories of this dataset and try to build a CNN classifier for them. So, first we will map the 20 categories to the six broad categories. Following is the code to first load the dataset from scikit-learn:

from sklearn.datasets import fetch_20newsgroups

class Loader:
    # Defined as a static method so it matches the Loader.load_20newsgroup_data
    # calls below
    @staticmethod
    def load_20newsgroup_data(categories=None, subset='all'):
        data = fetch_20newsgroups(subset=subset,
                                  shuffle=True,
                                  remove=('headers', 'footers', 'quotes'),
                                  categories=categories)
        return data

dataset = Loader.load_20newsgroup_data(subset='train')
corpus, labels = dataset.data, dataset.target

test_dataset = Loader.load_20newsgroup_data(subset='test')
test_corpus, test_labels = test_dataset.data, test_dataset.target
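
To verify the load, we can inspect the newsgroup names and the corpus sizes (a quick check, not part of the repository code):

print(dataset.target_names)            # the 20 newsgroup names
print(len(corpus), len(test_corpus))   # number of training/test documents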

Next, we will map the 20 classes to the six categories as follows:

six_groups = {
    'comp.graphics': 0, 'comp.os.ms-windows.misc': 0,
    'comp.sys.ibm.pc.hardware': 0, 'comp.sys.mac.hardware': 0,
    'comp.windows.x': 0,
    'rec.autos': 1, 'rec.motorcycles': 1, 'rec.sport.baseball': 1,
    'rec.sport.hockey': 1,
    'sci.crypt': 2, 'sci.electronics': 2, 'sci.med': 2, 'sci.space': 2,
    'misc.forsale': 3,
    'talk.politics.misc': 4, 'talk.politics.guns': 4,
    'talk.politics.mideast': 4,
    'talk.religion.misc': 5, 'alt.atheism': 5, 'soc.religion.christian': 5
}

map_20_2_6 = [six_groups[dataset.target_names[i]] for i in range(20)]
labels = [six_groups[dataset.target_names[i]] for i in labels]
test_labels = [six_groups[dataset.target_names[i]] for i in test_labels]
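
As a quick sanity check (not part of the repository code), we can count how many documents fall into each of the six broad categories:

from collections import Counter

print(Counter(labels))        # training-set class distribution
print(Counter(test_labels))   # test-set class distribution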

We will apply the same preprocessing steps, followed by model initialization. Here too, we use GloVe embeddings to initialize the word embedding vectors. The detailed code is available in the repository, in the 20newsgrp_model module.
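
For reference, here is a minimal sketch (not the repository code) of how such an initialization typically works: pre-trained vectors are read from a GloVe file and copied into an embedding matrix indexed by the tokenizer's vocabulary. The file name and the toy word_index are assumptions for illustration.

import numpy as np

embedding_dim = 50
# word_index would normally come from the tokenizer; a toy vocabulary here
word_index = {'space': 1, 'hockey': 2, 'windows': 3}

# Read the pre-trained vectors; 'glove.6B.50d.txt' is the standard 50-d GloVe file
embeddings_index = {}
with open('glove.6B.50d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# Rows stay zero for words without a pre-trained vector
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

Here are the model's hyperparameters: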

{  
"embedding_dim":50,
"train_embedding":false,
"embedding_regularizer_l2":0.0,
"sentence_len":30,
"num_sentences":10,
"word_kernel_size":5,
"word_filters":30,
"sent_kernel_size":5,
"sent_filters":20,
"sent_k_maxpool":3,
"input_dropout":0.2,
"doc_k_maxpool":4,
"sent_dropout":0.3,
"hidden_dims":64,
"conv_activation":"relu",
"hidden_activation":"relu",
"hidden_dropout":0,
"num_hidden_layers":2,
"hidden_gaussian_noise_sd":0.3,
"final_layer_kernel_regularizer":0.01,
"hidden_layer_kernel_regularizer":0.0,
"learn_word_conv":true,
"learn_sent_conv":true,
"num_units_final_layer":6
}
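
These hyperparameters imply that each document is fed to the network as num_sentences=10 sentences of sentence_len=30 word ids each. As a minimal sketch, one way to shape a tokenized document into this fixed-size input is shown below; this version simply chunks a flat stream of word ids, whereas the repository code may split on actual sentence boundaries:

import numpy as np

def doc_to_matrix(token_ids, sentence_len=30, num_sentences=10):
    # Pad/truncate a flat list of word ids into a fixed
    # (num_sentences, sentence_len) matrix; unused cells stay zero (padding)
    matrix = np.zeros((num_sentences, sentence_len), dtype='int32')
    for s in range(num_sentences):
        chunk = token_ids[s * sentence_len:(s + 1) * sentence_len]
        matrix[s, :len(chunk)] = chunk
    return matrix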

Here are the detailed results of the model on the test set: the classification report, followed by the confusion matrix and the overall accuracy:

             precision    recall  f1-score   support

          0       0.80      0.91      0.85      1912
          1       0.86      0.85      0.86      1534
          2       0.75      0.79      0.77      1523
          3       0.88      0.34      0.49       382
          4       0.78      0.76      0.77      1027
          5       0.84      0.79      0.82       940

avg / total       0.81      0.80      0.80      7318

[[1733   41  114    1   14    9]
 [  49 1302  110   11   47   15]
 [ 159   63 1196    5   75   25]
 [ 198   21   23  130    9    1]
 [  10   53   94    0  782   88]
 [  22   30   61    0   81  746]]

0.8047280677780815

Note the low recall for class 3 (misc.forsale): the confusion matrix shows that 198 of its 382 test documents are predicted as class 0, the computing groups, which is plausible given that many for-sale posts concern computer hardware.

Let's try an SVM on this dataset and see the best accuracy we can get:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

tv = TfidfVectorizer(use_idf=True, min_df=0.00005, max_df=1.0,
                     ngram_range=(1, 1), stop_words='english',
                     sublinear_tf=True)
tv_train_features = tv.fit_transform(corpus)
tv_test_features = tv.transform(test_corpus)

# gamma has no effect with a linear kernel
clf = SVC(C=1, kernel='linear', random_state=1, gamma=0.01)
svm = clf.fit(tv_train_features, labels)
preds_test = svm.predict(tv_test_features)

print(classification_report(test_labels, preds_test))
print(confusion_matrix(test_labels, preds_test))
print(accuracy_score(test_labels, preds_test))
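
Here, C=1 is used; one way to tune it by cross-validation is sketched below (the grid values are illustrative assumptions, not the search space actually used):

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 0.5, 1, 5, 10]}
search = GridSearchCV(SVC(kernel='linear', random_state=1),
                      param_grid, cv=5, scoring='accuracy')
search.fit(tv_train_features, labels)
print(search.best_params_, search.best_score_)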

The following are the results from the SVM model, with the parameter C tuned to give the best cross-validation accuracy:

             precision    recall  f1-score   support

          0       0.86      0.89      0.87      1912
          1       0.83      0.89      0.86      1534
          2       0.75      0.78      0.76      1523
          3       0.87      0.73      0.80       382
          4       0.82      0.75      0.79      1027
          5       0.85      0.76      0.80       940

avg / total       0.82      0.82      0.82      7318

0.82344902978956

Thus, we see that the text CNN model gives comparable results on this multiclass classification problem as well. Again, as before, the trained model can also be used to perform text summarization.
