Let's now apply the same model to multiclass classification, using the 20 Newsgroups dataset. This dataset is small for training a CNN, so we will simplify the problem. As discussed before, the 20 classes in this dataset overlap considerably, and an SVM achieves at most about 70% accuracy on them. Here, we will instead take the six broad categories of this dataset and build a CNN classifier for those. So, first we will map the 20 categories to the six broad categories. Following is the code to first load the dataset from scikit-learn:
from sklearn.datasets import fetch_20newsgroups

def load_20newsgroup_data(categories=None, subset='all'):
    data = fetch_20newsgroups(subset=subset,
                              shuffle=True,
                              remove=('headers', 'footers', 'quotes'),
                              categories=categories)
    return data

# In the repository, this function is exposed on the Loader class
dataset = Loader.load_20newsgroup_data(subset='train')
corpus, labels = dataset.data, dataset.target

test_dataset = Loader.load_20newsgroup_data(subset='test')
test_corpus, test_labels = test_dataset.data, test_dataset.target
Next, we will map the 20 classes to the six categories as follows:
six_groups = {
    'comp.graphics': 0, 'comp.os.ms-windows.misc': 0,
    'comp.sys.ibm.pc.hardware': 0, 'comp.sys.mac.hardware': 0,
    'comp.windows.x': 0,
    'rec.autos': 1, 'rec.motorcycles': 1, 'rec.sport.baseball': 1,
    'rec.sport.hockey': 1,
    'sci.crypt': 2, 'sci.electronics': 2, 'sci.med': 2, 'sci.space': 2,
    'misc.forsale': 3,
    'talk.politics.misc': 4, 'talk.politics.guns': 4,
    'talk.politics.mideast': 4,
    'talk.religion.misc': 5, 'alt.atheism': 5, 'soc.religion.christian': 5
}

map_20_2_6 = [six_groups[dataset.target_names[i]] for i in range(20)]
labels = [six_groups[dataset.target_names[i]] for i in labels]
test_labels = [six_groups[dataset.target_names[i]] for i in test_labels]
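To make the collapsing step concrete, here is a toy illustration of the same label-mapping pattern. The small `groups` dict, `target_names` list, and labels below are illustrative stand-ins, not the real dataset:

```python
# Toy illustration of collapsing fine-grained labels into broad groups.
groups = {'comp.graphics': 0, 'rec.autos': 1, 'sci.med': 2}
target_names = ['comp.graphics', 'rec.autos', 'sci.med']

# Integer labels index into target_names, whose entries key into groups
fine_labels = [0, 1, 1, 2, 0]
broad_labels = [groups[target_names[i]] for i in fine_labels]
print(broad_labels)  # [0, 1, 1, 2, 0]
```

With only three fine-grained classes the mapping is the identity, but the same comprehension collapses all 20 newsgroup labels into 6 broad ones.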
We will apply the same preprocessing steps, followed by model initialization. Here too, we have used GloVe embeddings to initialize the word embedding vectors. The detailed code is available in the repository, in the 20newsgrp_model module. Here are the model's hyperparameters:
{
"embedding_dim":50,
"train_embedding":false,
"embedding_regularizer_l2":0.0,
"sentence_len":30,
"num_sentences":10,
"word_kernel_size":5,
"word_filters":30,
"sent_kernel_size":5,
"sent_filters":20,
"sent_k_maxpool":3,
"input_dropout":0.2,
"doc_k_maxpool":4,
"sent_dropout":0.3,
"hidden_dims":64,
"conv_activation":"relu",
"hidden_activation":"relu",
"hidden_dropout":0,
"num_hidden_layers":2,
"hidden_gaussian_noise_sd":0.3,
"final_layer_kernel_regularizer":0.01,
"hidden_layer_kernel_regularizer":0.0,
"learn_word_conv":true,
"learn_sent_conv":true,
"num_units_final_layer":6
}
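Per the sentence_len and num_sentences settings, each document is encoded as a fixed 10 × 30 grid of word indices before it reaches the word-level convolution. A minimal sketch of that truncation-and-padding step, assuming sentences arrive as lists of word IDs (the `pad_document` helper is ours, not from the repository):

```python
def pad_document(sent_word_ids, num_sentences=10, sentence_len=30, pad_id=0):
    """Truncate/pad a document (a list of sentences, each a list of
    word ids) to a fixed (num_sentences, sentence_len) grid."""
    doc = []
    for sent in sent_word_ids[:num_sentences]:        # drop extra sentences
        sent = sent[:sentence_len]                    # drop extra words
        doc.append(sent + [pad_id] * (sentence_len - len(sent)))
    while len(doc) < num_sentences:                   # pad missing sentences
        doc.append([pad_id] * sentence_len)
    return doc

grid = pad_document([[4, 8, 15], [16, 23]])
print(len(grid), len(grid[0]))  # 10 30
```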
Here are the detailed results of the model on the test set:
precision recall f1-score support
0 0.80 0.91 0.85 1912
1 0.86 0.85 0.86 1534
2 0.75 0.79 0.77 1523
3 0.88 0.34 0.49 382
4 0.78 0.76 0.77 1027
5 0.84 0.79 0.82 940
avg / total 0.81 0.80 0.80 7318
[[1733 41 114 1 14 9]
[ 49 1302 110 11 47 15]
[ 159 63 1196 5 75 25]
[ 198 21 23 130 9 1]
[ 10 53 94 0 782 88]
[ 22 30 61 0 81 746]]
0.8047280677780815
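The final number printed above is the overall test accuracy. As a sanity check, it can be recovered from the confusion matrix as the sum of the diagonal (correct predictions) over the total support:

```python
import numpy as np

# Confusion matrix reported for the CNN model on the test set
cm = np.array([[1733,   41,  114,   1,  14,   9],
               [  49, 1302,  110,  11,  47,  15],
               [ 159,   63, 1196,   5,  75,  25],
               [ 198,   21,   23, 130,   9,   1],
               [  10,   53,   94,   0, 782,  88],
               [  22,   30,   61,   0,  81, 746]])

accuracy = np.trace(cm) / cm.sum()   # 5889 correct out of 7318
print(accuracy)  # ≈ 0.8047, matching the score above
```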
Let's try an SVM on this dataset and see the best accuracy we can get:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, \
    confusion_matrix

tv = TfidfVectorizer(use_idf=True, min_df=0.00005, max_df=1.0,
                     ngram_range=(1, 1), stop_words='english',
                     sublinear_tf=True)
tv_train_features = tv.fit_transform(corpus)
tv_test_features = tv.transform(test_corpus)

# Note: gamma has no effect with a linear kernel
clf = SVC(C=1, kernel='linear', random_state=1, gamma=0.01)
svm = clf.fit(tv_train_features, labels)
preds_test = svm.predict(tv_test_features)

print(classification_report(test_labels, preds_test))
print(confusion_matrix(test_labels, preds_test))
print(accuracy_score(test_labels, preds_test))
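Tuning C for the best cross-validation accuracy can be done with scikit-learn's GridSearchCV. Here is a minimal sketch on synthetic data; the candidate grid and dataset are illustrative, not the values used for the results that follow:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic multiclass problem as a stand-in for the TF-IDF features
X, y = make_classification(n_samples=200, n_features=20, n_classes=3,
                           n_informative=5, random_state=1)

# Cross-validated search over candidate values of C
search = GridSearchCV(SVC(kernel='linear', random_state=1),
                      param_grid={'C': [0.1, 1, 10]}, cv=3)
search.fit(X, y)
print(search.best_params_['C'])
```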
The following are the results from an SVM model. We have tuned the parameter C such that we get the best cross-validation accuracy:
precision recall f1-score support
0 0.86 0.89 0.87 1912
1 0.83 0.89 0.86 1534
2 0.75 0.78 0.76 1523
3 0.87 0.73 0.80 382
4 0.82 0.75 0.79 1027
5 0.85 0.76 0.80 940
avg / total 0.82 0.82 0.82 7318
0.82344902978956
Thus, we see that this text CNN model can give comparable results on multiclass classification as well. Again, as before, the trained model can also be used to perform text summarization.