Classifying news articles by topic using a CNN

For this example, we will use a dataset of references to news web pages collected by a news aggregator. There are four categories in the dataset: science and technology, business, entertainment, and health. The complete Jupyter Notebook for this example can be found at Chapter05/03_example.ipynb in this book's code repository.

We will first look at a sample of the data from this dataset:

news_df = pd.read_csv('data/newsCorpora.csv', delimiter='\t', header=None,
                      names=['ID', 'TITLE', 'URL', 'PUBLISHER', 'CATEGORY',
                             'STORY', 'HOSTNAME', 'TIMESTAMP'])
news_df = news_df.sample(frac=1.0)  # shuffle the rows
news_df.head(5)

The dataset is represented in a tabular format as follows:

ID        TITLE                                                CATEGORY
225897    Fed's Dudley sees relatively slow rate hike...       b
26839     L'Wren Scott's death officially ruled as suicide     e
32332     Idina Menzel Says She Benefited From John Tr...      e
188058    Weak earnings and the dark cloud of Ukraine co...    b
22176     Android Wear- Google's latest PLAY                   t

Here, we are only interested in the TITLE and CATEGORY columns. The CATEGORY column contains one of the four categories to which the news item belongs, represented by the characters t, b, e, and m for science and technology, business, entertainment, and health, respectively. As in the previous example, we take the average size of the TITLE as the maximum document size, which fixes the length of the training instances.
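The code that computes the average title size is not shown in this excerpt; a minimal sketch of how it could be derived from the TITLE column is as follows (the variable name average_title_size is the one used in the rest of the example):

# A minimal sketch: take the mean number of whitespace-separated tokens per
# title and use it as the fixed sequence length for the vocabulary processor.
average_title_size = int(news_df.TITLE.str.split().apply(len).mean())
print(average_title_size)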

The TensorFlow VocabularyProcessor is used to preprocess each TITLE text into a list of words, with the length of the list fixed to the average title size. If the number of words is greater than the average title size, the text is truncated; if it is less, it is padded with zeros. This is achieved with the following code:

lencoder = LabelEncoder()
voc_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(average_title_size)
# Map each title to a fixed-length sequence of word IDs (truncated or zero-padded).
X_transform = voc_processor.fit_transform(news_df.TITLE)
X_transform = np.array(list(X_transform))
# Encode the category characters (t, b, e, m) as integer labels.
y = lencoder.fit_transform(news_df.CATEGORY.values)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X_transform, y,
                                                                    test_size=0.2,
                                                                    random_state=42)
n_words = len(voc_processor.vocabulary_)
n_classes = len(lencoder.classes_)
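As a quick sanity check (not part of the original listing), we can verify that every title has been mapped to a fixed-length sequence of word IDs and inspect the encoded labels:

print(X_transform.shape)   # (number_of_titles, average_title_size)
print(X_transform[0])      # word IDs, zero-padded up to average_title_size
print(n_words, n_classes)  # vocabulary size and the number of categories (4)
print(lencoder.classes_)   # the category characters in sorted order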

The AdamOptimizer with sparse softmax cross-entropy is used for training the model, as before. For the model, we replace the GRUCell from the previous example with a convolutional layer followed by max pooling. The number of filters is set to 5 and the filter height to 3, with the width equal to the embedding size. Therefore, the convolution is performed three words at a time with a single stride, as specified in the strides parameter. The number of convolutional layers is set to one for a simple architecture and faster training; you can experiment with increasing both the number of filters and the number of convolutional layers. Note that we have also reshaped the input embeddings into a four-dimensional tensor, with the last (channel) dimension set to 1. The following code shows how the CNN model is constructed:

filter_size = 3
num_filters = 5

def cnn_model_fn(features, labels, mode):
    # Look up word embeddings for the title word IDs.
    news_word_vectors = tf.contrib.layers.embed_sequence(features[NEWS_FT], vocab_size=n_words,
                                                         embed_dim=WORD_EMBEDDING_SIZE)
    # Add a channel dimension so the input becomes a four-dimensional tensor.
    news_word_vectors = tf.expand_dims(news_word_vectors, -1)
    filter_shape = [filter_size, WORD_EMBEDDING_SIZE, 1, num_filters]
    W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")
    b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")
    # Convolve over three words at a time with a single stride.
    conv1 = tf.nn.conv2d(news_word_vectors,
                         W,
                         strides=[1, 1, 1, 1],
                         padding="VALID",
                         name="conv1")
    relu1 = tf.nn.relu(tf.nn.bias_add(conv1, b), name="relu")
    # Max pool over the entire title length.
    pool1 = tf.nn.max_pool(relu1,
                           ksize=[1, average_title_size - 3 + 1, 1, 1],
                           strides=[1, 1, 1, 1],
                           padding='VALID',
                           name="pool1")
    activations1 = tf.contrib.layers.flatten(pool1)
    logits = tf.contrib.layers.fully_connected(activations1, n_classes, activation_fn=None)
    return get_estimator_spec(input_logits=logits, out_lb=labels, train_predict_m=mode)
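The helper function get_estimator_spec is defined elsewhere in the notebook and is not shown here. Based on the training setup described above (sparse softmax cross-entropy loss optimized with AdamOptimizer), a plausible sketch of it is the following; the internals and the learning rate are assumptions for illustration, not the book's exact code:

def get_estimator_spec(input_logits, out_lb, train_predict_m):
    # Class predictions and probabilities derived from the logits.
    predictions = {
        'class': tf.argmax(input_logits, 1),
        'prob': tf.nn.softmax(input_logits)
    }
    if train_predict_m == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode=train_predict_m, predictions=predictions)
    # Sparse softmax cross-entropy over the four news categories.
    loss = tf.losses.sparse_softmax_cross_entropy(labels=out_lb, logits=input_logits)
    if train_predict_m == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.AdamOptimizer(learning_rate=0.001)  # learning rate assumed
        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode=train_predict_m, loss=loss, train_op=train_op)
    # EVAL mode: report accuracy on the held-out data.
    eval_metric_ops = {'accuracy': tf.metrics.accuracy(labels=out_lb,
                                                       predictions=predictions['class'])}
    return tf.estimator.EstimatorSpec(mode=train_predict_m, loss=loss,
                                      eval_metric_ops=eval_metric_ops)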

As before, we train the model for 100 training steps using the following code and then evaluate it on the test data:

run_config = tf.contrib.learn.RunConfig()
run_config = run_config.replace(model_dir='/tmp/models/',
                                save_summary_steps=10,
                                log_step_count_steps=10)
classifier = tf.estimator.Estimator(model_fn=cnn_model_fn, config=run_config)
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={NEWS_FT: X_train},
    y=y_train,
    batch_size=len(X_train),
    num_epochs=None,
    shuffle=True)
classifier.train(input_fn=train_input_fn, steps=100)
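The evaluation code itself is not shown in this excerpt. A sketch of how the accuracy and confusion matrix reported below could be produced is given here; it assumes the prediction dictionary key 'class' from the get_estimator_spec sketch above and uses scikit-learn's metrics module:

from sklearn import metrics  # may already be imported at the top of the notebook

test_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={NEWS_FT: X_test},
    y=y_test,
    num_epochs=1,
    shuffle=False)
predictions = [p['class'] for p in classifier.predict(input_fn=test_input_fn)]
print('Accuracy: {0:f}'.format(metrics.accuracy_score(y_test, predictions)))
print(metrics.confusion_matrix(y_test, predictions))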

The following is the output for the test data:

INFO:tensorflow:Restoring parameters from /tmp/models/model.ckpt-27
Accuracy: 0.903094
[[20990   534   108  1559]
 [  410 29606   142   331]
 [  493  1432  5741  1381]
 [ 1266   369   162 19960]]

We can see both the accuracy and the confusion matrix. The accuracy is around 90% and the diagonal of the confusion matrix shows the correct predictions for each category. We can visualize the model graph and learning metric plots in TensorBoard:

Training loss and steps/sec

We can see the training loss decreasing with the number of training steps. The accuracy can be improved by adding more convolutional layers and increasing the number of filters; a sketch of such a deeper variant follows the model graph below. The model graph itself is shown in the following screenshot:

CNN model for text classification

The graph shows the input, embedding layer, convolutional layer, max pooling, dense layer, and the softmax output.
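As noted above, one way to try to improve the accuracy is to stack additional convolutional layers and use more filters. The following fragment is an illustrative sketch of how the single convolution/pooling block inside cnn_model_fn could be replaced by two stacked layers; the filter counts and pooling sizes are assumed values, not tuned settings from the book:

# Inside cnn_model_fn, after the embedding lookup and tf.expand_dims call:
conv1 = tf.layers.conv2d(news_word_vectors, filters=8,
                         kernel_size=[filter_size, WORD_EMBEDDING_SIZE],
                         padding='valid', activation=tf.nn.relu)
pool1 = tf.layers.max_pooling2d(conv1, pool_size=[2, 1], strides=[2, 1])
# A second convolution over the pooled feature maps; 'same' padding keeps the
# sequence dimension from shrinking below a usable size for short titles.
conv2 = tf.layers.conv2d(pool1, filters=16, kernel_size=[filter_size, 1],
                         padding='same', activation=tf.nn.relu)
activations = tf.contrib.layers.flatten(conv2)
logits = tf.contrib.layers.fully_connected(activations, n_classes, activation_fn=None)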

We have seen how the same neural network meta-architecture can be utilized for text classification. The only difference between the two examples is that the first used an RNN as the deep representation, whereas the latter used a CNN. In both examples, the input word embeddings were learned from scratch using the training data. In situations where we have limited training data, we can turn to other approaches such as transfer learning. In the next example, we will use pre-trained word embeddings for text classification.
