As a first example, we will look into the problem of identifying spam in YouTube video comments. The complete Jupyter Notebook for this example is available at Chapter05/02_example.ipynb in this book's code repository. The data contains comments with binary labels specifying whether each comment is genuine or spam. The code that follows loads the comments from CSV files into a pandas DataFrame:
import pandas as pd

comments_df_list = []
comments_file = ['data/Youtube01-Psy.csv', 'data/Youtube02-KatyPerry.csv',
                 'data/Youtube03-LMFAO.csv', 'data/Youtube04-Eminem.csv',
                 'data/Youtube05-Shakira.csv']
# Read each per-video CSV file into its own DataFrame
for f in comments_file:
    df = pd.read_csv(f, header=0)
    comments_df_list.append(df)
# Combine the files and shuffle the rows
comments_df = pd.concat(comments_df_list)
comments_df = comments_df.sample(frac=1.0)
print(comments_df.shape)
comments_df.head(5)
The following output shows a sample of the YouTube comments with the various fields:
| | COMMENT_ID | AUTHOR | DATE | CONTENT | CLASS |
|---|---|---|---|---|---|
| 102 | z12dfr5irwr5chwm3232gvnq2laqcdezn04 | Carlos Rueda | 2015-05-22T15:04:20.310000 | I am going to blow my mind | 0 |
| 117 | z133ibkihkmaj3bfq22rilaxmp2yt54nb | Debora Favacho (Debora Sparkle) | 2015-05-21T14:08:41.338000 | BEST SONG EVER X3333333333 | 0 |
| 331 | _2viQ_Qnc68Qq98m0mmx4rlprYiD6aYgMb2x3bdupEM | Hidden Love | 2013-08-01T09:19:56.654000 | Hi. Check out and share our songs. | 1 |
| 322 | z13cedgolkfvw3xey22kcnzrfm3egjj0z | Rafael Diaz Jr | 2015-01-25T20:57:46.039000 | Check out this video on YouTube: | 1 |
| 133 | LneaDw26bFugQanw0UtVOqzEgWt6mBD0k6SsEV7u968 | Jacob Johnson | NaN | You guys should check out this EXTRAORDINARY w... | 1 |
Here, we read the .csv files of comments from five popular YouTube videos and combine them into a single pandas DataFrame. The sample output includes both genuine comments and spam. The CONTENT column contains the comment text, and the CLASS column is set to 1 for spam and 0 otherwise. We will use the average comment size as the maximum size of each comment, computed with the following code. Note that len(c) measures each comment in characters, while the resulting value is later used as a limit on the number of words. Any comment with more words than this limit will be truncated so that the training data has a fixed length:
average_comments_size = int(sum([len(c) for c in comments_df.CONTENT])/comments_df.shape[0])
print(average_comments_size)
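The difference between measuring a comment in characters and in words can be seen in the following minimal sketch. It uses three made-up comments (not taken from the dataset), and whitespace splitting is a simplifying assumption rather than the tokenizer used by the notebook:

```python
# Hypothetical comments, used only for illustration.
sample_comments = [
    "BEST SONG EVER",
    "Check out this video on YouTube",
    "Hi",
]

# Average length in characters: this is what len(c) measures for a string.
avg_chars = int(sum(len(c) for c in sample_comments) / len(sample_comments))

# Average length in whitespace-separated words (a simplifying assumption).
avg_words = int(sum(len(c.split()) for c in sample_comments) / len(sample_comments))

print(avg_chars, avg_words)
```

The character-based average is several times larger than the word-based one, so a limit derived from it will rarely truncate a comment and will mostly lead to padding.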
We use a VocabularyProcessor to map each comment to a fixed-length sequence of word IDs, and then split the data into training and test sets with an 80% to 20% ratio, as shown in the following code:
import numpy as np
import tensorflow as tf
from sklearn import model_selection

# Map each comment to a fixed-length sequence of word IDs
vocabulary_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(average_comments_size)
X_transform = vocabulary_processor.fit_transform(comments_df.CONTENT)
X_transform = np.array(list(X_transform))
y = comments_df.CLASS.values
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X_transform, y, test_size=0.2, random_state=42)
# Vocabulary size, used later by the embedding layer
n_words = len(vocabulary_processor.vocabulary_)
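To make the preprocessing step more concrete, the following is a simplified stand-in for what a vocabulary processor does: it assigns an integer ID to each distinct word and pads or truncates every comment to a fixed length. The function names `build_vocab` and `encode` are hypothetical, and the whitespace tokenization, lowercasing, and use of ID 0 for padding and unknown words are assumptions of this sketch, not a description of the TensorFlow implementation:

```python
def build_vocab(texts):
    """Assign an integer ID (starting at 1) to each distinct lowercase word."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1  # ID 0 is kept for padding/unknown
    return vocab


def encode(text, vocab, max_len):
    """Map words to IDs, truncate to max_len, and right-pad with zeros."""
    ids = [vocab.get(word, 0) for word in text.lower().split()]
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))


# Made-up comments for illustration only.
texts = ["check out my channel now", "best song ever", "check out this song"]
vocab = build_vocab(texts)
encoded = [encode(t, vocab, max_len=4) for t in texts]
print(encoded)
```

In this sketch, the first comment is truncated to four IDs, the second is padded with a trailing zero, and the word shared by the second and third comments receives the same ID in both.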
In this example and subsequent examples in this chapter, we will utilize the TensorFlow estimator API to create, train, and test the models. Estimators in TensorFlow provide an easy-to-use interface for building the graph, initializing variables, creating checkpoint files, and saving summaries for TensorBoard viewing. We will use the following code to create the estimator specification for each mode (prediction, training, and evaluation):
def get_estimator_spec(input_logits, out_lb, train_predict_m):
    # Predicted class is the index of the largest logit
    preds_cls = tf.argmax(input_logits, 1)
    if train_predict_m == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(
            mode=train_predict_m,
            predictions={
                'pred_class': preds_cls,
                'pred_prob': tf.nn.softmax(input_logits)
            })
    tr_l = tf.losses.sparse_softmax_cross_entropy(labels=out_lb, logits=input_logits)
    if train_predict_m == tf.estimator.ModeKeys.TRAIN:
        adm_opt = tf.train.AdamOptimizer(learning_rate=0.01)
        tr_op = adm_opt.minimize(tr_l, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(train_predict_m, loss=tr_l, train_op=tr_op)
    # Evaluation mode: report accuracy (no training op exists in this branch)
    eval_metric_ops = {'accuracy': tf.metrics.accuracy(labels=out_lb, predictions=preds_cls)}
    return tf.estimator.EstimatorSpec(train_predict_m, loss=tr_l, eval_metric_ops=eval_metric_ops)
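The loss used in get_estimator_spec can be illustrated with a small standalone calculation. The following sketch uses only the Python standard library and made-up logits: it computes the softmax of each row of logits and then the negative log-probability of the true class, averaged over the batch. It illustrates the general technique and is not TensorFlow's implementation; the function names are chosen for this example:

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def sparse_softmax_cross_entropy(labels, batch_logits):
    """Mean of -log(p[true class]) over the batch; labels are integer class IDs."""
    losses = [-math.log(softmax(logits)[label])
              for label, logits in zip(labels, batch_logits)]
    return sum(losses) / len(losses)


# Hypothetical logits for three comments over the classes (0 = genuine, 1 = spam).
batch_logits = [[2.0, -1.0], [0.0, 0.0], [-3.0, 3.0]]
labels = [0, 1, 1]

loss = sparse_softmax_cross_entropy(labels, batch_logits)
print(round(loss, 4))
```

The second example, whose logits are equal, contributes the largest share of the loss because the model assigns only 50% probability to the true class.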
The loss is computed with the tf.losses.sparse_softmax_cross_entropy function, which applies a softmax to the logits to obtain a probability distribution across the two classes and then computes the cross-entropy against the true labels. The AdamOptimizer is used to minimize this loss. The model function that produces the logits is defined as follows:
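# COMMENTS_FT (the feature key) and EMBED_DIMENSION (the embedding size) are
# constants assumed to be defined earlier in the notebook.
def rnn_model_fn(features, labels, mode):
    # Look up an embedding vector for each word ID
    comments_wd_vec = tf.contrib.layers.embed_sequence(
        features[COMMENTS_FT], vocab_size=n_words, embed_dim=EMBED_DIMENSION)
    # Split the sequence into a list with one tensor per time step
    comments_word_list = tf.unstack(comments_wd_vec, axis=1)
    rnn_cell = tf.nn.rnn_cell.GRUCell(average_comments_size)
    # Keep only the final state as the encoding of the whole comment
    _, comments_encoding = tf.nn.static_rnn(rnn_cell, comments_word_list, dtype=tf.float32)
    logits = tf.layers.dense(inputs=comments_encoding, units=2, activation=None)
    return get_estimator_spec(input_logits=logits, out_lb=labels, train_predict_m=mode)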
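To make the recurrent step in the model function concrete, the following NumPy sketch implements a single GRU update and runs it over a short sequence of stand-in word vectors. The weights are random and the sizes are arbitrary, so this only illustrates the gating arithmetic. It is not the internals of tf.nn.rnn_cell.GRUCell, and the helper names `sigmoid` and `gru_step` are hypothetical:

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_step(x, h, params):
    """One GRU update for input vector x and previous hidden state h."""
    xh = np.concatenate([x, h])
    z = sigmoid(params["W_z"] @ xh + params["b_z"])  # update gate
    r = sigmoid(params["W_r"] @ xh + params["b_r"])  # reset gate
    # Candidate state: the reset gate scales how much of h is used.
    c = np.tanh(params["W_c"] @ np.concatenate([x, r * h]) + params["b_c"])
    # Interpolate between the old state and the candidate.
    return z * h + (1.0 - z) * c


rng = np.random.default_rng(0)
embed_dim, hidden_dim, seq_len = 4, 3, 5  # arbitrary sizes for illustration
params = {
    name: rng.normal(scale=0.1, size=(hidden_dim, embed_dim + hidden_dim))
    for name in ("W_z", "W_r", "W_c")
}
params.update({name: np.zeros(hidden_dim) for name in ("b_z", "b_r", "b_c")})

sequence = rng.normal(size=(seq_len, embed_dim))  # stand-in for embedded words
h = np.zeros(hidden_dim)
for x in sequence:
    h = gru_step(x, h, params)

print(h.shape)
```

The final value of h plays the role of comments_encoding in the model function: a fixed-size summary of the whole comment that the dense layer turns into two logits.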
As described in the previous section on the meta-architecture for text classification, the model uses an embedding layer followed by a GRUCell. The final state of the GRU is fed to a dense layer that computes the logits, and a softmax over the logits gives the class probabilities from which the prediction is taken. The GRU cell used in this example is similar to the LSTM, with the difference being that it exposes its full hidden state without an output gate. The LSTM therefore has one more gate (input, forget, and output) than the GRU (update and reset), and it also maintains a separate cell state. In addition, GRUs may not retain long-term word associations as well as LSTMs. However, for this specific task, the differences may not be significant. You can also experiment with replacing the GRU cell with an LSTM cell in the code. The estimator is then created with the following code:
run_config = tf.contrib.learn.RunConfig()
run_config = run_config.replace(model_dir='/tmp/models/',
                                save_summary_steps=10,
                                log_step_count_steps=10)
classifier = tf.estimator.Estimator(model_fn=rnn_model_fn, config=run_config)
We store the model checkpoint files under the /tmp/models directory using RunConfig. The logging and summary step frequencies are also changed from their default value of 100 to 10 for better visualization in TensorBoard. You can adjust these values as needed in the code. Finally, we train the model for 200 steps by using the following code:
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={COMMENTS_FT: X_train},
    y=y_train,
    batch_size=128,
    num_epochs=None,
    shuffle=True)
classifier.train(input_fn=train_input_fn, steps=200)
The following is the output of the preceding code:
INFO:tensorflow:Saving checkpoints for 200 into /tmp/models/model.ckpt.
INFO:tensorflow:Loss for final step: 0.000836024.
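The interaction between num_epochs=None, shuffle=True, and steps=200 can be sketched with a plain Python generator: the data is shuffled and repeated indefinitely, and training simply stops after a fixed number of batches. This is a simplified stand-in that assumes the data is reshuffled on each pass; it does not reproduce the queue-based behaviour of numpy_input_fn, and `batch_stream` is a hypothetical helper:

```python
import itertools
import random


def batch_stream(examples, batch_size, seed=0):
    """Yield shuffled batches forever, reshuffling at the start of each pass."""
    rng = random.Random(seed)
    while True:
        order = list(range(len(examples)))
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            yield [examples[i] for i in order[start:start + batch_size]]


examples = list(range(10))  # stand-in for ten training comments
steps = 5                   # analogous to steps=200 in the text

# Take a fixed number of batches from the endless stream.
batches = list(itertools.islice(batch_stream(examples, batch_size=4), steps))
batch_sizes = [len(b) for b in batches]
print(batch_sizes)
```

Because the stream never ends on its own, the number of steps, rather than the number of epochs, determines how long training runs.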
The model is evaluated on the test data, and we obtain an accuracy of approximately 91%, as you can see from the following code and its output:
from sklearn import metrics

test_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={COMMENTS_FT: X_test},
    y=y_test,
    num_epochs=1,
    shuffle=False)
# predict() returns a generator of dictionaries, one per test example
preds = classifier.predict(input_fn=test_input_fn)
y_predicted = np.array(list(p['pred_class'] for p in preds))
y_predicted = y_predicted.reshape(np.array(y_test).shape)
acc = metrics.accuracy_score(y_test, y_predicted)
print('Accuracy: {0:f}'.format(acc))
The following is the output of the preceding code:
INFO:tensorflow:Restoring parameters from /tmp/models/model.ckpt-200
Accuracy: 0.905612
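Accuracy is the fraction of test comments whose predicted class matches the label. For a spam filter, it can also help to see how the errors split between missed spam and genuine comments flagged as spam. The following sketch computes these quantities in plain Python for made-up labels and predictions (these are not the notebook's results):

```python
# Hypothetical labels and predictions (1 = spam, 0 = genuine).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # genuine kept
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # genuine flagged as spam
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)  # of the comments flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam, how much was caught
print(accuracy, precision, recall)
```

The same counts could be obtained from the y_test and y_predicted arrays of the preceding code in place of the made-up lists.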
To visualize the graph and training process, start TensorBoard with the logdir pointing to the same path we used in RunConfig. Point the browser to localhost:6006 (the default) to view the graphs and plots. We can see that the loss steadily decreases with the training step, as shown in the following screenshot:
Visualization of the model graph shows the input, embedding layer, RNN cell, the dense layer, and the softmax output:
In the embedding projection, we can also see a clear separation of the word embeddings into two clusters learned by the model. The following screenshot shows an example of the words associated with spam and those associated with genuine comments. This shows how the model has pushed the word vectors associated with these two classes into separate clusters:
We have seen how an RNN-based deep learning model can be used for text classification. As explained previously, we can also use a CNN to do text classification, which we will explore next.