Building a review sentiment classifier

Let's now build a sentiment classifier by training the preceding CNN document model. We will use the Amazon Reviews for Sentiment Analysis dataset from https://www.kaggle.com/bittlingmayer/amazonreviews to train this model. This dataset consists of a few million Amazon customer reviews (input text) and star ratings (output labels). The data format is as follows: the label, followed by a space, then the review title, followed by a colon and a space, and finally the review text. This dataset is much bigger than the popular IMDB movie review dataset, and it also contains a diverse set of reviews of various products as well as movies:

__label__<X> <summary/title>: <Review Text>

Example:
__label__2 Good Movie: Awesome.... simply awesome. I couldn't put this down and laughed, smiled, and even got tears! A brand new favorite author.

Here, __label__1 corresponds to 1- and 2-star reviews, and __label__2 corresponds to 4- and 5-star reviews. However, 3-star reviews, that is, reviews with a neutral sentiment, are not included in this dataset. The dataset contains 3.6 million training examples and 400,000 test examples in total. We will start with a random sample of 200,000 training examples so that we can find good hyperparameters before proceeding to the full training:

train_df = Loader.load_amazon_reviews('train')
print(train_df.shape)

test_df = Loader.load_amazon_reviews('test')
print(test_df.shape)

dataset = train_df.sample(n=200000, random_state=42)
dataset.sentiment.value_counts()
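
Here, Loader is a helper class from the book's code base; conceptually, it parses the FastText-style format shown above into a DataFrame with review and sentiment columns. The following is a minimal sketch of such a parser, assuming the extracted train.ft.txt/test.ft.txt files from the Kaggle dataset and mapping __label__2 to 1 (positive) and __label__1 to 0 (negative); the file paths are hypothetical:

import pandas as pd

def load_fasttext_reviews(path):
    # each line looks like: __label__2 <title>: <review text>
    labels, reviews = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            label, _, text = line.partition(' ')
            labels.append(1 if label == '__label__2' else 0)
            reviews.append(text.strip())
    return pd.DataFrame({'review': reviews, 'sentiment': labels})

# Usage (hypothetical file paths):
# train_df = load_fasttext_reviews('data/train.ft.txt')
# test_df = load_fasttext_reviews('data/test.ft.txt')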

Next, we extract the review text and sentiment labels from the sampled data and use the Preprocess class to convert the corpus into padded word index sequences, as follows:

corpus = dataset['review'].values
target = dataset['sentiment'].values

preprocessor = Preprocess()
corpus_to_seq = preprocessor.fit(corpus=corpus)

holdout_corpus = test_df['review'].values
holdout_target = test_df['sentiment'].values
holdout_corpus_to_seq = preprocessor.transform(holdout_corpus)
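
The Preprocess class is part of the book's code base; conceptually, fit builds the word index on the training corpus and transform encodes each document as a fixed-size block of NUM_SENTENCES x SENTENCE_LEN word indices, flattened into one sequence. The following is a rough sketch of that behaviour, assuming NLTK for sentence splitting (requires the punkt data) and the Keras text utilities:

import numpy as np
from nltk.tokenize import sent_tokenize
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

NUM_SENTENCES, SENTENCE_LEN = 10, 30

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)            # builds word_index on the training corpus

def doc_to_seq(doc):
    # split into sentences, index and pad each one, then pad to NUM_SENTENCES sentences
    sentences = sent_tokenize(doc)[:NUM_SENTENCES]
    seqs = pad_sequences(tokenizer.texts_to_sequences(sentences),
                         maxlen=SENTENCE_LEN, padding='post')
    padded = np.zeros((NUM_SENTENCES, SENTENCE_LEN), dtype='int32')
    padded[:len(seqs)] = seqs
    return padded.flatten()               # one document -> NUM_SENTENCES * SENTENCE_LEN indices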

Let's initialize the embeddings with GloVe using the GloVe class, and build the document model. We also need to define the document model parameters, such as the number of convolution filters, the activation functions, the number of hidden units, and so on. To avoid overfitting, we can interleave dropout layers between the input layer, the convolution layers, and even the final dense layer. Also, we observed that for the dense layer, a Gaussian noise layer acts as a better regularizer. The DocumentModel class can be initialized with all of these parameters, as shown in the following code. To come up with a good initialization for the model parameters, we start with a small number of epochs and a small sample of training examples. We initially started with six word convolution filters, as mentioned in the paper for the IMDB data, and found that the model was underfitting: the training accuracy was not going beyond 80%. We then slowly increased the number of word filters, and similarly found a good number of sentence convolution filters. We tried both ReLU and tanh activations for the convolution layers; the paper (https://arxiv.org/pdf/1406.3830.pdf) uses tanh activations for its model:

glove = GloVe(50)
initial_embeddings = glove.get_embedding(preprocessor.word_index)

amazon_review_model = DocumentModel(
    vocab_size=preprocessor.get_vocab_size(),
    word_index=preprocessor.word_index,
    num_sentences=Preprocess.NUM_SENTENCES,
    embedding_weights=initial_embeddings,
    conv_activation='tanh',
    hidden_dims=64,
    input_dropout=0.40,
    hidden_gaussian_noise_sd=0.5)
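
The GloVe class ships with the book's code; conceptually, get_embedding maps every word in the preprocessor's word_index to its pre-trained 50-dimensional vector to form the initial embedding matrix. The following is a minimal sketch of that step; the glove.6B.50d.txt path is an assumption, and in this sketch words missing from GloVe keep a small random initialization:

import numpy as np

def build_glove_embeddings(word_index, dim=50, path='glove.6B.50d.txt'):
    # load pre-trained vectors into a word -> vector dictionary
    vectors = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = np.asarray(parts[1:], dtype='float32')
    # row 0 is reserved for padding; remaining rows follow the preprocessor's word_index
    matrix = np.random.uniform(-0.05, 0.05, (len(word_index) + 1, dim))
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix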

The following is the full list of parameters for this model, which we used for training on the full 3.6 million training examples:

{
  "embedding_dim": 50,
  "train_embedding": true,
  "sentence_len": 30,
  "num_sentences": 10,
  "word_kernel_size": 5,
  "word_filters": 30,
  "sent_kernel_size": 5,
  "sent_filters": 16,
  "sent_k_maxpool": 3,
  "input_dropout": 0.4,
  "doc_k_maxpool": 4,
  "sent_dropout": 0,
  "hidden_dims": 64,
  "conv_activation": "relu",
  "hidden_activation": "relu",
  "hidden_dropout": 0,
  "num_hidden_layers": 1,
  "hidden_gaussian_noise_sd": 0.5,
  "final_layer_kernel_regularizer": 0.0,
  "learn_word_conv": true,
  "learn_sent_conv": true
}
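
Note that these parameters also determine the fixed document length the network expects: each document is encoded as num_sentences * sentence_len = 10 * 30 = 300 word indices. A quick sanity check on the preprocessed training data (a sketch, using the corpus_to_seq output from the earlier step) would be:

import numpy as np

# each row should hold num_sentences * sentence_len = 300 word indices
print(np.array(corpus_to_seq).shape)   # expected: (200000, 300) for the sampled data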

Finally, before we start the full training, we need to figure out a good batch size. With large batch sizes, such as 256, training was very slow, so we used a batch size of 64. We used the rmsprop optimizer to train our model and started with the default learning rate that Keras uses (0.001). Here is the complete list of training parameters, which are stored in the TrainingParameters class:

   {"seed":55,
"batch_size":64,
"num_epochs":35,
"validation_split":0.05,
"optimizer":"rmsprop",
"learning_rate":0.001}

The following is the code to start the training:

import numpy as np
from keras.callbacks import ModelCheckpoint

train_params = TrainingParameters('model_with_tanh_activation')

amazon_review_model.get_classification_model().compile(
    loss="binary_crossentropy",
    optimizer=train_params.optimizer,
    metrics=["accuracy"])

# save the best weights (by validation loss) seen during training
checkpointer = ModelCheckpoint(filepath=train_params.model_file_path,
                               verbose=1,
                               save_best_only=True,
                               save_weights_only=True)

x_train = np.array(corpus_to_seq)
y_train = np.array(target)

x_test = np.array(holdout_corpus_to_seq)
y_test = np.array(holdout_target)

amazon_review_model.get_classification_model().fit(
    x_train, y_train,
    batch_size=train_params.batch_size,
    epochs=train_params.num_epochs,
    verbose=2,
    validation_split=train_params.validation_split,
    callbacks=[checkpointer])

We trained this model on a CPU, and the following is the result after the first five epochs. With 190,000 training samples, even a single epoch is quite slow, taking about 10 minutes to run. However, as you can see in the following output, the training and validation accuracy reaches around 92% within five epochs, which is quite good:

Train on 190000 samples, validate on 10000 samples
Epoch 1/35
- 577s - loss: 0.3891 - acc: 0.8171 - val_loss: 0.2533 - val_acc: 0.8369
Epoch 2/35
- 614s - loss: 0.2618 - acc: 0.8928 - val_loss: 0.2198 - val_acc: 0.9137
Epoch 3/35
- 581s - loss: 0.2332 - acc: 0.9067 - val_loss: 0.2105 - val_acc: 0.9191
Epoch 4/35
- 640s - loss: 0.2197 - acc: 0.9128 - val_loss: 0.1998 - val_acc: 0.9206
Epoch 5/35
...
...

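Once training finishes (or is stopped early), the best weights saved by the checkpointer can be restored and the network scored on the holdout data. Here is a minimal evaluation sketch, assuming the x_test/y_test arrays and train_params from the earlier snippets:

# restore the best checkpointed weights and evaluate on the 400,000-review holdout set
model = amazon_review_model.get_classification_model()
model.load_weights(train_params.model_file_path)
loss, accuracy = model.evaluate(x_test, y_test,
                                batch_size=train_params.batch_size, verbose=0)
print('Holdout accuracy: {:.2%}'.format(accuracy))
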
We evaluated this model on the holdout set of 400,000 reviews and got 92% accuracy there as well. This clearly shows that the model fits this review data well, and with more data there is still some scope for improvement. In the whole training process so far, the main use of transfer learning has been the GloVe embedding vectors used to initialize the word embeddings. Here, as we have a huge amount of data, we could have learned these weights from scratch. However, let's see which word embeddings were updated the most during training.
