Transfer learning – application to the IMDB dataset

One of the scenarios where we should use transfer learning is when we have very little labeled data for the task at hand, but plenty of training data from a similar but different domain. The IMDB dataset (http://ai.stanford.edu/~amaas/data/sentiment/) is a binary sentiment classification dataset. It has a set of 25,000 movie reviews for training and 25,000 for testing. There are many published papers on this dataset, and probably the best result on it is achieved by the paragraph vector of Le and Mikolov (https://arxiv.org/pdf/1405.4053.pdf) from Google, which reached 92.58% accuracy. An SVM achieved 89%. This dataset is of decent size, so we could train the CNN model from scratch on it and get a result on a par with an SVM; this is discussed in the next section.

Now, let's try to build a model with a small sample of the IMDB data, say 5% of it. In many practical scenarios, we face the problem of insufficient training data, and we cannot train a CNN with such a small dataset. So, we will use transfer learning to build a model for this dataset.

We first follow the same preprocessing and data preparation steps that we used for the Amazon reviews dataset:

train_df = Loader.load_imdb_data(directory = 'train')
# take only a 5% sample of the training data
train_df = train_df.sample(frac=0.05, random_state = train_params.seed)
print(train_df.shape)

test_df = Loader.load_imdb_data(directory = 'test')
print(test_df.shape)

corpus = train_df['review'].tolist()
target = train_df['sentiment'].tolist()
corpus, target = remove_empty_docs(corpus, target)
print(len(corpus))

preprocessor = Preprocess(corpus=corpus)
corpus_to_seq = preprocessor.fit()

test_corpus = test_df['review'].tolist()
test_target = test_df['sentiment'].tolist()
test_corpus, test_target = remove_empty_docs(test_corpus, test_target)
print(len(test_corpus))

test_corpus_to_seq = preprocessor.transform(test_corpus)

x_train = np.array(corpus_to_seq)
x_test = np.array(test_corpus_to_seq)

y_train = np.array(target)
y_test = np.array(test_target)

print(x_train.shape, y_train.shape)

glove = GloVe(50)
initial_embeddings = glove.get_embedding(preprocessor.word_index)

IMDB model

Now, let's load the pretrained model first. The DocumentModel class has two methods for this: one to load the model hyperparameters and rebuild the model, and another to load the learned model weights:

def load_model(file_name):
    with open(file_name, "r", encoding="utf-8") as hp_file:
        model_params = json.load(hp_file)
        doc_model = DocumentModel(**model_params)
        print(model_params)
    return doc_model

def load_model_weights(self, model_weights_filename):
    self._model.load_weights(model_weights_filename, by_name=True)

Then, we use the preceding methods to load the pretrained model and transfer its learned weights to the new model, as follows. The embedding matrix of the pretrained model is bigger and has more words than this corpus, so we cannot use it directly. Instead, we use the update_embeddings method of the GloVe class to update the GloVe-initialized embeddings for the IMDB model with the embeddings from the trained model:

amazon_review_model = DocumentModel.load_model("model_file.json")
amazon_review_model.load_model_weights("model_weights.hdf5")
learned_embeddings = amazon_review_model.get_classification_model()\
                         .get_layer('embedding').get_weights()[0]

# update the GloVe embeddings
glove.update_embeddings(preprocessor.word_index,
                        np.array(learned_embeddings),
                        amazon_review_model.word_index)
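
Conceptually, update_embeddings copies the learned vector of every word that the two vocabularies share into the GloVe-initialized matrix, while words the pretrained model has never seen keep their GloVe vectors. The following is only an illustrative sketch of that idea (a hypothetical standalone helper, not the actual implementation inside the GloVe class):

def update_embeddings_sketch(glove_embeddings, target_word_index,
                             learned_embeddings, source_word_index):
    # For every word present in both vocabularies, overwrite the
    # GloVe-initialized row with the vector learned by the pretrained model.
    for word, src_idx in source_word_index.items():
        tgt_idx = target_word_index.get(word)
        if tgt_idx is not None and src_idx < len(learned_embeddings):
            glove_embeddings[tgt_idx] = learned_embeddings[src_idx]
    return glove_embeddings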

Now, we are all set to build the transfer learning model. Let's build the IMDB model first and initialize its weights from the pretrained model. We will not train the lower layers of this network with this small amount of data, so we will set trainable=False for them and train only the final layers, with large dropouts:

# get the updated embeddings
initial_embeddings = glove.get_embedding(preprocessor.word_index)

imdb_model = DocumentModel(vocab_size=preprocessor.get_vocab_size(),
                           word_index=preprocessor.word_index,
                           num_sentences=Preprocess.NUM_SENTENCES,
                           embedding_weights=initial_embeddings,
                           conv_activation='tanh',
                           train_embedding=False,
                           learn_word_conv=False,
                           learn_sent_conv=False,
                           hidden_dims=64,
                           input_dropout=0.0,
                           hidden_layer_kernel_regularizer=0.001,
                           final_layer_kernel_regularizer=0.01)

# transfer the word & sentence conv filters and the dense layer weights
for l_name in ['word_conv', 'sentence_conv', 'hidden_0', 'final']:
    imdb_model.get_classification_model().get_layer(l_name).set_weights(
        weights=amazon_review_model.get_classification_model()
                                   .get_layer(l_name).get_weights())
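
With the weights transferred and the lower layers frozen, we compile and train this model like any other Keras binary classifier. The optimizer, batch size, and number of epochs below are illustrative assumptions, not necessarily the exact settings behind the result reported next:

imdb_classifier = imdb_model.get_classification_model()
imdb_classifier.compile(loss='binary_crossentropy',
                        optimizer='rmsprop',
                        metrics=['accuracy'])
# only the layers left trainable (the hidden and final layers) are updated
imdb_classifier.fit(x_train, y_train, batch_size=64, epochs=10,
                    validation_data=(x_test, y_test))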

After training for a few epochs, fine-tuning only the hidden layer and the final sigmoid layer, we get 86% accuracy on the full 25k test set. If we instead train an SVM model on this small dataset and predict on the entire 25k test set, we get only about 82% accuracy. So, transfer learning clearly helps in building a better model even when we have less data.
