Transfer learning using GloVe embeddings

Global Vectors (GloVe) uses global word-word co-occurrence statistics from a large text corpus to arrive at dense vector representations of words. It is an unsupervised learning method whose objective is to make the dot product of the learned word vectors equal to the logarithm of the words' co-occurrence probability. Because the logarithm of a ratio equals the difference of logarithms, ratios of co-occurrence probabilities end up encoded as vector differences in the embedding space.
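
As a brief sketch of that objective, in the notation of the original GloVe paper (not used elsewhere in this chapter), the model fits word vectors w_i, context vectors w̃_k, and bias terms so that:

w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}

(w_i - w_j)^T \tilde{w}_k \approx \log \frac{P(k \mid i)}{P(k \mid j)}   (ignoring the bias terms)

Here X_ik is the number of times word k occurs in the context of word i, and P(k | i) = X_ik / X_i is the corresponding co-occurrence probability.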

For this example, we will use GloVe embeddings that are pre-trained on Twitter data. This corpus consists of around 2 billion tweets with a vocabulary size of 1.2 million words. For the classification task, we use the customer reviews and ratings of Amazon Instant Video. First, we load the reviews data, which is in gzipped JSON format, and convert it to a pandas DataFrame, as shown in the following code:

import gzip
import json
import pandas as pd

json_data = []
with gzip.open('data/reviews_Amazon_Instant_Video_5.json.gz', 'rb') as json_file:
    for json_str in json_file:
        json_data.append(json.loads(json_str))
reviews_df = pd.DataFrame.from_records(json_data)

For our classification task, we are only interested in the reviewText and overall columns, which contain the user review text and rating, respectively. We print a few sample rows to see the ratings and review text, as shown in the following code:

reviews_df[['overall','reviewText']].head(5)

The following is the output of the preceding code:

       overall                                         reviewText
9102       4.0  I enjoy this show because it shows old stuff t...
22551      5.0  I really enjoy these programs. I am a pretty g...
33811      3.0  Decent cast, but it seems a little formulaic...
8127       5.0  My kids love this show. It's one of my 3 year...
17544      5.0  How they keep coming up with the shenanigans t...

The word vectors are then loaded from the GloVe embeddings text file using the following code:

import codecs
import numpy as np

def build_word_vector_matrix(vector_file):
    np_arrays = []
    labels_array = []
    with codecs.open(vector_file, 'r', 'utf-8') as f:
        for i, line in enumerate(f):
            sr = line.split()
            # Skip malformed lines: a valid line has 1 word token plus 25 vector components
            if len(sr) < 26:
                continue
            labels_array.append(sr[0])
            np_arrays.append(np.array([float(j) for j in sr[1:]]))
    return np.array(np_arrays), labels_array
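
The function can then be called on the downloaded GloVe file. The exact path below is an assumption (glove.twitter.27B.25d.txt is the 25-dimensional Twitter file, which matches the len(sr) < 26 check above); the names vocabulary, voc_size, and WD_EMB_SIZE are the ones used by the later snippets:

# Assumed file path; returns the embedding matrix and the list of word tokens
embedding_matrix, vocabulary = build_word_vector_matrix('data/glove.twitter.27B.25d.txt')
# Vocabulary size and embedding dimension are taken from the loaded matrix
voc_size, WD_EMB_SIZE = embedding_matrix.shape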

Each line in the GloVe file consists of a word followed by its vector representation. The vectors are read into the numpy array np_arrays and the corresponding words into labels_array, so build_word_vector_matrix returns both the word-vector matrix and the list of word tokens. As before, we transform the reviews into fixed-length sequences, using the average review length as the sequence length, with TensorFlow's VocabularyProcessor. We fit the processor on the GloVe vocabulary and then transform the reviewText column of the DataFrame, as shown in the following code:

import tensorflow as tf
from sklearn import model_selection
from sklearn.preprocessing import LabelEncoder

lencoder = LabelEncoder()
voc_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(average_review_size)
voc_processor.fit(vocabulary)
X_transform = voc_processor.transform(reviews_df.reviewText)
X_transform = np.array(list(X_transform))
# Encode the ratings (1.0 to 5.0) as integer class labels
y = lencoder.fit_transform(reviews_df.overall.values)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X_transform, y, test_size=0.2, random_state=42)
n_words = len(voc_processor.vocabulary_)
n_classes = len(lencoder.classes_)
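
The sequence length average_review_size is not computed in the preceding snippet; a minimal sketch of one way to obtain it, assuming simple whitespace tokenization, is:

# Assumed computation: mean number of tokens per review, used as the fixed sequence length
average_review_size = int(np.mean([len(str(review).split())
                                   for review in reviews_df.reviewText]))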

The model function for the estimator uses an RNN, as in our previous example, except that we set trainable=False for the word embedding variable Wt. This ensures that the embeddings are not learned again during the model's training. Note that Wt holds the lookup table for the GloVe embeddings, which is fed in through the em_plholder placeholder. The following code shows the function used to create the RNN model:

def rnn_model_fn(features, labels, mode):
    # Placeholder for the pre-trained GloVe matrix; it is fed in when the session is created
    em_plholder = tf.placeholder(tf.float32, [voc_size, WD_EMB_SIZE])
    # trainable=False keeps the GloVe embeddings fixed during training
    Wt = tf.Variable(em_plholder, trainable=False, name='Wt')
    comments_word_vec = tf.nn.embedding_lookup(Wt, features[REVIEW_FT])
    # Unstack the sequence dimension into a list of per-timestep inputs for static_rnn
    comments_wd_l = tf.unstack(comments_word_vec, axis=1)
    rnn_cell = tf.nn.rnn_cell.GRUCell(WD_EMB_SIZE)
    _, comments_encoding = tf.nn.static_rnn(rnn_cell, comments_wd_l, dtype=tf.float32)
    dense = tf.layers.dense(comments_encoding, units=512, activation=tf.nn.relu)
    dropout = tf.layers.dropout(inputs=dense, rate=0.4,
                                training=(mode == tf.estimator.ModeKeys.TRAIN))
    logits = tf.layers.dense(inputs=dropout, units=n_classes)
    return get_estimator_spec(input_logits=logits, out_lb=labels, train_predict_m=mode,
                              embedding_placeholder=em_plholder)
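
The get_estimator_spec helper comes from the previous example and is not reproduced in this section. A minimal sketch of what such a function could look like, assuming the GloVe matrix embedding_matrix is fed into the placeholder through a tf.train.Scaffold and that Adam with an exponentially decaying learning rate is used (these choices are assumptions for illustration, not the book's exact implementation):

def get_estimator_spec(input_logits, out_lb, train_predict_m, embedding_placeholder):
    predictions = {'class': tf.argmax(input_logits, axis=1),
                   'prob': tf.nn.softmax(input_logits)}
    # Assumed mechanism: the scaffold feeds the pre-trained GloVe matrix into the
    # placeholder once, when the session is created
    scaffold = tf.train.Scaffold(init_feed_dict={embedding_placeholder: embedding_matrix})
    if train_predict_m == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode=train_predict_m, predictions=predictions,
                                          scaffold=scaffold)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=out_lb, logits=input_logits)
    if train_predict_m == tf.estimator.ModeKeys.TRAIN:
        # Exponentially decaying learning rate (assumed hyperparameters)
        learning_rate = tf.train.exponential_decay(0.001, tf.train.get_global_step(),
                                                   decay_steps=100, decay_rate=0.96)
        train_op = tf.train.AdamOptimizer(learning_rate).minimize(
            loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode=train_predict_m, loss=loss,
                                          train_op=train_op, scaffold=scaffold)
    eval_metric_ops = {'accuracy': tf.metrics.accuracy(labels=out_lb,
                                                       predictions=predictions['class'])}
    return tf.estimator.EstimatorSpec(mode=train_predict_m, loss=loss,
                                      eval_metric_ops=eval_metric_ops, scaffold=scaffold)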

As before, we can view the model graph and the training progress in TensorBoard. We can see the steady decrease in the learning rate as the step count increases:

Training loss and steps/sec

The following graph shows the network with word_embedding input, RNN layer, FC layer, and finally the softmax output:

Model with GloVe embeddings
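
To produce these graphs, the estimator has to be constructed from rnn_model_fn and trained; a brief sketch, in which the model directory, batch size, and number of steps are assumptions, might look like the following:

# Assumed model directory and training settings for illustration
estimator = tf.estimator.Estimator(model_fn=rnn_model_fn, model_dir='glove_rnn_model')
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={REVIEW_FT: X_train}, y=y_train, batch_size=128, num_epochs=None, shuffle=True)
estimator.train(input_fn=train_input_fn, steps=1000)
# TensorBoard reads the summaries written to model_dir:
#   tensorboard --logdir glove_rnn_model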