Vectorizing the data

The final stage of preprocessing is to vectorize (or quantize) our dialogs and candidate responses. This means converting each word or token into an integer, so that every sequence of words becomes a sequence of integers, one per word.
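For instance, given a small vocabulary-to-index mapping, a tokenized sentence becomes a list of integer IDs. The word_idx shown here is purely illustrative; the real one is the vocabulary mapping built earlier from the data:

word_idx = {'hello': 1, 'can': 2, 'i': 3, 'book': 4, 'a': 5, 'table': 6}  # hypothetical vocabulary
sentence = ['hello', 'can', 'i', 'book', 'a', 'table']
vector = [word_idx.get(w, 0) for w in sentence]  # unknown words map to 0
print(vector)  # [1, 2, 3, 4, 5, 6]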

We will first write a method to vectorize the candidate texts. We also have to keep in mind a fixed word length (sentence_size) for each vectorized candidate; hence, we pad with 0s (which correspond to empty words) any candidate vector whose length is less than the required sentence size:

import tensorflow as tf

def vectorize_candidates(candidates, word_idx, sentence_size):
    # Determine shape of the final tensor
    shape = (len(candidates), sentence_size)
    candidates_vector = []
    for i, candidate in enumerate(candidates):
        # Determine zero padding
        zero_padding = max(0, sentence_size - len(candidate))
        # Append to final vector
        candidates_vector.append(
            [word_idx[w] if w in word_idx else 0 for w in candidate]
            + [0] * zero_padding)
    # Return as TensorFlow constant
    return tf.constant(candidates_vector, shape=shape)
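As a quick sanity check, here is how the method could be called on a couple of pre-tokenized candidates. The candidates and word_idx shown are illustrative, not taken from the dataset:

word_idx = {'what': 1, 'time': 2, 'ok': 3, 'thanks': 4}  # hypothetical vocabulary
candidates = [['what', 'time'], ['ok', 'thanks']]        # pre-tokenized candidate responses
vec = vectorize_candidates(candidates, word_idx, sentence_size=4)
# vec is a constant tensor of shape (2, 4):
# [[1, 2, 0, 0],
#  [3, 4, 0, 0]]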

Next, we will write a method to vectorize our dialog data in a similar manner. Another important aspect to take care of is padding the facts vector of each data sample with empty memories (vectors of 0s of length sentence_size) up to a fixed memory size:

import numpy as np

def vectorize_data(data, word_idx, sentence_size, batch_size, max_memory_size):
    facts_vector = []
    questions_vector = []
    answers_vector = []
    # Sort data in descending order by number of facts
    data.sort(key=lambda x: len(x[0]), reverse=True)
    for i, (fact, question, answer) in enumerate(data):
        # Find memory size once per batch
        if i % batch_size == 0:
            memory_size = max(1, min(max_memory_size, len(fact)))
        # Build fact vector
        fact_vector = []
        for sentence in fact:
            fact_padding = max(0, sentence_size - len(sentence))
            fact_vector.append(
                [word_idx[w] if w in word_idx else 0 for w in sentence]
                + [0] * fact_padding)
        # Keep the most recent sentences that fit in memory
        fact_vector = fact_vector[::-1][:memory_size][::-1]
        # Pad to memory_size with empty memories
        memory_padding = max(0, memory_size - len(fact_vector))
        for _ in range(memory_padding):
            fact_vector.append([0] * sentence_size)
        # Build question vector
        question_padding = max(0, sentence_size - len(question))
        question_vector = ([word_idx[w] if w in word_idx else 0
                            for w in question]
                           + [0] * question_padding)
        # Append to final vectors
        facts_vector.append(np.array(fact_vector))
        questions_vector.append(np.array(question_vector))
        # Answer is already an integer corresponding to a candidate
        answers_vector.append(np.array(answer))
    return facts_vector, questions_vector, answers_vector
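To make the expected input format concrete, here is an illustrative call on a single tiny dialog sample. The tokens and word_idx are hypothetical; in practice, the real vocabulary and dialog data built earlier are used:

word_idx = {'hi': 1, 'hello': 2, 'book': 3, 'a': 4, 'table': 5}  # hypothetical vocabulary
# Each sample is (facts, question, answer candidate index)
data = [([['hi'], ['hello']], ['book', 'a', 'table'], 7)]
facts, questions, answers = vectorize_data(
    data, word_idx, sentence_size=3, batch_size=1, max_memory_size=10)
# facts[0]     -> array of shape (2, 3): [[1, 0, 0], [2, 0, 0]]
# questions[0] -> array of shape (3,):   [3, 4, 5]
# answers[0]   -> array(7)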

We emphasize knowing these dimensions beforehand because we will be feeding these vectors to the TensorFlow model, which needs to know the sizes of its inputs in order to construct the model graph.
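For example, in a TensorFlow 1.x style graph, the fixed memory_size and sentence_size would typically appear in the shapes of the input placeholders. This is only a sketch under that assumption; the placeholder names and sizes here are illustrative, not taken from the model code that follows:

import tensorflow as tf

memory_size = 50     # fixed number of memories per sample (illustrative)
sentence_size = 20   # fixed number of words per sentence (illustrative)

facts_ph = tf.placeholder(tf.int32, [None, memory_size, sentence_size], name="facts")
questions_ph = tf.placeholder(tf.int32, [None, sentence_size], name="questions")
answers_ph = tf.placeholder(tf.int32, [None], name="answers")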
