Vectorizing the data

The final stage of preprocessing is to vectorize (or quantize) our dialogs and candidate responses. This means converting each word or token into an integer, so that every sequence of words becomes a sequence of integers, one per word.
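For instance, given a small vocabulary-to-index mapping, a tokenized sentence becomes a list of integer IDs. The word_idx shown here is purely illustrative; the real one is the vocabulary mapping built earlier from the data:

word_idx = {'hello': 1, 'can': 2, 'i': 3, 'book': 4, 'a': 5, 'table': 6}  # hypothetical vocabulary
sentence = ['hello', 'can', 'i', 'book', 'a', 'table']
vector = [word_idx.get(w, 0) for w in sentence]  # unknown words map to 0
print(vector)  # [1, 2, 3, 4, 5, 6]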

We will first write a method to vectorize the candidate texts. We also have to keep in mind a fixed word length (sentence_size) for each vectorized candidate; hence, we pad with 0s (which correspond to empty words) any candidate vector whose length is less than the required sentence size:

import tensorflow as tf

def vectorize_candidates(candidates, word_idx, sentence_size):
    # Determine shape of the final tensor
    shape = (len(candidates), sentence_size)
    candidates_vector = []
    for i, candidate in enumerate(candidates):
        # Determine zero padding
        zero_padding = max(0, sentence_size - len(candidate))
        # Append to final vector
        candidates_vector.append(
            [word_idx[w] if w in word_idx else 0 for w in candidate]
            + [0] * zero_padding)
    # Return as TensorFlow constant
    return tf.constant(candidates_vector, shape=shape)
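As a quick sanity check, here is how the method could be called on a couple of pre-tokenized candidates. The candidates and word_idx shown are illustrative, not taken from the dataset:

word_idx = {'what': 1, 'time': 2, 'ok': 3, 'thanks': 4}  # hypothetical vocabulary
candidates = [['what', 'time'], ['ok', 'thanks']]        # pre-tokenized candidate responses
vec = vectorize_candidates(candidates, word_idx, sentence_size=4)
# vec is a constant tensor of shape (2, 4):
# [[1, 2, 0, 0],
#  [3, 4, 0, 0]]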

Next, we will write a method to vectorize our dialog data in a similar manner. Another important aspect to take care of is padding the facts vector of each data sample with empty memories (vectors of 0s of length sentence_size) up to a fixed memory size:

import numpy as np

def vectorize_data(data, word_idx, sentence_size, batch_size, max_memory_size):
    facts_vector = []
    questions_vector = []
    answers_vector = []
    # Sort data in descending order by number of facts
    data.sort(key=lambda x: len(x[0]), reverse=True)
    for i, (fact, question, answer) in enumerate(data):
        # Find memory size once per batch
        if i % batch_size == 0:
            memory_size = max(1, min(max_memory_size, len(fact)))
        # Build fact vector
        fact_vector = []
        for sentence in fact:
            fact_padding = max(0, sentence_size - len(sentence))
            fact_vector.append(
                [word_idx[w] if w in word_idx else 0 for w in sentence]
                + [0] * fact_padding)
        # Keep the most recent sentences that fit in memory
        fact_vector = fact_vector[::-1][:memory_size][::-1]
        # Pad to memory_size with empty memories
        memory_padding = max(0, memory_size - len(fact_vector))
        for _ in range(memory_padding):
            fact_vector.append([0] * sentence_size)
        # Build question vector
        question_padding = max(0, sentence_size - len(question))
        question_vector = ([word_idx[w] if w in word_idx else 0
                            for w in question]
                           + [0] * question_padding)
        # Append to final vectors
        facts_vector.append(np.array(fact_vector))
        questions_vector.append(np.array(question_vector))
        # Answer is already an integer corresponding to a candidate
        answers_vector.append(np.array(answer))
    return facts_vector, questions_vector, answers_vector
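To make the expected input format concrete, here is an illustrative call on a single tiny dialog sample. The tokens and word_idx are hypothetical; in practice, the real vocabulary and dialog data built earlier are used:

word_idx = {'hi': 1, 'hello': 2, 'book': 3, 'a': 4, 'table': 5}  # hypothetical vocabulary
# Each sample is (facts, question, answer candidate index)
data = [([['hi'], ['hello']], ['book', 'a', 'table'], 7)]
facts, questions, answers = vectorize_data(
    data, word_idx, sentence_size=3, batch_size=1, max_memory_size=10)
# facts[0]     -> array of shape (2, 3): [[1, 0, 0], [2, 0, 0]]
# questions[0] -> array of shape (3,):   [3, 4, 5]
# answers[0]   -> array(7)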

We emphasize knowing these dimensions beforehand because we will be feeding these vectors to the TensorFlow model, which needs to know the sizes of its inputs in order to construct the model graph.
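For example, in a TensorFlow 1.x style graph, the fixed memory_size and sentence_size would typically appear in the shapes of the input placeholders. This is only a sketch under that assumption; the placeholder names and sizes here are illustrative, not taken from the model code that follows:

import tensorflow as tf

memory_size = 50     # fixed number of memories per sample (illustrative)
sentence_size = 20   # fixed number of words per sentence (illustrative)

facts_ph = tf.placeholder(tf.int32, [None, memory_size, sentence_size], name="facts")
questions_ph = tf.placeholder(tf.int32, [None, sentence_size], name="questions")
answers_ph = tf.placeholder(tf.int32, [None], name="answers")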
