Building a vocabulary for word embedding lookup

We want to create word embeddings for each of the words in our facts, candidates, and questions. To do this, we first read the data and candidates to count the distinct words we need embeddings for and to find the maximum sentence lengths. This information is then passed to the memory network model to initialize the embedding matrices and input placeholders:

    # Module-level imports required by this method
    from functools import reduce
    from itertools import chain

    def build_vocab(self, data, candidates):
        # Build the word vocabulary set from all data and candidate words
        vocab = reduce(lambda x1, x2: x1 | x2,
                       (set(list(chain.from_iterable(facts)) + questions)
                        for facts, questions, answers in data))
        vocab |= reduce(lambda x1, x2: x1 | x2,
                        (set(candidate) for candidate in candidates))
        vocab = sorted(vocab)
        # Assign integer indices to each word; index 0 is reserved
        # for the null (padding) word
        self.word_idx = dict((word, idx + 1) for idx, word in enumerate(vocab))
        # Compute various data size numbers
        max_facts_size = max(map(len, (facts for facts, _, _ in data)))
        self.sentence_size = max(
            map(len, chain.from_iterable(facts for facts, _, _ in data)))
        self.candidate_sentence_size = max(map(len, candidates))
        question_size = max(map(len, (questions for _, questions, _ in data)))
        self.memory_size = min(self.memory_size, max_facts_size)
        self.vocab_size = len(self.word_idx) + 1  # +1 for null word
        self.sentence_size = max(question_size, self.sentence_size)
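
To see what this computes, here is a quick sanity check. This is a sketch rather than code from the model itself: it assumes build_vocab is reachable as a plain module-level function (so we can pass a bare namespace object in place of self), and the toy dataset follows the (facts, question, answer) format used above, with facts as a list of tokenized sentences and the question as a single tokenized sentence:

    from types import SimpleNamespace

    # Toy example in the assumed (facts, question, answer) format
    data = [
        ([['hello', 'there'], ['how', 'can', 'i', 'help', 'you']],  # facts
         ['book', 'a', 'table'],                                    # question
         42),                                                       # answer id
    ]
    candidates = [['sure', 'for', 'how', 'many'], ['goodbye']]

    # Bare stand-in for the model; only memory_size must exist beforehand
    model = SimpleNamespace(memory_size=50)
    build_vocab(model, data, candidates)  # called as a plain function

    print(model.vocab_size)               # 15: 14 distinct words + 1 null word
    print(model.sentence_size)            # 5: 'how can i help you'
    print(model.candidate_sentence_size)  # 4: 'sure for how many'
    print(model.memory_size)              # 2: min(50, most facts per example)

Note how memory_size is clipped to the largest number of facts actually seen, so the model never allocates memory slots it cannot fill.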
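
As for how these sizes are consumed downstream, the following sketch shows one plausible way the model could use them. It assumes a TensorFlow 1.x-style graph (matching the "placeholders" wording above) and an illustrative embedding_dim hyperparameter; the variable names are ours, not the model's:

    import tensorflow as tf

    vocab_size, sentence_size, memory_size = 15, 5, 2  # e.g. from the toy run
    embedding_dim = 32  # assumed hyperparameter

    # Input placeholders sized by the computed maxima (batch dim left open)
    facts_ph = tf.placeholder(tf.int32, [None, memory_size, sentence_size])
    question_ph = tf.placeholder(tf.int32, [None, sentence_size])

    # One embedding row per word; row 0 is the all-zero null (padding) word
    null_row = tf.zeros([1, embedding_dim])
    word_rows = tf.random_normal([vocab_size - 1, embedding_dim], stddev=0.1)
    A = tf.Variable(tf.concat([null_row, word_rows], axis=0), name='A')

    # Bag-of-words sentence vectors: embed each word, then sum over words
    fact_vecs = tf.reduce_sum(tf.nn.embedding_lookup(A, facts_ph), axis=2)
    question_vec = tf.reduce_sum(tf.nn.embedding_lookup(A, question_ph), axis=1)

Padding positions index row 0 of A, which is why index 0 was reserved for the null word in build_vocab; in practice that row is typically kept at zero during training so padding contributes nothing to the sentence sums.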