Building a vocabulary for word embedding lookup

We want to create word embeddings for each of the words in our facts, candidates, and questions. To do this, we first read the data and candidates to count the distinct words we need embeddings for and to find the maximum sentence lengths. This information is then passed to the memory network model to initialize the embedding matrices and input placeholders:

    # Module-level imports required by this method
    from functools import reduce
    from itertools import chain

    def build_vocab(self, data, candidates):
        # Build the word vocabulary set from all data and candidate words
        vocab = reduce(lambda x1, x2: x1 | x2,
                       (set(list(chain.from_iterable(facts)) + questions)
                        for facts, questions, answers in data))
        vocab |= reduce(lambda x1, x2: x1 | x2,
                        (set(candidate) for candidate in candidates))
        vocab = sorted(vocab)
        # Assign integer indices to each word; index 0 is reserved
        # for the null (padding) word
        self.word_idx = dict((word, idx + 1) for idx, word in enumerate(vocab))
        # Compute various data size numbers
        max_facts_size = max(map(len, (facts for facts, _, _ in data)))
        self.sentence_size = max(
            map(len, chain.from_iterable(facts for facts, _, _ in data)))
        self.candidate_sentence_size = max(map(len, candidates))
        question_size = max(map(len, (questions for _, questions, _ in data)))
        self.memory_size = min(self.memory_size, max_facts_size)
        self.vocab_size = len(self.word_idx) + 1  # +1 for null word
        self.sentence_size = max(question_size, self.sentence_size)
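
To see what this computes, here is a quick sanity check. This is a sketch rather than code from the model itself: it assumes build_vocab is reachable as a plain module-level function (so we can pass a bare namespace object in place of self), and the toy dataset follows the (facts, question, answer) format used above, with facts as a list of tokenized sentences and the question as a single tokenized sentence:

    from types import SimpleNamespace

    # Toy example in the assumed (facts, question, answer) format
    data = [
        ([['hello', 'there'], ['how', 'can', 'i', 'help', 'you']],  # facts
         ['book', 'a', 'table'],                                    # question
         42),                                                       # answer id
    ]
    candidates = [['sure', 'for', 'how', 'many'], ['goodbye']]

    # Bare stand-in for the model; only memory_size must exist beforehand
    model = SimpleNamespace(memory_size=50)
    build_vocab(model, data, candidates)  # called as a plain function

    print(model.vocab_size)               # 15: 14 distinct words + 1 null word
    print(model.sentence_size)            # 5: 'how can i help you'
    print(model.candidate_sentence_size)  # 4: 'sure for how many'
    print(model.memory_size)              # 2: min(50, most facts per example)

Note how memory_size is clipped to the largest number of facts actually seen, so the model never allocates memory slots it cannot fill.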
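
As for how these sizes are consumed downstream, the following sketch shows one plausible way the model could use them. It assumes a TensorFlow 1.x-style graph (matching the "placeholders" wording above) and an illustrative embedding_dim hyperparameter; the variable names are ours, not the model's:

    import tensorflow as tf

    vocab_size, sentence_size, memory_size = 15, 5, 2  # e.g. from the toy run
    embedding_dim = 32  # assumed hyperparameter

    # Input placeholders sized by the computed maxima (batch dim left open)
    facts_ph = tf.placeholder(tf.int32, [None, memory_size, sentence_size])
    question_ph = tf.placeholder(tf.int32, [None, sentence_size])

    # One embedding row per word; row 0 is the all-zero null (padding) word
    null_row = tf.zeros([1, embedding_dim])
    word_rows = tf.random_normal([vocab_size - 1, embedding_dim], stddev=0.1)
    A = tf.Variable(tf.concat([null_row, word_rows], axis=0), name='A')

    # Bag-of-words sentence vectors: embed each word, then sum over words
    fact_vecs = tf.reduce_sum(tf.nn.embedding_lookup(A, facts_ph), axis=2)
    question_vec = tf.reduce_sum(tf.nn.embedding_lookup(A, question_ph), axis=1)

Padding positions index row 0 of A, which is why index 0 was reserved for the null word in build_vocab; in practice that row is typically kept at zero during training so padding contributes nothing to the sentence sums.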