Memory module  

The magic of memory network models lies in their formulation of the memory module, which performs a soft attention mechanism over the fact embeddings. The literature on memory networks and other attention-based models introduces many different types of attention mechanisms, but all of them hinge on an element-wise multiplication of two vectors followed by a summation as the operation that measures semantic or syntactic similarity. We will call this the reduce-dot operation: it receives two vectors and produces a single number denoting a similarity score.
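
As a minimal sketch of this operation (the vector values here are illustrative only), the reduce-dot of two vectors can be written in TensorFlow as an element-wise multiplication followed by tf.reduce_sum:

    import tensorflow as tf

    # Illustrative vectors; in the model these would be a fact embedding
    # and the current context vector
    fact = tf.constant([0.2, 0.5, 0.1])
    context = tf.constant([0.4, 0.3, 0.9])

    # reduce-dot: element-wise multiplication followed by a summation,
    # yielding a single similarity score
    similarity = tf.reduce_sum(fact * context)

    with tf.Session() as sess:
        print(sess.run(similarity))  # 0.2*0.4 + 0.5*0.3 + 0.1*0.9 = 0.32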

We have formulated our attention mechanism as follows:

  1. The context vector is used to encode all the information required to produce an output and is initialized as the question vector
  2. The reduce-dot operation between each of the fact vectors and the context vector gives us similarity scores for each of the fact vectors
  3. We then take a softmax over these similarity scores to normalize them into probability values between 0 and 1
  4. For each fact vector, we then multiply each element of the vector by the similarity probability value for that fact
  5. Finally, we take an element-wise sum of these weighted fact vectors to get a context representation where certain facts have higher importance than others
  6. The context vector is then updated by element-wise adding this context representation to it
  7. The updated context vector is used to attend to the fact vectors and is subsequently updated further using multiple passes over the facts, termed hops (a small numeric sketch of these steps follows this list)
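
As a small worked example of the steps above, consider a toy NumPy sketch with two facts and three-dimensional embeddings (the numbers are illustrative assumptions, and the update in step 6 is shown without the learned transformation used in the actual model code):

    import numpy as np

    facts = np.array([[0.1, 0.4, 0.2],    # fact 1 embedding
                      [0.6, 0.1, 0.3]])   # fact 2 embedding
    context = np.array([0.5, 0.2, 0.7])   # step 1: the question vector

    # Step 2: reduce-dot between each fact vector and the context vector
    scores = (facts * context).sum(axis=1)           # [0.27, 0.53]
    # Step 3: softmax normalizes the scores into probabilities
    probs = np.exp(scores) / np.exp(scores).sum()    # approx. [0.44, 0.56]
    # Steps 4 and 5: weight each fact by its probability and sum element-wise
    context_rep = (facts * probs[:, None]).sum(axis=0)
    # Step 6: update the context vector by element-wise addition
    context = context + context_rep
    # Step 7: steps 2-6 repeat for a fixed number of hops using the
    # updated context vector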

The steps can be understood through an expanded view of the memory module.

We must use NumPy-like atomic operations in TensorFlow to write our memory module, as there is no high-level wrapper in the API for performing reduce-dot operations:

    def _memory_module(self, questions_emb, facts_emb):
        with tf.variable_scope("MemoryModule"):
            # The context vector is initialized as the question embedding
            initial_context_vector = questions_emb
            context_vectors = [initial_context_vector]
            # Multi-hop attention over facts to update context vector
            for hop in range(self._hops):
                # Perform reduce_dot: broadcast the context vector against
                # every fact vector, multiply element-wise, and sum over
                # the embedding dimension to get one score per fact
                context_temp = tf.transpose(
                    tf.expand_dims(context_vectors[-1], -1), [0, 2, 1])
                similarity_scores = tf.reduce_sum(
                    facts_emb * context_temp, 2)
                # Calculate similarity probabilities
                probs = tf.nn.softmax(similarity_scores)
                # Perform attention multiplication: weight each fact vector
                # by its probability and sum over the facts
                probs_temp = tf.transpose(tf.expand_dims(probs, -1),
                                          [0, 2, 1])
                facts_temp = tf.transpose(facts_emb, [0, 2, 1])
                context_rep = tf.reduce_sum(facts_temp * probs_temp, 2)
                # Update context vector: transform the previous context and
                # add the attended fact representation
                context_vector = tf.matmul(
                    context_vectors[-1],
                    self.transformation_matrix) + context_rep
                # Append to context vector list to use in next hop
                context_vectors.append(context_vector)
            # Return context vector for last hop
            return context_vector
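
The broadcasting in the preceding code assumes particular tensor shapes. The following is a minimal wiring sketch; the placeholder shapes, memory size, embedding size, variable name, and initializer are assumptions for illustration rather than values taken from the rest of the model:

    import tensorflow as tf

    # Assumed shapes:
    # questions_emb: [batch_size, embedding_size]
    # facts_emb:     [batch_size, memory_size, embedding_size]
    embedding_size, memory_size = 64, 20

    questions_emb = tf.placeholder(tf.float32, [None, embedding_size])
    facts_emb = tf.placeholder(tf.float32,
                               [None, memory_size, embedding_size])

    # The transformation matrix applied to the previous context vector at
    # each hop must be square in the embedding dimension so that the
    # transformed context can be added to the attended fact representation
    transformation_matrix = tf.get_variable(
        "H", [embedding_size, embedding_size],
        initializer=tf.random_normal_initializer(stddev=0.1))

    # Calling the module (as a method of the model class) then returns a
    # context tensor of shape [batch_size, embedding_size]:
    # context = self._memory_module(questions_emb, facts_emb)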

Each hop may attend to different aspects of the facts, and using such a multi-hop attention mechanism leads to richer context vectors. The mechanism allows models to understand and reason over the facts in a step-by-step manner, as we can see by visualizing the values of the similarity probabilities at each hop.
