Memory module  

The magic of memory network models lies in their formulation of the memory module, which performs a soft attention mechanism over the fact embeddings. The literature on memory networks and other attention-based models introduces many different types of attention mechanisms, but all of them hinge on an element-wise multiplication of two vectors followed by a summation as the operation that measures semantic or syntactic similarity. We will call this the reduce-dot operation: it receives two vectors and produces a single number denoting a similarity score.
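
As a minimal sketch of this operation (the vector values here are illustrative only), the reduce-dot of two vectors can be written in TensorFlow as an element-wise multiplication followed by tf.reduce_sum:

    import tensorflow as tf

    # Illustrative vectors; in the model these would be a fact embedding
    # and the current context vector
    fact = tf.constant([0.2, 0.5, 0.1])
    context = tf.constant([0.4, 0.3, 0.9])

    # reduce-dot: element-wise multiplication followed by a summation,
    # yielding a single similarity score
    similarity = tf.reduce_sum(fact * context)

    with tf.Session() as sess:
        print(sess.run(similarity))  # 0.2*0.4 + 0.5*0.3 + 0.1*0.9 = 0.32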

We have formulated our attention mechanism as follows:

  1. The context vector is used to encode all the information required to produce an output and is initialized as the question vector
  2. The reduce-dot operation between each of the fact vectors and the context vector gives us similarity scores for each of the fact vectors
  3. We then take a softmax over these similarity scores to normalize them into probability values between 0 and 1
  4. For each fact vector, we then multiply each element of the vector by the similarity probability value for that fact
  5. Finally, we take an element-wise sum of these weighted fact vectors to get a context representation where certain facts have higher importance than others
  6. The context vector is then updated by element-wise adding this context representation to it
  7. The updated context vector is used to attend to the fact vectors and is subsequently updated further using multiple passes over the facts, termed hops (a small numeric sketch of these steps follows this list)
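
As a small worked example of the steps above, consider a toy NumPy sketch with two facts and three-dimensional embeddings (the numbers are illustrative assumptions, and the update in step 6 is shown without the learned transformation used in the actual model code):

    import numpy as np

    facts = np.array([[0.1, 0.4, 0.2],    # fact 1 embedding
                      [0.6, 0.1, 0.3]])   # fact 2 embedding
    context = np.array([0.5, 0.2, 0.7])   # step 1: the question vector

    # Step 2: reduce-dot between each fact vector and the context vector
    scores = (facts * context).sum(axis=1)           # [0.27, 0.53]
    # Step 3: softmax normalizes the scores into probabilities
    probs = np.exp(scores) / np.exp(scores).sum()    # approx. [0.44, 0.56]
    # Steps 4 and 5: weight each fact by its probability and sum element-wise
    context_rep = (facts * probs[:, None]).sum(axis=0)
    # Step 6: update the context vector by element-wise addition
    context = context + context_rep
    # Step 7: steps 2-6 repeat for a fixed number of hops using the
    # updated context vector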

The steps can be understood through an expanded view of the memory module.

We must use NumPy-like atomic operations in TensorFlow to write our memory module, as there is no high-level wrapper in the API for performing reduce-dot operations:

    def _memory_module(self, questions_emb, facts_emb):
        with tf.variable_scope("MemoryModule"):
            # The context vector is initialized as the question embedding
            initial_context_vector = questions_emb
            context_vectors = [initial_context_vector]
            # Multi-hop attention over facts to update context vector
            for hop in range(self._hops):
                # Perform reduce_dot: broadcast the context vector against
                # every fact vector, multiply element-wise, and sum over
                # the embedding dimension to get one score per fact
                context_temp = tf.transpose(
                    tf.expand_dims(context_vectors[-1], -1), [0, 2, 1])
                similarity_scores = tf.reduce_sum(
                    facts_emb * context_temp, 2)
                # Calculate similarity probabilities
                probs = tf.nn.softmax(similarity_scores)
                # Perform attention multiplication: weight each fact vector
                # by its probability and sum over the facts
                probs_temp = tf.transpose(tf.expand_dims(probs, -1),
                                          [0, 2, 1])
                facts_temp = tf.transpose(facts_emb, [0, 2, 1])
                context_rep = tf.reduce_sum(facts_temp * probs_temp, 2)
                # Update context vector: transform the previous context and
                # add the attended fact representation
                context_vector = tf.matmul(
                    context_vectors[-1],
                    self.transformation_matrix) + context_rep
                # Append to context vector list to use in next hop
                context_vectors.append(context_vector)
            # Return context vector for last hop
            return context_vector
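
The broadcasting in the preceding code assumes particular tensor shapes. The following is a minimal wiring sketch; the placeholder shapes, memory size, embedding size, variable name, and initializer are assumptions for illustration rather than values taken from the rest of the model:

    import tensorflow as tf

    # Assumed shapes:
    # questions_emb: [batch_size, embedding_size]
    # facts_emb:     [batch_size, memory_size, embedding_size]
    embedding_size, memory_size = 64, 20

    questions_emb = tf.placeholder(tf.float32, [None, embedding_size])
    facts_emb = tf.placeholder(tf.float32,
                               [None, memory_size, embedding_size])

    # The transformation matrix applied to the previous context vector at
    # each hop must be square in the embedding dimension so that the
    # transformed context can be added to the attended fact representation
    transformation_matrix = tf.get_variable(
        "H", [embedding_size, embedding_size],
        initializer=tf.random_normal_initializer(stddev=0.1))

    # Calling the module (as a method of the model class) then returns a
    # context tensor of shape [batch_size, embedding_size]:
    # context = self._memory_module(questions_emb, facts_emb)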

Each hop may attend to different aspects of the facts, and using such a multi-hop attention mechanism leads to richer context vectors. The mechanism allows models to understand and reason over the facts in a step-by-step manner, as we can see by visualizing the values of the similarity probabilities at each hop.
