Selective attention

The controller's output determines which memory locations to read from or write to. This is expressed as a set of weights spread over all memory locations, which add up to 1. The weights are produced by the following two mechanisms; the idea is to give the controller several different modes of reading from or writing to memory, corresponding to different data structures:

  • Content-based: The key vector k emitted by the controller is compared against every memory location using a similarity measure, say, cosine similarity (S), and the similarities are then normalized by a softmax to get weights that add up to 1:

$$ w_t^c(i) = \frac{\exp\big(\beta\, S(k, M_t(i))\big)}{\sum_j \exp\big(\beta\, S(k, M_t(j))\big)} $$

Here, β ≥ 1 is called the sharpness parameter, and it controls how tightly the weighting focuses on a particular location; it gives the network a way to decide how precise it wants memory access to be, much like the fuzziness coefficient in fuzzy c-means clustering (see the code sketch after this list).

  • Location-based: The location-based addressing mechanism is designed for simple iteration across memory locations. For example, if the current weighting focuses entirely on a single location, a rotation of 1 shifts the focus to the next location, while a negative shift moves the weighting in the opposite direction. The controller outputs a shift kernel, s (say, a softmax over the integer shifts [-n, n]), which is circularly convolved with the previously calculated memory weighting to produce the shifted weighting:

$$ \tilde{w}_t(i) = \sum_{j} w_t(j)\, s(i - j) $$

Here, the index i - j is taken modulo N, the number of memory locations, so the shift wraps around the borders of memory. Visualized as a heat map over memory, darker shades represent larger weights.
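To make the content-based mechanism from the first bullet concrete, here is a minimal NumPy sketch. It is illustrative rather than the book's code: the function name, the array shapes, and the use of cosine similarity for S are assumptions. It computes the similarity between the key and every memory row, scales by β, and normalizes with a softmax:

```python
import numpy as np

def content_addressing(memory, key, beta):
    """Content-based addressing (illustrative sketch).

    memory : (N, M) array -- N locations, each a vector of width M
    key    : (M,) key vector k emitted by the controller
    beta   : scalar sharpness parameter
    Returns a weight vector of length N that sums to 1.
    """
    eps = 1e-8  # guard against division by zero for all-zero rows
    # Cosine similarity S(k, M(i)) between the key and every memory row
    sims = memory @ key / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps
    )
    # Sharpened softmax: a larger beta focuses the weighting more tightly
    scores = beta * sims - np.max(beta * sims)  # shift for numerical stability
    w = np.exp(scores)
    return w / w.sum()
```

With β close to 1, the resulting weighting stays diffuse across locations; raising β concentrates nearly all of the weight on the best-matching location, which is exactly the precision trade-off described above.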

Before applying the rotational shift, the weight vector given by content addressing, w_t^c, is combined with the previous weight vector, w_{t-1}, as follows:

$$ w_t^g = g_t\, w_t^c + (1 - g_t)\, w_{t-1} $$

Here, g_t is a scalar interpolation gate in the range (0, 1), emitted by the controller head. If g_t = 1, the weighting from the previous iteration is ignored.
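Continuing the sketch (and reusing the numpy import and content_addressing function from the previous block), the two remaining steps can be written as follows, applied in the order just described: first interpolate with the previous weighting, then circularly shift. The names and shapes are again illustrative assumptions:

```python
def interpolate(w_content, w_prev, g):
    """Blend the content weighting with the previous step's weighting.

    g is the scalar interpolation gate; g = 1 ignores w_prev entirely.
    """
    return g * w_content + (1.0 - g) * w_prev

def circular_shift(w, shift_kernel):
    """Circularly convolve the weighting w with the shift kernel s.

    shift_kernel holds probabilities for the shifts -n..n (summing to 1);
    the rotation wraps around the borders of memory.
    """
    n = (len(shift_kernel) - 1) // 2
    shifted = np.zeros_like(w)
    for offset, prob in zip(range(-n, n + 1), shift_kernel):
        # np.roll(w, 1) moves the focus to the next memory location
        shifted += prob * np.roll(w, offset)
    return shifted

# Example: focus sharply on location 2, ignore the previous weighting
# (g = 1), then shift by exactly +1 -- the focus lands on location 3.
w_c = content_addressing(np.eye(5), np.eye(5)[2], beta=50.0)
w_g = interpolate(w_c, w_prev=np.full(5, 0.2), g=1.0)
print(np.argmax(circular_shift(w_g, shift_kernel=np.array([0.0, 0.0, 1.0]))))  # -> 3
```

The example's shift kernel puts all of its probability on a shift of +1; a softer kernel (say, [0.1, 0.8, 0.1]) would blur the focus slightly across neighboring locations while still moving it forward on average.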
