The attention-based decoder

Here, the decoder uses an attention mechanism over the encoder's output to produce mel-spectrogram frames.

For each mini-batch, the decoder is given a GO frame, which contains only zeros, as its first input. Then, at every subsequent time-step, the previously predicted mel-spectrogram frame is used as input. This input is passed through a pre-net with exactly the same architecture as the encoder's pre-net.
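The following is a minimal PyTorch sketch of this feedback path, assuming a two-layer fully connected pre-net with ReLU and dropout; the layer sizes, dropout rate, and tensor shapes are illustrative assumptions, not values taken from the text:

```python
import torch
import torch.nn as nn

class PreNet(nn.Module):
    """Sketch of the decoder pre-net: fully connected layers with ReLU and
    dropout, mirroring the encoder's pre-net. Sizes are illustrative."""
    def __init__(self, in_dim=80, sizes=(256, 128), dropout=0.5):
        super().__init__()
        dims = [in_dim] + list(sizes)
        self.layers = nn.ModuleList(
            nn.Linear(d_in, d_out) for d_in, d_out in zip(dims[:-1], dims[1:])
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        for layer in self.layers:
            x = self.dropout(torch.relu(layer(x)))
        return x

# The GO frame is simply an all-zero mel frame fed at the first time-step.
go_frame = torch.zeros(16, 80)        # (batch, n_mels) -- illustrative shapes
prenet_out = PreNet()(go_frame)       # (batch, 128)
```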

The pre-net is followed by a one-layer GRU (the attention RNN). Its output is used, together with the encoder's output, by the attention mechanism to compute the context vector. This GRU output is then concatenated with the context vector to produce the input of the decoder RNN block.
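As an illustration, here is a minimal sketch of one such attention step in PyTorch, assuming additive (Bahdanau-style) content-based attention over the encoder output; the class name, dimensions, and projection layers are illustrative assumptions rather than the paper's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionStep(nn.Module):
    """Sketch of one decoder step: attention RNN, additive attention over
    the encoder output, and concatenation of GRU output with the context."""
    def __init__(self, prenet_dim=128, enc_dim=256, attn_dim=256):
        super().__init__()
        self.attention_rnn = nn.GRUCell(prenet_dim, attn_dim)
        self.query_proj = nn.Linear(attn_dim, attn_dim, bias=False)
        self.memory_proj = nn.Linear(enc_dim, attn_dim, bias=False)
        self.score = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, prenet_out, attn_hidden, encoder_out):
        # prenet_out: (batch, prenet_dim), encoder_out: (batch, T_enc, enc_dim)
        attn_hidden = self.attention_rnn(prenet_out, attn_hidden)
        # Score each encoder time-step against the attention-RNN state (query).
        energies = self.score(torch.tanh(
            self.query_proj(attn_hidden).unsqueeze(1) + self.memory_proj(encoder_out)
        )).squeeze(-1)                                  # (batch, T_enc)
        weights = F.softmax(energies, dim=-1)
        # Context vector: attention-weighted sum of the encoder output.
        context = torch.bmm(weights.unsqueeze(1), encoder_out).squeeze(1)
        # Concatenate the GRU output with the context to feed the decoder RNN.
        decoder_input = torch.cat([attn_hidden, context], dim=-1)
        return decoder_input, attn_hidden, weights
```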

The decoder RNN is a two-layer residual GRU that uses vertical residual connections, as shown by Wu et al. (https://arxiv.org/abs/1609.08144). In their paper, they use a type of residual connection that is a bit more elaborate than the one presented a few paragraphs ago. Instead of adding the output of the last layer to the initial input, at each layer we add the layer's output to that layer's input, and use the result of the addition as the input for the following layer, as sketched below.
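A minimal sketch of such a stack of GRU cells with vertical residual connections might look as follows; the class name, dimensions, and use of GRUCell are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ResidualGRUDecoder(nn.Module):
    """Sketch of a two-layer GRU stack with vertical residual connections:
    each layer's output is added to its own input, and the sum feeds the
    next layer (in the spirit of Wu et al.)."""
    def __init__(self, dim=256, num_layers=2):
        super().__init__()
        self.cells = nn.ModuleList(nn.GRUCell(dim, dim) for _ in range(num_layers))

    def forward(self, x, hiddens):
        # x: (batch, dim), hiddens: list of (batch, dim) states, one per layer.
        new_hiddens = []
        for cell, h in zip(self.cells, hiddens):
            out = cell(x, h)
            new_hiddens.append(out)
            x = x + out   # residual: layer input + layer output -> next layer's input
        return x, new_hiddens
```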

The decoder RNN module produces r mel-spectrogram frames at each time-step, and only the last one is fed to the pre-net at the next time-step. Generating r frames, instead of just one, is motivated by the fact that one character in the encoder input usually corresponds to several frames. Outputting a single frame per time-step would therefore force the model to attend to the same input element for multiple time-steps, slowing down the learning of the attention alignment during training. Values of r = 2 and r = 3 are mentioned in the paper. Increasing r also reduces the model size and decreases the inference time.
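The following is a minimal sketch of how r frames can be predicted from a single decoder step, with only the last one fed back; the projection layer, variable names, and shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Sketch of the output projection and feedback with a reduction factor r.
n_mels, r, dec_dim, batch = 80, 3, 256, 16

to_frames = nn.Linear(dec_dim, n_mels * r)     # predict r frames per decoder step

decoder_out = torch.randn(batch, dec_dim)      # stand-in for one decoder RNN step
frames = to_frames(decoder_out).view(batch, r, n_mels)   # (batch, r, n_mels)

# Only the last of the r predicted frames is fed back through the pre-net
# at the next time-step.
next_prenet_input = frames[:, -1, :]           # (batch, n_mels)
```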
