The encoder

The encoder takes a sequence of characters as input, where each character is represented by a one-hot vector. An embedder is then used to project the input into a continuous space. Remember that, because of their high dimensionality and sparsity, one-hot encoded vectors can be computationally inefficient unless they are handled with techniques that exploit these characteristics. An embedder significantly reduces the size of the representation space. In addition, using an embedder allows the model to learn relationships between the different characters of our vocabulary.
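
The following is a minimal sketch of such an embedding layer, written in PyTorch (the framework choice, the vocabulary size, and the embedding size are illustrative assumptions, not taken from the original):

import torch
import torch.nn as nn

vocab_size = 70      # illustrative number of distinct characters
embedding_dim = 256  # illustrative size of the continuous representation

# nn.Embedding maps integer character IDs (equivalent to one-hot vectors)
# to dense, trainable vectors.
embedding = nn.Embedding(vocab_size, embedding_dim)

char_ids = torch.randint(0, vocab_size, (2, 5))  # a batch of 2 sequences of 5 characters
embedded = embedding(char_ids)                   # shape: (2, 5, 256)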

The embedding layer is then followed by a pre-net, which is a set of non-linear transformations. It consists of two consecutive fully-connected (FC) layers, with rectified linear unit (ReLU) activations and dropout. Dropout (http://jmlr.org/papers/v15/srivastava14a.html) is a regularization technique that randomly ignores some units (or neurons) during training, in order to avoid overfitting. Indeed, during training, some neurons can co-adapt and become codependent, which can result in overfitting. With dropout, the neural network tends to learn more robust features.

The second FC layer has half as many units as the first one. It is a bottleneck layer that helps with convergence and improves generalization.
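
A sketch of such a pre-net is shown below, again in PyTorch. The layer sizes (256 units, then a 128-unit bottleneck) and the dropout rate are illustrative assumptions:

import torch.nn as nn

class PreNet(nn.Module):
    def __init__(self, in_dim=256, hidden_dim=256, out_dim=128, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, out_dim),  # bottleneck: half the units
            nn.ReLU(),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)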

The Tacotron team uses a module called CBHG on top of the pre-net. The name of this module comes from its building blocks: a 1-D convolution bank (CB), followed by a highway network (H) and a bidirectional GRU (G).

K layers of 1-D convolutional filters are used to form the convolution bank. The k-th layer contains C_k filters of width k (k = 1, 2, ..., K). With this structure, we should be able to model unigrams, bigrams, and so on, up to K-grams.
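
The sketch below illustrates this idea with K parallel 1-D convolutions whose kernel widths range from 1 to K; the channel counts and the class name are assumptions for illustration:

import torch
import torch.nn as nn

class ConvBank(nn.Module):
    def __init__(self, in_channels=128, out_channels=128, K=16):
        super().__init__()
        # One Conv1d per n-gram width k = 1, ..., K.
        self.convs = nn.ModuleList([
            nn.Conv1d(in_channels, out_channels, kernel_size=k, padding=k // 2)
            for k in range(1, K + 1)
        ])

    def forward(self, x):
        # x: (batch, channels, time)
        T = x.size(-1)
        # Even kernel widths produce one extra time step, so trim back to T,
        # then stack all K outputs along the channel dimension.
        outs = [conv(x)[:, :, :T] for conv in self.convs]
        return torch.cat(outs, dim=1)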

Max pooling is used right after the convolution bank. It is a down-sampling technique that is commonly applied in Convolutional Neural Networks (CNNs). It has the advantage of making the learned features locally invariant. Here, it is used with a stride of 1, in order to maintain the time resolution.
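
As a small illustration of pooling over time without losing time resolution (kernel width 2 here is an assumption), the extra step introduced by padding is simply trimmed off:

import torch
import torch.nn as nn

pool = nn.MaxPool1d(kernel_size=2, stride=1, padding=1)

x = torch.randn(2, 128, 50)    # (batch, channels, time)
pooled = pool(x)[:, :, :-1]    # stride 1 keeps the resolution: back to (2, 128, 50)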

Batch normalization (Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, https://arxiv.org/abs/1502.03167) is used with all of the convolutional layers in the CBHG. It improves the performance, as well as the stability, of neural networks. It normalizes the input of a layer so that it has a mean of zero and a standard deviation of one (before a learned scale and shift are applied). It is known to help the network converge faster, to allow for higher learning rates in deep networks (thus making bigger steps toward a good minimum), to reduce the network's sensitivity to weight initialization, and to add extra regularization by providing some noise.
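
A typical convolution-plus-batch-normalization block could look like the following sketch (channel counts and kernel size are illustrative):

import torch.nn as nn

# Convolution, then batch normalization over the channel dimension, then the activation.
conv_bn_relu = nn.Sequential(
    nn.Conv1d(128, 128, kernel_size=3, padding=1),
    nn.BatchNorm1d(128),
    nn.ReLU(),
)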

Following the max pooling layer, two supplementary 1-D convolutional layers are used (with ReLU and linear activations, respectively). A residual connection (https://arxiv.org/abs/1512.03385) adds the initial input to the output of the second convolutional layer. Deep networks allow for capturing more complexity in the data, and have the potential to perform better than shallower networks on a given task. But, in general, the gradient tends to vanish in very deep networks. Residual connections allow for a better propagation of the gradient.

Thus, they significantly improve the training of deep models. A sketch of the two projection convolutions with the residual addition is shown below.
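
In this sketch (PyTorch, with illustrative sizes and a hypothetical class name), the first convolution uses a ReLU activation, the second is linear, and the initial input is added back to the result:

import torch.nn as nn

class ConvProjections(nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv1d(in_channels, hidden_channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(hidden_channels),
            nn.ReLU(),
        )
        self.conv2 = nn.Sequential(
            nn.Conv1d(hidden_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(out_channels),  # linear activation: no ReLU here
        )

    def forward(self, x, residual_input):
        # residual_input must have out_channels channels for the addition.
        return self.conv2(self.conv1(x)) + residual_input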

The next block of the CBHG module, a highway network (https://arxiv.org/abs/1505.00387), has a similar role: a learned gate decides, per unit, how much of the transformed input and how much of the raw input to let through, which also eases the flow of information (and gradients) across deep stacks of layers.
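
A single highway layer can be sketched as follows (layer dimension and class name are illustrative assumptions):

import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        h = torch.relu(self.transform(x))
        t = torch.sigmoid(self.gate(x))  # transform gate in [0, 1]
        return t * h + (1.0 - t) * x     # carry the rest of x through unchanged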

To finalize the CBHG module, as well as the encoder, a bidirectional GRU is used to learn the long-term dependencies in the sequence, from both the forward and the backward context. The encoder output will then be used by the attention layer.
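
The sketch below shows such a bidirectional GRU in PyTorch (the hidden size is an illustrative assumption); the forward and backward hidden states are concatenated, doubling the feature dimension of the output:

import torch
import torch.nn as nn

gru = nn.GRU(input_size=128, hidden_size=128,
             batch_first=True, bidirectional=True)

x = torch.randn(2, 50, 128)   # (batch, time, features)
encoder_outputs, _ = gru(x)   # shape: (2, 50, 256), i.e. 2 * hidden_size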

