Full architecture, with attention

Now, let's combine the previously defined functions to form the full Tacotron model.

But first, let's define some extra parameters that characterize the network:

NB_CHARS_MAX = 200 # maximum length of the input text

K1 = 16 # number of 1-D convolution blocks in the encoder CBHG
K2 = 8 # number of 1-D convolution blocks in the postprocessing CBHG
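To make the role of NB_CHARS_MAX concrete, here is a minimal sketch of how a text could be turned into a fixed-length sequence of integer indices before it reaches the encoder's embedding layer. The vocabulary string, the use of index 0 for padding, and the helper name encode_text are assumptions made for illustration; they are not necessarily the definitions used elsewhere in this chapter.

```python
NB_CHARS_MAX = 200  # maximum length of the input text (value from the text above)

# Hypothetical vocabulary: position 0 ("P") is reserved for padding.
VOCABULARY = "P abcdefghijklmnopqrstuvwxyz'.,?!"
CHAR_TO_ID = {ch: i for i, ch in enumerate(VOCABULARY)}


def encode_text(text, max_len=NB_CHARS_MAX):
    """Map characters to indices, truncate to max_len, then right-pad with 0."""
    # Characters outside the vocabulary are silently dropped in this sketch.
    ids = [CHAR_TO_ID[ch] for ch in text.lower() if ch in CHAR_TO_ID]
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))


encoded = encode_text("Hello, world!")
print(len(encoded), encoded[:5])
```

Every encoded text then has exactly NB_CHARS_MAX entries, which is what an input of shape (nb_char_max,) expects.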

Note that the model is defined by two input objects and two output objects.

The two input objects correspond to the encoder input and the decoder input. The former is the input text. The latter is the last of the r mel-spectrogram frames predicted by the decoder at the previous step, before the postprocessing CBHG. The first frame of the decoder input is all zeros, as in the paper.
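The decoder input described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper name build_decoder_input is hypothetical, the number of frames is assumed to be a multiple of r, and the exact alignment used by the chapter's own data preparation may differ.

```python
def build_decoder_input(mel_frames, r):
    """Build one decoder input frame per decoder step.

    mel_frames: list of frames, each a list of n_mels floats;
                its length is assumed to be a multiple of r.
    """
    n_mels = len(mel_frames[0])
    # Keep the last frame of each group of r consecutive frames.
    last_frames = [mel_frames[i + r - 1]
                   for i in range(0, len(mel_frames) - r + 1, r)]
    # Shift by one step: step t sees the last frame of group t - 1,
    # and step 0 sees an all-zero frame.
    zero_frame = [0.0] * n_mels
    return [zero_frame] + last_frames[:-1]


# Toy example: 6 frames of 3 mel bands, r = 2, so 3 decoder steps.
toy_mel = [[float(t)] * 3 for t in range(6)]
decoder_input = build_decoder_input(toy_mel, r=2)
print(decoder_input)
```

With r frames predicted per step, the decoder runs for only 1/r of the original number of frames, which is why the input keeps a single frame per group.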

The two output objects are the mel-scaled spectrogram predicted by the decoder RNN and the magnitude spectrogram predicted by the postprocessing CBHG module:

def get_tacotron_model(n_mels, r, k1, k2, nb_char_max,
                       embedding_size, mel_time_length,
                       mag_time_length, n_fft,
    # Encoder:
    input_encoder = Input(shape=(nb_char_max,))

    embedded = Embedding(input_dim=len(vocabulary),
    prenet_encoding = get_pre_net(embedded)

    cbhg_encoding = get_CBHG_encoder(prenet_encoding,

    # Decoder-part1-Prenet:
    input_decoder = Input(shape=(None, n_mels))
    prenet_decoding = get_pre_net(input_decoder)
    attention_rnn_output = get_attention_RNN()(prenet_decoding)

    # Attention
    attention_rnn_output_repeated = RepeatVector(

    attention_context = get_attention_context(cbhg_encoding,

    context_shape1 = int(attention_context.shape[1])
    context_shape2 = int(attention_context.shape[2])
    attention_rnn_output_reshaped = Reshape((context_shape1,

    # Decoder-part2:
    input_of_decoder_rnn = concatenate(
        [attention_context, attention_rnn_output_reshaped])
    input_of_decoder_rnn_projected = Dense(256)(input_of_decoder_rnn)

    output_of_decoder_rnn = get_decoder_RNN_output(

    # mel_hat=TimeDistributed(Dense(n_mels*r))(output_of_decoder_rnn)
    mel_hat = Dense(mel_time_length * n_mels * r)(output_of_decoder_rnn)
    mel_hat_ = Reshape((mel_time_length, n_mels * r))(mel_hat)

    def slice(x):
        return x[:, :, -n_mels:]

    mel_hat_last_frame = Lambda(slice)(mel_hat_)
    post_process_output = get_CBHG_post_process(mel_hat_last_frame,

    z_hat = Dense(mag_time_length * (1 + n_fft // 2))(post_process_output)
    z_hat_ = Reshape((mag_time_length, (1 + n_fft // 2)))(z_hat)

    model = Model(inputs=[input_encoder, input_decoder],
                  outputs=[mel_hat_, z_hat_])
    return model
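To make the tensor sizes in get_tacotron_model easier to follow, the sketch below reproduces only the shape arithmetic of the two outputs and the effect of the slice function on a single time step. It does not build any Keras layer. The helper names output_shapes and last_frame and all numeric values are assumptions chosen for illustration, not the settings used in this chapter.

```python
def output_shapes(n_mels, r, mel_time_length, mag_time_length, n_fft):
    """Per-example shapes of the two model outputs (batch axis omitted)."""
    mel_shape = (mel_time_length, n_mels * r)       # shape of mel_hat_
    mag_shape = (mag_time_length, 1 + n_fft // 2)   # shape of z_hat_
    return mel_shape, mag_shape


def last_frame(step_features, n_mels):
    """Plain-Python analogue of x[:, :, -n_mels:] for one time step.

    step_features holds r frames of n_mels values laid out end to end;
    the last n_mels values are the last of the r frames.
    """
    return step_features[-n_mels:]


# Illustrative values only.
mel_shape, mag_shape = output_shapes(n_mels=80, r=5, mel_time_length=200,
                                     mag_time_length=850, n_fft=1024)
print(mel_shape, mag_shape)

# One time step with r = 3 frames of n_mels = 2 values each.
step = [0, 1, 2, 3, 4, 5]
print(last_frame(step, n_mels=2))
```

The factor 1 + n_fft // 2 is the number of non-redundant frequency bins of a real-valued FFT of size n_fft.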

We can then compile the model. Since the model has two outputs, we need two loss functions. The paper uses two L1 losses with equal weights, and we do the same. We use Adam as the optimizer, with its default parameters. To keep things simple, we do not follow the learning rate schedule used in the paper, but we encourage the reader to try more advanced settings:

opt = Adam()
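The objective described above, two L1 losses with equal weights, can be written out explicitly. The following is a framework-free sketch rather than the chapter's training code; the function names l1_loss and total_loss and the weight arguments are hypothetical.

```python
def l1_loss(y_true, y_pred):
    """Mean absolute error between two flat sequences of equal length."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def total_loss(mel_true, mel_pred, mag_true, mag_pred, w_mel=1.0, w_mag=1.0):
    """Weighted sum of the mel loss and the magnitude loss (equal by default)."""
    return (w_mel * l1_loss(mel_true, mel_pred)
            + w_mag * l1_loss(mag_true, mag_pred))


# Toy values: mel error is 0.5, magnitude error is 1.0.
loss = total_loss([0.0, 1.0], [1.0, 1.0], [2.0, 2.0], [2.0, 4.0])
print(loss)
```

In Keras, an objective of this form is typically expressed by passing one 'mean_absolute_error' loss per output to model.compile, optionally together with loss_weights=[1., 1.].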