Architecture

WaveNet is a neural network architecture that produces high-quality results in audio generation and text-to-speech synthesis, because it generates the raw audio waveform directly.

Given the previous samples and additional parameters as input, the network predicts the next sample of the audio waveform by modeling its conditional probability.
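Written out, this conditional formulation amounts to factorizing the probability of a whole waveform into one term per sample. The symbols below are illustrative assumptions rather than notation from this section: x_t is the sample at time step t, T is the number of samples, and h stands for the additional parameters.

```latex
p(x \mid h) = \prod_{t=1}^{T} p\left(x_t \mid x_1, \ldots, x_{t-1}, h\right)
```

Each factor is the distribution over the next sample that the network outputs at one time step.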

After the audio is preprocessed, the input waveform is quantized to a fixed range of integers. These integer amplitudes are then one-hot encoded to produce the input tensors. A convolutional layer that accesses only the current and previous inputs then reduces the channel dimension.
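The quantization and one-hot steps can be sketched in plain Python. This is a minimal illustration under stated assumptions: the text does not name the quantization scheme, so the sketch assumes mu-law companding with 256 levels, and the function names mu_law_quantize and one_hot are hypothetical.

```python
import math


def mu_law_quantize(sample, levels=256):
    """Map a float amplitude in [-1, 1] to an integer in [0, levels - 1].

    Assumption: mu-law companding; the text only says "a fixed range of integers".
    """
    mu = levels - 1
    # Clip so that out-of-range samples still land in a valid bin.
    sample = max(-1.0, min(1.0, sample))
    # Compress the amplitude logarithmically, keeping its sign.
    compressed = math.copysign(math.log1p(mu * abs(sample)) / math.log1p(mu), sample)
    # Shift from [-1, 1] to [0, mu] and round to the nearest integer bin.
    return int((compressed + 1.0) / 2.0 * mu + 0.5)


def one_hot(index, levels=256):
    """Return a list of length `levels` with a single 1.0 at `index`."""
    vec = [0.0] * levels
    vec[index] = 1.0
    return vec


waveform = [-1.0, -0.5, 0.0, 0.5, 1.0]           # toy preprocessed audio
codes = [mu_law_quantize(s) for s in waveform]   # integer amplitudes
encoded = [one_hot(c) for c in codes]            # shape: time steps x 256 channels
```

Each time step becomes a vector with 256 channels, which is the channel dimension that the first convolutional layer then reduces.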

The following diagram displays the WaveNet architecture:

The core of the network is a stack of causal dilated layers. Each layer is a dilated convolution (a convolution with holes) that accesses only the past and current audio samples.
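A small sketch may help make causality and dilation concrete. It is not the network's actual implementation: the kernel size of 2, the dilation schedule 1, 2, 4, 8, and the helper names are assumptions chosen for illustration, and the weights are fixed to 1.0 so the effect is easy to read.

```python
def causal_dilated_conv(signal, weights, dilation):
    """1-D causal dilated convolution over a list of floats.

    output[t] uses signal[t], signal[t - dilation], signal[t - 2 * dilation], ...
    and never looks at samples after t.
    """
    out = []
    for t in range(len(signal)):
        total = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation   # look back k * dilation steps
            if idx >= 0:             # positions before the start act as zero padding
                total += w * signal[idx]
        out.append(total)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output sample."""
    return 1 + (kernel_size - 1) * sum(dilations)


dilations = [1, 2, 4, 8]   # assumed schedule: dilation doubles per layer
x = [0.0] * 24
x[5] = 1.0                 # single impulse at t = 5
y = x
for d in dilations:
    y = causal_dilated_conv(y, [1.0, 1.0], d)

span = receptive_field(2, dilations)
```

After the four layers, the impulse at t = 5 affects only outputs at t = 5 and later, never earlier ones, and the number of affected outputs matches the receptive field computed by receptive_field. Stacking dilated layers lets the receptive field grow quickly with depth while each layer stays cheap.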

The outputs of all the layers are then combined and passed through a series of dense postprocessing layers, which map them back to the original number of channels. Finally, the softmax function converts the output into a categorical distribution.

The loss is the cross entropy between the output at each time step and the input at the next time step.
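The softmax and loss computation can be sketched as follows. This is a minimal plain-Python illustration, assuming the targets are the quantized integer amplitudes and that the loss is averaged over time steps; the names softmax and next_step_cross_entropy, and the toy setting of 4 classes, are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into a categorical distribution."""
    m = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_step_cross_entropy(logits_per_step, codes):
    """Mean cross entropy where the output at step t is scored against codes[t + 1]."""
    losses = []
    for t in range(len(codes) - 1):        # the last step has no next sample to predict
        probs = softmax(logits_per_step[t])
        losses.append(-math.log(probs[codes[t + 1]]))
    return sum(losses) / len(losses)


codes = [2, 0, 3, 1]                                  # toy quantized input, 4 classes
uniform_logits = [[0.0] * 4 for _ in codes]           # network that knows nothing
uniform_loss = next_step_cross_entropy(uniform_logits, codes)

# Logits that strongly favor the correct next sample at each step.
confident_logits = [[0.0] * 4 for _ in codes]
for t in range(len(codes) - 1):
    confident_logits[t][codes[t + 1]] = 10.0
confident_loss = next_step_cross_entropy(confident_logits, codes)
```

With uniform predictions over 4 classes the loss equals log(4), and it falls toward zero as the output at each step puts more probability on the sample that actually follows.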
