Overview of the DeepSpeech model

The model consists of a stack of fully connected hidden layers, followed by a bidirectional RNN, with additional hidden layers at the output. The first three nonrecurrent layers act as a preprocessing step for the RNN layer. One notable addition is the use of clipped rectified linear units (ReLUs) to prevent the activations from exploding. The input audio features are Mel-frequency cepstral coefficients (MFCCs), which the nonrecurrent layers see as time slices of the spectrogram. In addition to the current time slice, the input is preprocessed to include past and future context frames. The fourth layer is the RNN layer, which has both a forward recurrence and a backward recurrence. The fifth layer takes the concatenated outputs of the forward and backward recurrences and produces an output that is fed to a final softmax layer, which predicts the character probabilities. For more details on the architecture, you can take a look at the original paper at https://arxiv.org/abs/1412.5567. The following diagram from that paper shows the hidden layers and the bidirectional recurrent layer, denoted by the blue and red arrows:
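The clipped ReLU mentioned above is simple to implement. Here is a minimal NumPy sketch; the clip value of 20 follows the original DeepSpeech paper, and the sample input values are purely illustrative:

```python
import numpy as np

def clipped_relu(x, clip=20.0):
    # Standard ReLU, but activations are capped at `clip` so they
    # cannot grow without bound: min(max(x, 0), clip)
    return np.minimum(np.maximum(x, 0.0), clip)

print(clipped_relu(np.array([-3.0, 5.0, 50.0])))  # [ 0.  5. 20.]
```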

The audio input with the context is also shown in the preceding diagram, along with the time-sliced MFCC features. We will now look at how to implement this model in TensorFlow. The complete Jupyter Notebook can be found under Chapter11/02_example.ipynb in this book's code repository. Before that, we will briefly take a look at the data we will be using for our model training.
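Before turning to the TensorFlow notebook, the layer stack described above can be sketched as a plain NumPy forward pass. All sizes here (26 MFCCs per frame, a context of 9 frames on each side, hidden width 128, a 29-character alphabet) are illustrative assumptions rather than values taken from the notebook, and simple RNN cells with random, untrained weights stand in for the real recurrences; the sketch only demonstrates how the tensors flow through the five layers and the softmax:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the notebook's values):
T, n_mfcc, context = 100, 26, 9      # frames, MFCCs per frame, context width
n_hidden, n_chars = 128, 29          # hidden width, character alphabet size

def clipped_relu(x, clip=20.0):
    # Clipped ReLU keeps activations from exploding.
    return np.minimum(np.maximum(x, 0.0), clip)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dense(x, n_out):
    # Fresh random weights on each call -- enough for a shape-level sketch.
    W = rng.standard_normal((x.shape[-1], n_out)) * 0.01
    return x @ W

def add_context(features, c):
    # Concatenate each frame with its c past and c future neighbours,
    # zero-padding at the edges of the utterance.
    n_frames, _ = features.shape
    padded = np.pad(features, ((c, c), (0, 0)))
    return np.stack([padded[t:t + 2 * c + 1].ravel() for t in range(n_frames)])

mfcc = rng.standard_normal((T, n_mfcc))    # time-sliced MFCC features
x = add_context(mfcc, context)             # (100, 494): 26 * (2*9 + 1)

# Layers 1-3: nonrecurrent preprocessing layers with clipped ReLUs.
h = x
for _ in range(3):
    h = clipped_relu(dense(h, n_hidden))

# Layer 4: bidirectional recurrence (forward and backward passes over time).
Wf = rng.standard_normal((n_hidden, n_hidden)) * 0.01
Uf = rng.standard_normal((n_hidden, n_hidden)) * 0.01
Wb = rng.standard_normal((n_hidden, n_hidden)) * 0.01
Ub = rng.standard_normal((n_hidden, n_hidden)) * 0.01
fwd = np.zeros((T, n_hidden))
bwd = np.zeros((T, n_hidden))
state = np.zeros(n_hidden)
for t in range(T):                         # forward recurrence
    state = clipped_relu(h[t] @ Wf + state @ Uf)
    fwd[t] = state
state = np.zeros(n_hidden)
for t in reversed(range(T)):               # backward recurrence
    state = clipped_relu(h[t] @ Wb + state @ Ub)
    bwd[t] = state

# Layer 5: operates on the concatenated forward/backward outputs.
h5 = clipped_relu(dense(np.concatenate([fwd, bwd], axis=-1), n_hidden))

# Output layer: per-time-step character probabilities via softmax.
probs = softmax(dense(h5, n_chars))
print(probs.shape)                         # (100, 29)
```

In the real model, the dense and recurrent weights are of course learned parameters trained with the CTC loss, and the framework's bidirectional RNN layers replace the hand-written recurrence loops.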
