Summary

In this chapter, we described deep learning methods for speech recognition. We began with an overview of speech recognition software currently used in practice. We showed that traditional HMM-based methods may need to incorporate specific language models, whereas neural network-based methods can learn end-to-end speech transcription entirely from data; this is one of the main advantages of neural network models over HMM models. We then developed a basic spoken digit recognition model using TensorFlow, training it on the open spoken digits dataset and making predictions on a test set. This example provided background on the core tasks in a speech recognition system, such as extracting frequency-domain features like MFCCs from the raw audio and converting the text transcripts into label sequences. We then introduced the DeepSpeech architecture from Baidu, one of the most popular recent models for speech transcription, explained a complete implementation of the model in TensorFlow, and trained it on a subset of the LDC dataset. To explore further, the reader can tweak the model parameters and train the model on a larger dataset.
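As a quick illustration of the preprocessing steps mentioned above, here is a minimal sketch that extracts MFCC features from a raw audio file and converts a transcript into integer labels. It assumes the librosa library is installed; the file name and character alphabet are hypothetical and not taken from the chapter's code.

```python
import librosa
import numpy as np

# Hypothetical input file; any 16 kHz mono WAV will do.
audio, sample_rate = librosa.load("digit_sample.wav", sr=16000)

# Extract 13 MFCC coefficients per frame from the raw waveform.
mfcc = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=13)
print(mfcc.shape)  # (13, num_frames)

# Map a transcript to integer labels over an assumed alphabet
# (index 0 reserved for the space character).
alphabet = " abcdefghijklmnopqrstuvwxyz"
char_to_index = {ch: i for i, ch in enumerate(alphabet)}
transcript = "seven"
labels = np.array([char_to_index[ch] for ch in transcript])
print(labels)  # [19  5 22  5 14]
```

The MFCC matrix plays the role of the model's input features, and the integer sequence is the target the network is trained to predict.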

We then briefly looked at the state of the art in speech recognition, focusing mainly on attention-based models, in particular the model described in the Listen, Attend and Spell (LAS) paper. While CTC-based models assume that output characters are conditionally independent of one another given the input, the LAS model makes no such assumption; the authors describe this as one of its main advantages over DeepSpeech-like models. The interested reader may take a look at the PyTorch implementation of this model at https://github.com/XenderLiu/Listen-Attend-and-Spell-Pytorch.
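To make the CTC loss concrete, here is a minimal sketch that computes it with TensorFlow's built-in tf.nn.ctc_loss on random logits; the batch size, sequence lengths, and 28-symbol alphabet are illustrative assumptions rather than the chapter's actual model settings.

```python
import tensorflow as tf

batch_size, time_steps, num_classes = 2, 50, 28  # 26 letters + space + CTC blank

# Random per-frame logits standing in for an acoustic model's output.
# CTC scores each frame independently given these activations, which is
# the conditional-independence assumption that LAS avoids.
logits = tf.random.normal([time_steps, batch_size, num_classes])
labels = tf.constant([[19, 5, 22, 5, 14],   # "seven"
                      [14, 9, 14, 5, 0]],   # "nine", zero-padded
                     dtype=tf.int32)
label_length = tf.constant([5, 4], dtype=tf.int32)
logit_length = tf.fill([batch_size], time_steps)

loss = tf.nn.ctc_loss(labels=labels,
                      logits=logits,
                      label_length=label_length,
                      logit_length=logit_length,
                      logits_time_major=True,
                      blank_index=num_classes - 1)
print(loss)  # one loss value per batch element
```

By contrast, an attention-based decoder such as LAS conditions each emitted character on all previously emitted characters, so no comparable independence assumption is needed.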

In the next chapter, we will look at the reverse of speech recognition: converting text to speech.
