Speech recognition

In recent decades, a tremendous amount of research has gone into applying deep learning to speech-related applications. Speech recognition is now part of many everyday devices and products, including phones, smartwatches, smart home devices, and games.

It is a prominent feature of voice search applications such as Siri and Alexa, built by Apple and Amazon, respectively.

Sound waves are time-domain signals, which means that when we plot a sound wave, one of the axes is time (the independent variable) and the other is the amplitude of the wave (the dependent variable).

To create a digital recording of a sound wave, we convert the analog sound signal into digital form by sampling it. Sampling takes measurements of the dependent variable (the amplitude) at a regular time interval called the sampling interval. A smaller sampling interval captures the wave more faithfully and therefore gives better sound quality. To describe the quality of a recorded sound, we usually use the term sampling rate rather than the sampling interval. The sampling rate is the number of samples taken per second from an analog sound wave.
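The sampling step described above can be sketched in a few lines of Python. This is only an illustration using the standard library: the function name `sample_sine`, the 440 Hz tone, and the 8,000 samples-per-second rate are assumptions chosen for the example, and a mathematical sine function stands in for the analog signal.

```python
import math

def sample_sine(freq_hz, sampling_rate_hz, duration_s, amplitude=1.0):
    """Measure a sine wave at regular time intervals and return (times, samples)."""
    sampling_interval = 1.0 / sampling_rate_hz            # seconds between two samples
    n_samples = int(round(duration_s * sampling_rate_hz))  # total measurements taken
    times = [n * sampling_interval for n in range(n_samples)]
    samples = [amplitude * math.sin(2 * math.pi * freq_hz * t) for t in times]
    return times, samples

# Hypothetical settings: a 440 Hz tone, 8,000 samples per second, 10 ms of sound.
times, samples = sample_sine(freq_hz=440.0, sampling_rate_hz=8000, duration_s=0.01)
print(len(samples))          # number of measurements taken in 10 ms
print(times[1] - times[0])   # the sampling interval, in seconds
```

Raising `sampling_rate_hz` shrinks the interval between measurements, so the list of samples follows the original wave more closely.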

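The sampling rate can be expressed as follows:

sampling rate = 1 / sampling interval

With the sampling interval measured in seconds, the sampling rate is in samples per second (hertz).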

The time-domain representation of sound is not always the most useful one. Much of the most distinguishing information lies in the frequency spectrum of the signal. Mathematical transformations such as the Fourier Transform (FT) are used to transform a sound wave into the frequency domain. When we apply a Fourier Transform to a time-domain signal, we obtain its frequency-amplitude representation. Since a digital recording of sound is discrete in time, we use the Discrete Fourier Transform (DFT) to transform it into the frequency domain. The Fast Fourier Transform (FFT) is an algorithm that computes the DFT efficiently. Computing a single FFT over the entire sound is not informative enough, because it shows which frequencies are present but not when they occur. To recover that timing information, we use the Short-Time Fourier Transform (STFT). In the STFT, we slide a window across the signal and compute a DFT at each window position to obtain the magnitude of the frequency spectrum over time.
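The sliding-window idea can be illustrated with a minimal sketch. To stay dependency-free it evaluates the DFT directly from its definition rather than with an FFT, which is far slower but gives the same magnitudes; in practice a library FFT routine would normally be used. The function names, the Hann window, the window and hop sizes, and the test signal (50 Hz followed by 200 Hz, sampled at 800 samples per second) are all assumptions made for this example.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitudes of the DFT of one frame for bins 0..N/2 (direct O(N^2) evaluation)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        mags.append(abs(acc))
    return mags

def stft_magnitudes(signal, window_size, hop_size):
    """Slide a Hann window across the signal and take the DFT magnitude of each frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / window_size)
              for i in range(window_size)]
    frames = []
    for start in range(0, len(signal) - window_size + 1, hop_size):
        frame = [signal[start + i] * window[i] for i in range(window_size)]
        frames.append(dft_magnitudes(frame))
    return frames

# Hypothetical test signal: 50 Hz for the first half, 200 Hz for the second half.
sampling_rate = 800
signal = [math.sin(2 * math.pi * (50 if n < 200 else 200) * n / sampling_rate)
          for n in range(400)]

window_size = 64
spec = stft_magnitudes(signal, window_size=window_size, hop_size=32)
bin_hz = sampling_rate / window_size   # frequency spacing between DFT bins

def peak_frequency(mags):
    """Frequency (in Hz) of the bin with the largest magnitude."""
    return max(range(len(mags)), key=mags.__getitem__) * bin_hz

print(peak_frequency(spec[0]))    # 50.0  (first window lies in the 50 Hz part)
print(peak_frequency(spec[-1]))   # 200.0 (last window lies in the 200 Hz part)
```

A single DFT over all 400 samples would show both frequencies but not their order, whereas the per-window peaks reveal that the 50 Hz tone comes first.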
