How it works...

In Step 1, we prepared a dataset for the generator by creating a DataFrame that holds the file path and class label of each audio file. In Step 2, we drew stratified samples and split the DataFrame into training, test, and validation sets, so that each set preserves the overall class proportions. In Step 3, we built our convolutional neural network and compiled it. A rough sketch of the first two steps is shown below.
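
The following is a minimal sketch of Steps 1 and 2, not the recipe's exact code: the directory layout, column names, split ratios, and random seed are all assumptions made for illustration.

```python
import pathlib
import pandas as pd
from sklearn.model_selection import train_test_split

# Collect (path, label) pairs; assumes one subdirectory per command word,
# as in the Google Speech Commands layout. The directory name is hypothetical.
data_dir = pathlib.Path("speech_commands")
records = [
    {"file_path": str(p), "label": p.parent.name}
    for p in data_dir.glob("*/*.wav")
]
df = pd.DataFrame(records)

# Stratify on the class label so every split keeps the same class proportions.
# A 70/15/15 split is used here purely as an example.
train_df, holdout_df = train_test_split(
    df, test_size=0.3, stratify=df["label"], random_state=42
)
val_df, test_df = train_test_split(
    holdout_df, test_size=0.5, stratify=holdout_df["label"], random_state=42
)
```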

In Step 4, we built a data generator and created training and validation generators. The generator function reads the audio files from disk, transforms each signal into its frequency-amplitude representation, and yields the data in batches. We know that the speech commands dataset is sampled at 16 kHz; that is, a 1-second recording contains 16,000 samples. The dataset also contains a few audio files that are shorter or longer than 1 second. To accommodate recordings of varying lengths, we padded or truncated each signal to a fixed length of 16,000 samples before applying the short-time Fourier transform (STFT). In the Getting ready section of this recipe, we observed that the STFT of a 1-second recording sampled at 16 kHz produces a 256 x 51 array (fft_size x num_fft_windows). This is why we defined an input shape of fft_size x num_fft_windows for the first convolutional layer of our model.
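
The sketch below illustrates this generator logic. The STFT parameters (a 512-sample window with a 320-sample hop via scipy.signal.stft) are one choice that yields 51 frames for a 16,000-sample clip and, after dropping one of the 257 frequency bins, a 256 x 51 magnitude array; the recipe's actual parameters and batching details may differ.

```python
import numpy as np
from scipy import signal
from scipy.io import wavfile

SAMPLE_RATE = 16000      # speech commands are sampled at 16 kHz
FFT_SIZE = 256           # frequency bins kept per frame
NUM_FFT_WINDOWS = 51     # time frames per 1-second clip

def audio_to_spectrogram(path):
    """Read a wav file, pad/truncate to 1 second, and return its |STFT|."""
    _, wav = wavfile.read(path)
    wav = wav.astype(np.float32)
    # Pad short clips with zeros; truncate long ones to 16,000 samples.
    if len(wav) < SAMPLE_RATE:
        wav = np.pad(wav, (0, SAMPLE_RATE - len(wav)))
    else:
        wav = wav[:SAMPLE_RATE]
    # nperseg=512 with noverlap=192 gives a hop of 320 samples; with scipy's
    # default boundary padding this produces 51 frames and 257 frequency
    # bins, which we slice down to 256 x 51.
    _, _, Zxx = signal.stft(wav, fs=SAMPLE_RATE, nperseg=512, noverlap=192)
    return np.abs(Zxx)[:FFT_SIZE, :NUM_FFT_WINDOWS]

def data_generator(df, batch_size, class_names):
    """Yield (spectrogram batch, one-hot label batch) pairs indefinitely."""
    label_to_idx = {name: i for i, name in enumerate(class_names)}
    while True:
        batch = df.sample(batch_size)
        x = np.stack([audio_to_spectrogram(p) for p in batch["file_path"]])
        y = np.eye(len(class_names))[[label_to_idx[l] for l in batch["label"]]]
        # Add a channel axis so the shape matches a Conv2D input.
        yield x[..., np.newaxis], y
```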

In Step 5, we defined the model callbacks. In the final step, we trained the model and tested its predictions on a test sample.
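
A minimal illustration of the callbacks and training call follows, assuming the compiled model from Step 3 and the generator sketched above; the checkpoint filename, patience, batch size, and epoch count are placeholders rather than the recipe's settings.

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Save the best weights seen so far and stop once validation loss stalls.
callbacks = [
    ModelCheckpoint("best_model.h5", monitor="val_loss", save_best_only=True),
    EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),
]

class_names = sorted(df["label"].unique())
batch_size = 32
model.fit(
    data_generator(train_df, batch_size, class_names),
    steps_per_epoch=len(train_df) // batch_size,
    validation_data=data_generator(val_df, batch_size, class_names),
    validation_steps=len(val_df) // batch_size,
    epochs=30,
    callbacks=callbacks,
)
```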
