LSTM model for spoken digit recognition

For this example, we will use the tflearn package for simplicity. It can be installed using the following command:

pip install tflearn

We will define the function that reads the .wav files and prepares them for batch training:

import os
import random
import numpy as np

def get_batch_mfcc(fpath, batch_size=256):
    ft_batch = []
    labels_batch = []
    files = os.listdir(fpath)
    while True:
        print("Total %d files" % len(files))
        random.shuffle(files)
        for fname in files:
            if not fname.endswith(".wav"):
                continue
            # Extract the MFCC features and one-hot encode the digit label
            # (the digit is the first character of the file name)
            mfcc_features = get_mfcc_features(fpath + fname)
            label = np.eye(10)[int(fname[0])]
            labels_batch.append(label)
            ft_batch.append(mfcc_features)
            if len(ft_batch) >= batch_size:
                yield ft_batch, labels_batch
                ft_batch = []
                labels_batch = []
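
The generator above relies on the get_mfcc_features helper defined earlier in the chapter. As a reminder, a minimal sketch of such a helper might look like the following; using librosa with its default of 20 MFCC coefficients and a fixed utterance length of 35 frames is an assumption here, not necessarily the chapter's exact implementation:

import librosa
import numpy as np

def get_mfcc_features(fpath, utterance_length=35):
    # Load the audio as a mono signal and compute its MFCCs
    raw_w, sampling_rate = librosa.load(fpath, mono=True)
    mfcc_features = librosa.feature.mfcc(y=raw_w, sr=sampling_rate)
    # Truncate or zero-pad along the time axis to a fixed number of frames
    if mfcc_features.shape[1] > utterance_length:
        mfcc_features = mfcc_features[:, :utterance_length]
    else:
        pad_width = utterance_length - mfcc_features.shape[1]
        mfcc_features = np.pad(mfcc_features, ((0, 0), (0, pad_width)), mode='constant', constant_values=0)
    return mfcc_features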

In the get_batch_mfcc function, we read the .wav files and use get_mfcc_features to extract the MFCC features. Each label is one-hot encoded over the 10 digits from zero to nine, and the function yields the data in batches of 256 examples by default. Next, we will define the Long Short-Term Memory (LSTM) model:

import tflearn

# Create the training batch generator
train_batch = get_batch_mfcc('../../speech_dset/recordings/train/')

# Input -> LSTM -> fully connected softmax over the digit classes
sp_network = tflearn.input_data([None, audio_features, utterance_length])
sp_network = tflearn.lstm(sp_network, 128*4, dropout=0.5)
sp_network = tflearn.fully_connected(sp_network, ndigits, activation='softmax')
sp_network = tflearn.regression(sp_network, optimizer='adam', learning_rate=lr, loss='categorical_crossentropy')
sp_model = tflearn.DNN(sp_network, tensorboard_verbose=0)

# Train on successive batches drawn from the generator
while iterations_train > 0:
    X_tr, y_tr = next(train_batch)
    X_test, y_test = next(train_batch)
    sp_model.fit(X_tr, y_tr, n_epoch=10, validation_set=(X_test, y_test), show_metric=True, batch_size=bsize)
    iterations_train -= 1

sp_model.save("/tmp/speech_recognition.lstm")
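
The snippet above uses several hyperparameters that are set earlier in the chapter: audio_features, utterance_length, ndigits, lr, bsize, and iterations_train. For reference, representative values, which are assumptions here rather than the chapter's exact settings, might be:

# Representative hyperparameter values (assumed; set earlier in the chapter)
audio_features = 20       # MFCC coefficients per frame
utterance_length = 35     # frames per utterance after padding/truncation
ndigits = 10              # output classes: the digits 0 to 9
lr = 0.001                # learning rate for the Adam optimizer
bsize = 64                # mini-batch size passed to fit
iterations_train = 30     # number of batches drawn from the generator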

The model consists of an LSTM layer followed by a fully connected layer. We use categorical cross-entropy as the loss function with the Adam optimizer. The model is trained on the batch inputs from the get_batch_mfcc function, and we get the following output after 300 epochs:

Training Step: 1199  | total loss: 0.45749 | time: 0.617s
| Adam | epoch: 300  | loss: 0.45749 - acc: 0.8975 -- iter: 192/256

Now, we will make a prediction on an audio file from the test set. The test audio is a recording of the spoken digit 4:

sp_model.load('/tmp/speech_recognition.lstm')
mfcc_features = get_mfcc_features('../../speech_dset/recordings/test/4_jackson_40.wav')
mfcc_features = mfcc_features.reshape((1,mfcc_features.shape[0],mfcc_features.shape[1]))
prediction_digit = sp_model.predict(mfcc_features)
print(prediction_digit)
print("Digit predicted: ", np.argmax(prediction_digit))



Output:
INFO:tensorflow:Restoring parameters from /tmp/speech_recognition.lstm
[[2.3709694e-03 5.1581711e-03 7.8898791e-04 1.9530311e-03 9.8459840e-01
  1.1394228e-03 3.0317350e-04 1.8992715e-03 1.6027489e-03 1.8592674e-04]]

Digit predicted: 4

We load the trained model and extract the MFCC features for the test audio file. As seen from the output, the model predicted the correct digit, and the output also shows the predicted probabilities for each of the 10 digits.
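
The same steps can be wrapped into a quick accuracy check over the whole test set. The following is a hypothetical sketch; it assumes the test directory path used above and the digit-prefix file-naming convention of the dataset:

import os
import numpy as np

test_dir = '../../speech_dset/recordings/test/'
correct, total = 0, 0
for fname in os.listdir(test_dir):
    if not fname.endswith('.wav'):
        continue
    # The first character of the file name is the spoken digit
    feats = get_mfcc_features(test_dir + fname)
    feats = feats.reshape((1, feats.shape[0], feats.shape[1]))
    prediction = np.argmax(sp_model.predict(feats))
    correct += int(prediction == int(fname[0]))
    total += 1
print("Test accuracy: %.2f%%" % (100.0 * correct / total))

Next, we will look at model visualization in TensorBoard.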
