Preprocessing the audio data

The MFCC features are extracted from the audio data just as in our previous example. In addition, we also add the past and future context that was used in the original paper:

import numpy as np
from scipy.io import wavfile
# The mfcc function is assumed to come from python_speech_features,
# matching the call signature used in the previous example
from python_speech_features import mfcc

def audiofile_to_vector(audio_fname, n_mfcc_features, nctx):
    # Read the .wav file and compute its MFCC features
    sampling_rate, raw_w = wavfile.read(audio_fname)
    mfcc_ft = mfcc(raw_w, samplerate=sampling_rate, numcep=n_mfcc_features)
    # Keep every second frame to shorten the sequence
    mfcc_ft = mfcc_ft[::2]
    n_strides = len(mfcc_ft)
    # Pad the feature matrix with zero-valued dummy context at the front and back
    dummy_ctx = np.zeros((nctx, n_mfcc_features), dtype=mfcc_ft.dtype)
    mfcc_ft = np.concatenate((dummy_ctx, mfcc_ft, dummy_ctx))
    # Build a sliding window of nctx past and nctx future frames per time slice
    w_size = 2*nctx + 1
    input_vector = np.lib.stride_tricks.as_strided(
        mfcc_ft,
        (n_strides, w_size, n_mfcc_features),
        (mfcc_ft.strides[0], mfcc_ft.strides[0], mfcc_ft.strides[1]),
        writeable=False)
    # Flatten each window and copy so the result owns its own memory
    input_vector = np.reshape(input_vector, [n_strides, -1])
    input_vector = np.copy(input_vector)
    # Normalize to zero mean and unit variance
    input_vector = (input_vector - np.mean(input_vector)) / np.std(input_vector)
    return input_vector

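To give a feel for the output shape, here is a minimal usage sketch; the file name is a placeholder, and the values of 26 MFCC coefficients and 9 context frames are assumptions that follow the common DeepSpeech configuration:

n_inp, n_ctx = 26, 9   # assumed number of MFCC coefficients and context frames
features = audiofile_to_vector("sample.wav", n_inp, n_ctx)   # placeholder file name
# Each row holds (2*9 + 1) * 26 = 494 values: the current frame plus
# 9 past and 9 future frames, flattened and normalized
print(features.shape)   # (number of time slices, 494)
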
We first read the .wav file and extract the MFCC features into mfcc_ft. Zero-valued dummy context rows are concatenated to the front and back of the feature matrix, and the NumPy as_strided function then gathers the past and future frames around each time slice. The audiofile_to_vector function finally returns the normalized MFCC features together with their past and future context. Next, we will look at how we extract the source audio and the transcribed text from the dataset for our batch training:

import os
import re

def get_wav_trans(fpath, X, y):
    files = os.listdir(fpath)
    for fname in files:
        next_path = fpath + "/" + fname
        if os.path.isdir(next_path):
            # Recurse into subdirectories
            get_wav_trans(next_path, X, y)
        else:
            if fname.endswith('wav'):
                # Look for the transcript file that matches this .wav file
                fname_without_ext = fname.split(".")[0]
                trans_fname = fname_without_ext + ".txt"
                trans_fname_path = fpath + "/" + trans_fname
                if os.path.isfile(trans_fname_path):
                    # Extract the MFCC features with past and future context
                    mfcc_ft = audiofile_to_vector(next_path, n_inp, n_ctx)
                    with open(trans_fname_path, 'r') as content:
                        transcript = content.read()
                    # Keep only letters and apostrophes, then convert to labels
                    transcript = re.sub(regexp_alphabets, ' ', transcript).strip().lower()
                    trans_lbl = get_string2label(transcript)
                    X.append(mfcc_ft)
                    y.append(trans_lbl)

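As a quick illustration of how this function might be called, the following sketch assumes a hypothetical dataset directory of matching .wav and .txt files, and that n_inp, n_ctx, regexp_alphabets, and get_string2label (shown in the next snippet) are already defined:

X, y = [], []                        # feature matrices and label arrays
get_wav_trans("data/corpus", X, y)   # placeholder dataset path
# X[i] is the context-windowed MFCC matrix of one utterance,
# y[i] is the array of integer character labels of its transcript
print(len(X), len(y))
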
We recurse through the provided path, fpath, to extract the MFCC features of all the .wav files. The audiofile_to_vector function described earlier extracts the MFCC features of each .wav file we find, and the corresponding transcribed text is read from the matching text file. The regular expression regexp_alphabets replaces every character that is not a letter or an apostrophe with a space. The cleaned transcript is then passed to the get_string2label function, which converts the text into the integer labels that we use as targets for model training:

regexp_alphabets = "[^a-zA-Z']+"

# chars is assumed to hold the characters a-z plus the space and the apostrophe,
# for example: chars = "abcdefghijklmnopqrstuvwxyz '"
cnt = 0
def get_label(ch):
    # Assign the next free integer label
    global cnt
    label = cnt
    cnt += 1
    return label

chr2lbl = {c: get_label(c) for c in list(chars)}
lbl2chr = {chr2lbl[c]: c for c in list(chars)}

def get_string2label(strval):
    # Convert a string into an array of integer character labels
    strval = strval.lower()
    idlist = []
    for c in list(strval):
        if c in chr2lbl:
            idlist.append(chr2lbl[c])
    return np.array(idlist)

def get_label2string(lblarr):
    # Convert an array of integer labels back into a string
    strval = []
    for idv in lblarr:
        strval.append(lbl2chr[idv])
    return ''.join(strval)

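A quick round trip through these helpers might look as follows, assuming chars holds the characters a-z, the space, and the apostrophe:

labels = get_string2label("Hello world")
print(labels)                     # one integer label per character
print(get_label2string(labels))   # prints the lowercased string: hello world
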
The get_string2label function converts a string into an array of integer labels using the chr2lbl dictionary, which maps the characters a-z (plus the space and the apostrophe) to integer values. Similarly, the get_label2string function converts a list of labels back into the original string with the reverse mapping, lbl2chr. Next, we will look at creating the DeepSpeech model.
