Preparation of audio data

This section uses many terms from the signal processing world. We will explain some of them, but we don't expect the reader to be familiar with all of the others; this book is about deep learning, so we encourage curious readers to research on their own the signal processing notions that are not thoroughly explained here.

We will use the following parameters (taken from the paper) for the spectral analysis and processing of the .wav files:

| Variable Name | Description | Value |
| --- | --- | --- |
| N_FFT | Number of Fourier transform points | 1024 |
| PREEMPHASIS | Coefficient of the pre-emphasis filter, which gives more weight to the high frequencies in the signal | 0.97 |
| SAMPLING_RATE | Sampling rate | 16000 |
| WINDOW_TYPE | Type of window used to compute the Fourier transform | 'hann' |
| FRAME_LENGTH | Length of the analysis window | 50 ms |
| FRAME_SHIFT | Temporal shift between two successive analysis windows | 12.5 ms |
| N_MEL | Number of mel bands | 80 |
| r | Reduction factor | 5 |

Even though the native sampling rate of the audio signals is 22.05 kHz, we have decided to use 16 kHz to reduce the number of computational operations. Likewise, the paper suggests 2,048 Fourier transform points; here, we use 1,024 points to make the task even less computationally demanding.
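To make the following code snippets self-contained, here is a minimal sketch of how these constants could be declared. The conversion of FRAME_LENGTH and FRAME_SHIFT from milliseconds to samples follows common Tacotron implementations, and REF_DB and MAX_DB are the decibel normalisation bounds used later; their values (20 and 100) are typical defaults rather than values given in the table:

import librosa
import numpy as np

# spectral analysis constants (values from the table above)
N_FFT = 1024
PREEMPHASIS = 0.97
SAMPLING_RATE = 16000
WINDOW_TYPE = 'hann'
FRAME_LENGTH = 0.05      # 50 ms
FRAME_SHIFT = 0.0125     # 12.5 ms
N_MEL = 80
r = 5                    # reduction factor

# STFT parameters in samples, derived from the frame settings above
HOP_LENGTH = int(SAMPLING_RATE * FRAME_SHIFT)    # 200 samples
WIN_LENGTH = int(SAMPLING_RATE * FRAME_LENGTH)   # 800 samples

# decibel normalisation bounds (assumed typical values, not from the table)
REF_DB = 20
MAX_DB = 100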

The Tacotron model optimizes two objective functions: one for the mel-spectrogram output that comes out of the decoder RNN, and the other for the linear spectrogram output produced by the post-processing CBHG applied to the mel-spectrogram. Therefore, we need to prepare these two types of output.

First, from the file path of a .wav file, we return both the spectrogram and the mel-spectrogram of the signal:

import os

import librosa
import numpy as np


def get_spectros(filepath, preemphasis, n_fft,
                 hop_length, win_length,
                 sampling_rate, n_mel,
                 ref_db, max_db):
    # load the waveform and resample it to the target sampling rate
    waveform, sampling_rate = librosa.load(filepath,
                                           sr=sampling_rate)

    # trim leading and trailing silence
    waveform, _ = librosa.effects.trim(waveform)

    # apply pre-emphasis to boost the high frequencies
    waveform = np.append(waveform[0],
                         waveform[1:] - preemphasis * waveform[:-1])

    # compute the short-time Fourier transform
    # (librosa.stft uses a Hann window by default, matching WINDOW_TYPE)
    stft_matrix = librosa.stft(y=waveform,
                               n_fft=n_fft,
                               hop_length=hop_length,
                               win_length=win_length)

    # compute the magnitude and mel spectrograms
    spectro = np.abs(stft_matrix)

    mel_transform_matrix = librosa.filters.mel(sr=sampling_rate,
                                               n_fft=n_fft,
                                               n_mels=n_mel,
                                               htk=True)
    mel_spectro = np.dot(mel_transform_matrix,
                         spectro)

    # convert to the decibel scale
    mel_spectro = 20 * np.log10(np.maximum(1e-5, mel_spectro))
    spectro = 20 * np.log10(np.maximum(1e-5, spectro))

    # normalise the spectrograms to the [0, 1] range
    mel_spectro = np.clip((mel_spectro - ref_db + max_db) / max_db,
                          1e-8, 1)
    spectro = np.clip((spectro - ref_db + max_db) / max_db,
                      1e-8, 1)

    # transpose the spectrograms to have time as the first dimension
    # and frequency as the second dimension
    mel_spectro = mel_spectro.T.astype(np.float32)
    spectro = spectro.T.astype(np.float32)

    return mel_spectro, spectro
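As a quick sanity check, the function can be called on a single file; the 'sample.wav' path below is only a placeholder. With the constants defined earlier, the linear spectrogram has 1 + N_FFT // 2 = 513 frequency bins per frame, and the mel-spectrogram has 80:

# 'sample.wav' is a placeholder path, not a file from the dataset
mel_spectro, spectro = get_spectros('sample.wav', PREEMPHASIS, N_FFT,
                                    HOP_LENGTH, WIN_LENGTH,
                                    SAMPLING_RATE, N_MEL,
                                    REF_DB, MAX_DB)
print(mel_spectro.shape)  # (number of frames, 80)
print(spectro.shape)      # (number of frames, 513)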

Then, we pad the time dimension of both spectrograms whenever their length is not a multiple of r, so that it becomes one:

def get_padded_spectros(filepath):
    filename = os.path.basename(filepath)
    # the remaining arguments are the module-level constants defined earlier
    mel_spectro, spectro = get_spectros(filepath, PREEMPHASIS, N_FFT,
                                        HOP_LENGTH, WIN_LENGTH,
                                        SAMPLING_RATE, N_MEL,
                                        REF_DB, MAX_DB)
    t = mel_spectro.shape[0]
    # number of frames to add so that t becomes a multiple of r (reduction)
    nb_paddings = r - (t % r) if t % r != 0 else 0
    mel_spectro = np.pad(mel_spectro,
                         [[0, nb_paddings], [0, 0]],
                         mode="constant")
    spectro = np.pad(spectro,
                     [[0, nb_paddings], [0, 0]],
                     mode="constant")
    # group r consecutive mel frames into a single row
    return filename, mel_spectro.reshape((-1, N_MEL * r)), spectro
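The final reshape implements the reduction factor: every group of r = 5 consecutive mel frames is flattened into a single row of 80 * 5 = 400 values, so the decoder predicts r frames per step. A small illustration of the shape change, using random data rather than real audio:

import numpy as np

dummy_mel = np.random.rand(120, 80).astype(np.float32)  # 120 frames, 80 mel bands
packed = dummy_mel.reshape((-1, 80 * 5))
print(packed.shape)  # (24, 400): 24 decoder steps, 5 frames each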

get_padded_spectros is applied to all of the .wav files of the dataset through the 1_create_audio_dataset.py script, which generates the spectrograms and mel-spectrograms as arrays, as well as the decoder's input. The three arrays are then split into training and testing datasets, in the same way as the processed text data (see the sketch after the next paragraph).

Note that running the script can take quite a long time (up to a few hours). That is why the resulting data is also pickled, so that we don't need to reprocess the files every single time we want to train a model.
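What 1_create_audio_dataset.py does can be sketched roughly as follows; the dataset path, pickle file names, and the 90/10 split ratio are assumptions rather than values taken from the actual script, and the construction of the decoder's input is omitted for brevity:

import glob
import pickle

# gather all .wav files (the dataset directory is a placeholder)
wav_files = sorted(glob.glob('data/wavs/*.wav'))

filenames, mel_spectros, spectros = [], [], []
for path in wav_files:
    filename, mel_spectro, spectro = get_padded_spectros(path)
    filenames.append(filename)
    mel_spectros.append(mel_spectro)
    spectros.append(spectro)

# simple train/test split (90/10 is an assumed ratio)
split = int(0.9 * len(wav_files))
train_data = (filenames[:split], mel_spectros[:split], spectros[:split])
test_data = (filenames[split:], mel_spectros[split:], spectros[split:])

# pickle the results so the audio files never have to be reprocessed
with open('audio_train.pkl', 'wb') as f:
    pickle.dump(train_data, f)
with open('audio_test.pkl', 'wb') as f:
    pickle.dump(test_data, f)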
