Preparation of audio data

This section uses many terms from the signal processing world. We will explain some of them, but we don't expect the reader to be familiar with all of the others; this book is about deep learning, so we encourage curious readers to research on their own the signal processing notions that are not thoroughly explained here.

We will use the following parameters (taken from the paper) for the spectral analysis and processing of the .wav files:

| Variable Name | Description | Value |
| --- | --- | --- |
| N_FFT | Number of Fourier transform points | 1024 |
| PREEMPHASIS | Coefficient of the pre-emphasis filter, which gives more weight to the high frequencies in the signal | 0.97 |
| SAMPLING_RATE | Sampling rate | 16000 |
| WINDOW_TYPE | Type of window used to compute the Fourier transform | 'hann' |
| FRAME_LENGTH | Length of the analysis window | 50 ms |
| FRAME_SHIFT | Temporal shift between two successive analysis windows | 12.5 ms |
| N_MEL | Number of mel bands | 80 |
| r | Reduction factor | 5 |

Even though the native sampling rate of the audio signals is 22.05 kHz, we have decided to use 16 kHz to reduce the number of computational operations. Likewise, the paper suggests 2,048 Fourier transform points; here, we use 1,024 points to make the task even less computationally demanding.
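To make the following code snippets self-contained, here is a minimal sketch of how these constants could be declared. The conversion of FRAME_LENGTH and FRAME_SHIFT from milliseconds to samples follows common Tacotron implementations, and REF_DB and MAX_DB are the decibel normalisation bounds used later; their values (20 and 100) are typical defaults rather than values given in the table:

import librosa
import numpy as np

# spectral analysis constants (values from the table above)
N_FFT = 1024
PREEMPHASIS = 0.97
SAMPLING_RATE = 16000
WINDOW_TYPE = 'hann'
FRAME_LENGTH = 0.05      # 50 ms
FRAME_SHIFT = 0.0125     # 12.5 ms
N_MEL = 80
r = 5                    # reduction factor

# STFT parameters in samples, derived from the frame settings above
HOP_LENGTH = int(SAMPLING_RATE * FRAME_SHIFT)    # 200 samples
WIN_LENGTH = int(SAMPLING_RATE * FRAME_LENGTH)   # 800 samples

# decibel normalisation bounds (assumed typical values, not from the table)
REF_DB = 20
MAX_DB = 100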

The Tacotron model optimizes two objective functions: one for the mel-spectrogram output that comes out of the decoder RNN, and the other for the linear spectrogram output produced by the post-processing CBHG applied to the mel-spectrogram. Therefore, we need to prepare these two types of output.

First, from the file path of a .wav file, we return both the spectrogram and the mel-spectrogram of the signal:

import os

import librosa
import numpy as np


def get_spectros(filepath, preemphasis, n_fft,
                 hop_length, win_length,
                 sampling_rate, n_mel,
                 ref_db, max_db):
    # load the waveform and resample it to the target sampling rate
    waveform, sampling_rate = librosa.load(filepath,
                                           sr=sampling_rate)

    # trim leading and trailing silence
    waveform, _ = librosa.effects.trim(waveform)

    # apply pre-emphasis to boost the high frequencies
    waveform = np.append(waveform[0],
                         waveform[1:] - preemphasis * waveform[:-1])

    # compute the short-time Fourier transform
    # (librosa.stft uses a Hann window by default, matching WINDOW_TYPE)
    stft_matrix = librosa.stft(y=waveform,
                               n_fft=n_fft,
                               hop_length=hop_length,
                               win_length=win_length)

    # compute the magnitude and mel spectrograms
    spectro = np.abs(stft_matrix)

    mel_transform_matrix = librosa.filters.mel(sr=sampling_rate,
                                               n_fft=n_fft,
                                               n_mels=n_mel,
                                               htk=True)
    mel_spectro = np.dot(mel_transform_matrix,
                         spectro)

    # convert to the decibel scale
    mel_spectro = 20 * np.log10(np.maximum(1e-5, mel_spectro))
    spectro = 20 * np.log10(np.maximum(1e-5, spectro))

    # normalise the spectrograms to the [0, 1] range
    mel_spectro = np.clip((mel_spectro - ref_db + max_db) / max_db,
                          1e-8, 1)
    spectro = np.clip((spectro - ref_db + max_db) / max_db,
                      1e-8, 1)

    # transpose the spectrograms to have time as the first dimension
    # and frequency as the second dimension
    mel_spectro = mel_spectro.T.astype(np.float32)
    spectro = spectro.T.astype(np.float32)

    return mel_spectro, spectro
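As a quick sanity check, the function can be called on a single file; the 'sample.wav' path below is only a placeholder. With the constants defined earlier, the linear spectrogram has 1 + N_FFT // 2 = 513 frequency bins per frame, and the mel-spectrogram has 80:

# 'sample.wav' is a placeholder path, not a file from the dataset
mel_spectro, spectro = get_spectros('sample.wav', PREEMPHASIS, N_FFT,
                                    HOP_LENGTH, WIN_LENGTH,
                                    SAMPLING_RATE, N_MEL,
                                    REF_DB, MAX_DB)
print(mel_spectro.shape)  # (number of frames, 80)
print(spectro.shape)      # (number of frames, 513)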

Then, we pad the time dimension of both spectrograms whenever their length is not a multiple of r, so that it becomes one:

def get_padded_spectros(filepath):
    filename = os.path.basename(filepath)
    # the remaining arguments are the module-level constants defined earlier
    mel_spectro, spectro = get_spectros(filepath, PREEMPHASIS, N_FFT,
                                        HOP_LENGTH, WIN_LENGTH,
                                        SAMPLING_RATE, N_MEL,
                                        REF_DB, MAX_DB)
    t = mel_spectro.shape[0]
    # number of frames to add so that t becomes a multiple of r (reduction)
    nb_paddings = r - (t % r) if t % r != 0 else 0
    mel_spectro = np.pad(mel_spectro,
                         [[0, nb_paddings], [0, 0]],
                         mode="constant")
    spectro = np.pad(spectro,
                     [[0, nb_paddings], [0, 0]],
                     mode="constant")
    # group r consecutive mel frames into a single row
    return filename, mel_spectro.reshape((-1, N_MEL * r)), spectro
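The final reshape implements the reduction factor: every group of r = 5 consecutive mel frames is flattened into a single row of 80 * 5 = 400 values, so the decoder predicts r frames per step. A small illustration of the shape change, using random data rather than real audio:

import numpy as np

dummy_mel = np.random.rand(120, 80).astype(np.float32)  # 120 frames, 80 mel bands
packed = dummy_mel.reshape((-1, 80 * 5))
print(packed.shape)  # (24, 400): 24 decoder steps, 5 frames each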

get_padded_spectros is applied to all of the .wav files of the dataset through the 1_create_audio_dataset.py script, which generates the spectrograms and mel-spectrograms as arrays, as well as the decoder's input. The three arrays are then split into training and testing datasets, in the same way as the processed text data (see the sketch after the next paragraph).

Note that running the script can take quite a long time (up to a few hours). That is why the resulting data is also pickled, so that we don't need to reprocess the files every single time we want to train a model.
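What 1_create_audio_dataset.py does can be sketched roughly as follows; the dataset path, pickle file names, and the 90/10 split ratio are assumptions rather than values taken from the actual script, and the construction of the decoder's input is omitted for brevity:

import glob
import pickle

# gather all .wav files (the dataset directory is a placeholder)
wav_files = sorted(glob.glob('data/wavs/*.wav'))

filenames, mel_spectros, spectros = [], [], []
for path in wav_files:
    filename, mel_spectro, spectro = get_padded_spectros(path)
    filenames.append(filename)
    mel_spectros.append(mel_spectro)
    spectros.append(spectro)

# simple train/test split (90/10 is an assumed ratio)
split = int(0.9 * len(wav_files))
train_data = (filenames[:split], mel_spectros[:split], spectros[:split])
test_data = (filenames[split:], mel_spectros[split:], spectros[split:])

# pickle the results so the audio files never have to be reprocessed
with open('audio_train.pkl', 'wb') as f:
    pickle.dump(train_data, f)
with open('audio_test.pkl', 'wb') as f:
    pickle.dump(test_data, f)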
