Feature engineering and representation of audio events

To build a robust classification model, we need robust feature representations of our raw audio data. We will leverage some of the feature engineering techniques learned in the previous section. The code snippets used in this section are also available in the Feature Engineering.ipynb Jupyter Notebook, in case you want to run the examples yourself. We will reuse all the libraries we imported previously, and we will also leverage joblib here to save our features to disk:

from sklearn.externals import joblib 

Next, we will load all our file names and define some utility functions to read in audio data and to get window indices for audio sub-samples, both of which we will be leveraging shortly:

# get all file names 
ROOT_DIR = 'UrbanSound8K/audio/' 
files = glob.glob(ROOT_DIR+'/**/*') 

# load raw audio data
def get_sound_data(path, sr=22050):
    data, fsr = sf.read(path)
    data_resample = librosa.resample(data.T, fsr, sr)
    if len(data_resample.shape) > 1:
        data_resample = np.average(data_resample, axis=0)
    return data_resample, sr

# function to get start and end indices for audio sub-samples
def windows(data, window_size):
    start = 0
    while start < len(data):
        yield int(start), int(start + window_size)
        start += (window_size / 2)
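Before moving on, it helps to quickly sanity check these utilities. The following is a minimal sketch (it assumes the glob pattern above picked up at least one audio file) that loads a single clip and prints the sub-sample windows our generator would produce for it:

# quick sanity check on a single clip (uses the first globbed file)
sample_path = files[0]
sample_data, sample_sr = get_sound_data(sample_path, sr=22050)
print('Total samples:', len(sample_data), 'at', sample_sr, 'Hz')

# list the (start, end) indices produced for a 64-frame window
# note: the last window may extend past the clip and will be
# filtered out later by a length check
window_size = 512 * (64 - 1)
for start, end in windows(sample_data, window_size):
    print('window:', start, '->', end)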

The feature engineering strategy we will be following is slightly complex, but we will try to explain it here concisely. We have already seen that our audio data samples are of varying lengths. However, if we want to build a robust classifier, our features need to have a consistent size for every sample. Hence, we will extract fixed-length audio sub-samples from each audio file and then extract features from each of these sub-samples.

We will be using a total of three feature engineering techniques to build three feature representation maps, which will ultimately give us a three-dimensional image feature map for each of our audio sub-samples. The following diagram depicts the workflow we will be adopting:

The idea for this came from an excellent paper by Karol J. Piczak, Environmental sound classification with convolutional neural networks (https://ieeexplore.ieee.org/document/7324337/), IEEE 2015. He leveraged mel spectrograms to generate the necessary features that can be consumed by CNNs for feature extraction. However, we have considered a couple more transformations for the final feature maps.

The first step is to define the total number of frames (columns) to be 64 and bands (rows) to be 64, which forms the dimensions of each of our feature maps (64 x 64). Then, based on this, we extract windows of audio data, forming sub-samples from each audio data sample.
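To make these numbers concrete, here is a small sketch (assuming librosa's default hop length of 512 samples and the 22,050 Hz sampling rate we resample everything to) showing why a window of 512 * (frames - 1) raw samples produces exactly 64 spectrogram frames:

import numpy as np
import librosa

bands, frames = 64, 64
window_size = 512 * (frames - 1)       # 32,256 samples per sub-sample
duration_sec = window_size / 22050     # roughly 1.46 seconds of audio
print(window_size, round(duration_sec, 2))

# a silent dummy window is enough to verify the output dimensions
dummy_window = np.zeros(window_size)
melspec = librosa.feature.melspectrogram(y=dummy_window, sr=22050, n_mels=bands)
print(melspec.shape)    # (64, 64) -> bands x frames

Each sub-sample therefore covers roughly 1.46 seconds of audio and maps neatly onto a 64 x 64 grid.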

For each audio sub-sample, we start by computing a mel spectrogram. From this, we derive three feature maps: the log-scaled mel spectrogram itself; the average of the log-scaled mel spectrograms of the harmonic and percussive components of the sub-sample; and the delta (derivative) of the log-scaled mel spectrogram. Each of these feature maps can be represented as a 64 x 64 image, and by stacking them we get a three-dimensional feature map of dimensions (64, 64, 3) for each audio sub-sample. Let's define the function for this workflow now:

def extract_features(file_names, bands=64, frames=64): 
    window_size = 512 * (frames - 1)   
    log_specgrams_full = [] 
    log_specgrams_hp = [] 
    class_labels = [] 
    
    # for each audio sample
    for fn in file_names:
        file_name = fn.split('/')[-1]
        class_label = file_name.split('-')[1]
        sound_data, sr = get_sound_data(fn, sr=22050)

        # for each audio signal sub-sample window of data
        for (start, end) in windows(sound_data, window_size):
            if len(sound_data[start:end]) == window_size:
                signal = sound_data[start:end]

                # get the log-scaled mel-spectrogram
                melspec_full = librosa.feature.melspectrogram(signal,
                                                              n_mels=bands)
                logspec_full = librosa.logamplitude(melspec_full)
                logspec_full = logspec_full.T.flatten()[:, np.newaxis].T

                # get the log-scaled, averaged values for the
                # harmonic and percussive components
                y_harmonic, y_percussive = librosa.effects.hpss(signal)
                melspec_harmonic = librosa.feature.melspectrogram(y_harmonic,
                                                                  n_mels=bands)
                melspec_percussive = librosa.feature.melspectrogram(y_percussive,
                                                                    n_mels=bands)
                logspec_harmonic = librosa.logamplitude(melspec_harmonic)
                logspec_percussive = librosa.logamplitude(melspec_percussive)
                logspec_harmonic = logspec_harmonic.T.flatten()[:, np.newaxis].T
                logspec_percussive = logspec_percussive.T.flatten()[:, np.newaxis].T
                logspec_hp = np.average([logspec_harmonic, logspec_percussive],
                                        axis=0)

                log_specgrams_full.append(logspec_full)
                log_specgrams_hp.append(logspec_hp)
                class_labels.append(class_label)

    # create the first two feature maps
    log_specgrams_full = np.asarray(log_specgrams_full).reshape(
                             len(log_specgrams_full), bands, frames, 1)
    log_specgrams_hp = np.asarray(log_specgrams_hp).reshape(
                             len(log_specgrams_hp), bands, frames, 1)
    features = np.concatenate((log_specgrams_full,
                               log_specgrams_hp,
                               np.zeros(np.shape(log_specgrams_full))),
                              axis=3)

    # create the third feature map which is the delta (derivative)
    # of the log-scaled mel-spectrogram
    for i in range(len(features)):
        features[i, :, :, 2] = librosa.feature.delta(features[i, :, :, 0])

    return np.array(features), np.array(class_labels, dtype=np.int)

We are now ready to use this function. We will apply it to all 8,732 of our audio samples to create feature maps from the many sub-samples extracted from them, following the workflow strategy we discussed earlier:

features, labels = extract_features(files) 
features.shape, labels.shape 
((30500, 64, 64, 3), (30500,)) 

We get a total of 30,500 feature maps from our 8,732 audio data files. This is excellent and, as we discussed earlier, each feature map is of dimensions (64, 64, 3). Let's now look at the overall class representation for our audio sources based on these 30,500 data points:

from collections import Counter 
Counter(labels) 
Counter({0: 3993, 1: 913, 2: 3947, 3: 2912, 4: 3405, 
         5: 3910, 6: 336, 7: 3473, 8: 3611, 9: 4000}) 
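To make this easier to read, we can attach the UrbanSound8K category names to the numeric class IDs. The following is an optional sketch; the id_to_name mapping simply mirrors the dataset's metadata and the class_map dictionary we will use for plotting shortly:

# map numeric class IDs to UrbanSound8K category names
id_to_name = {0: 'air_conditioner', 1: 'car_horn', 2: 'children_playing',
              3: 'dog_bark', 4: 'drilling', 5: 'engine_idling',
              6: 'gun_shot', 7: 'jackhammer', 8: 'siren', 9: 'street_music'}

for class_id, count in sorted(Counter(labels).items()):
    print('{:<16s} {:>5d}'.format(id_to_name[class_id], count))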

We can see that the overall distribution of data points across the different categories is reasonably uniform. Categories 1 (car_horn) and 6 (gun_shot) are noticeably under-represented compared to the others; this is expected, because the audio clips for these categories are typically much shorter in duration than those of the other categories. Let's go ahead and visualize these feature maps now:

class_map = {'0' : 'air_conditioner', '1' : 'car_horn',
             '2' : 'children_playing', '3' : 'dog_bark',
             '4' : 'drilling', '5' : 'engine_idling',
             '6' : 'gun_shot', '7' : 'jackhammer',
             '8' : 'siren', '9' : 'street_music'}

categories = list(set(labels))
sample_idxs = [np.where(labels == label_id)[0][0] for label_id in categories]
feature_samples = features[sample_idxs]

plt.figure(figsize=(16, 4))
for index, (feature_map, category) in enumerate(zip(feature_samples, categories)):
    plt.subplot(2, 5, index+1)
    plt.imshow(np.concatenate((feature_map[:, :, 0],
                               feature_map[:, :, 1],
                               feature_map[:, :, 2]), axis=1),
               cmap='viridis')
    plt.title(class_map[str(category)])
plt.tight_layout()
t = plt.suptitle('Visualizing Feature Maps for Audio Clips')

The feature maps will appear as follows:

The preceding diagram shows us what some sample feature maps look like for each audio category and, as is evident, each feature map is a three-dimensional image. We will now save these base features to disk:

joblib.dump(features, 'base_features.pkl') 
joblib.dump(labels, 'dataset_labels.pkl') 
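As an optional sanity check, we can reload the persisted arrays with joblib.load and confirm that the shapes survived the round trip to disk (a minimal sketch):

# reload the persisted feature maps and labels
base_features = joblib.load('base_features.pkl')
dataset_labels = joblib.load('dataset_labels.pkl')
print(base_features.shape, dataset_labels.shape)
# expected output: (30500, 64, 64, 3) (30500,)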

These base features will act as a starting point for further feature engineering in the next section, where we will unleash the true power of transfer learning.
