How to do it...

The strategy discussed above is coded as follows (the code file is available as Genre_classification.ipynb on GitHub):

  1. Download the dataset and import the relevant packages:
import sys, re, numpy as np, pandas as pd, music21, IPython, pickle, librosa, librosa.display, os
import matplotlib.pyplot as plt
from glob import glob
from tqdm import tqdm
from keras.utils import np_utils
  2. Loop through the audio files to extract the mel spectrogram features of each audio clip, and store the corresponding genre as the output:
song_specs = []
genres = []
for genre in os.listdir('...'): # Path to the genres folder
    song_folder = '...' # Path to the songs folder for this genre
    for song in os.listdir(song_folder):
        if song.endswith('.au'):
            signal, sr = librosa.load(os.path.join(song_folder, song), sr=16000)
            melspec = librosa.feature.melspectrogram(signal, sr=sr).T[:1280,]
            song_specs.append(melspec)
            genres.append(genre)
            print(song)
    print('Done with:', genre)

In the preceding code, we load each audio file and extract the mel spectrogram features of its signal. We then store the mel features in the input array and the corresponding genre in the output array.
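If you want a quick sanity check before moving on, you can inspect the number of extracted clips and the shape of a single spectrogram; the 128 below comes from librosa's default number of mel bands and is illustrative only:

print(len(song_specs), len(genres))   # the number of feature arrays and labels should match
print(song_specs[0].shape)            # roughly (time_frames, 128) with librosa's default mel settings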

  3. Visualize the spectrograms:
plt.subplot(121)
librosa.display.specshow(librosa.power_to_db(song_specs[302].T),
                         y_axis='mel', x_axis='time')
plt.title('Classical audio spectrogram')
plt.subplot(122)
librosa.display.specshow(librosa.power_to_db(song_specs[402].T),
                         y_axis='mel', x_axis='time')
plt.title('Rock audio spectrogram')

The following is the output of the preceding code:

You can see that there is a distinct difference between the classical audio spectrogram and the rock audio spectrogram.

  4. Define the input and output arrays:
song_specs = np.array(song_specs)

song_specs2 = []
for i in range(len(song_specs)):
    tmp = song_specs[i]
    song_specs2.append(tmp[:900, :])
song_specs2 = np.array(song_specs2)

Convert the output classes into one-hot encoded versions:

genre_one_hot = pd.get_dummies(genres)
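If you want to confirm the encoding, a quick check might look like this; ten columns are expected, assuming there are ten genre folders (matching the ten-way softmax defined later):

print(genre_one_hot.shape)           # expected: (number_of_songs, 10)
print(list(genre_one_hot.columns))   # the genre names, one per column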
  5. Create the train and test datasets:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(song_specs2, np.array(genre_one_hot),
                                                     test_size=0.1, random_state=42)
  6. Build and compile the model:
from keras.models import Model
from keras.layers import (Input, Conv1D, BatchNormalization, MaxPooling1D,
                          GlobalMaxPooling1D, Dense, Dropout)
import keras

input_shape = (900, 128) # must match song_specs2.shape[1:], that is, 900 time steps x 128 mel bands
inputs = Input(input_shape)
x = inputs
levels = 64
for level in range(7):
    x = Conv1D(levels, 3, activation='relu')(x)
    x = BatchNormalization()(x)
    x = MaxPooling1D(pool_size=2, strides=2)(x)
    levels *= 2
x = GlobalMaxPooling1D()(x)
for fc in range(2):
    x = Dense(256, activation='relu')(x)
    x = Dropout(0.5)(x)
labels = Dense(10, activation='softmax')(x)

Note that Conv1D in the preceding code works in a manner very similar to Conv2D; the difference is that Conv1D slides a one-dimensional filter along a single axis (here, time), whereas Conv2D slides a two-dimensional filter over both spatial axes:

model = Model(inputs=[inputs], outputs=[labels])
adam = keras.optimizers.Adam(lr=0.0001)
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
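To see this difference concretely, here is a minimal, standalone sketch (unrelated to the genre model, with arbitrary example shapes) contrasting the two layer types; the exact printed form of the shapes depends on your Keras/TensorFlow version:

from keras.layers import Input, Conv1D, Conv2D

# Conv1D slides a 1-D window of length 3 along the time axis of a (steps, channels) input
seq = Input((900, 128))
print(Conv1D(64, 3)(seq).shape)        # roughly (None, 898, 64)

# Conv2D slides a 3x3 window over both spatial axes of a (height, width, channels) input
img = Input((64, 64, 1))
print(Conv2D(64, (3, 3))(img).shape)   # roughly (None, 62, 62, 64)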
  7. Fit the model:
history = model.fit(x_train, y_train, batch_size=128, epochs=100, verbose=1,
                    validation_data=(x_test, y_test))

Once training is complete, we can see that the model classifies songs with an accuracy of ~60% on the test dataset.
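You can read the same figure from the History object returned by fit; a minimal check is shown below (on older Keras versions the key may be 'val_acc' instead of 'val_accuracy'):

# Validation accuracy of the final epoch
print(history.history['val_accuracy'][-1])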

  8. Extract the output from the pre-final layer of the model:
from keras.models import Model
layer_name = 'dense_14' # name of the penultimate Dense layer; confirm it with model.summary(), as it can differ per session
intermediate_layer_model = Model(inputs=model.input, outputs=model.get_layer(layer_name).output)
intermediate_output = intermediate_layer_model.predict(song_specs2)

The preceding code produces the output of the pre-final (penultimate) Dense layer for every song.
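As a quick check, the resulting embedding matrix should contain one 256-dimensional row per song, since the penultimate Dense layer has 256 units:

print(intermediate_output.shape)   # expected: (number_of_songs, 256)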

  9. Reduce the embeddings to two dimensions using t-SNE so that they can be plotted on a chart:
from sklearn.manifold import TSNE
tsne_model = TSNE(n_components=2, verbose=1, random_state=0)
tsne_img_label = tsne_model.fit_transform(intermediate_output)
tsne_df = pd.DataFrame(tsne_img_label, columns=['x', 'y'])
tsne_df['genres'] = genres
  10. Plot the t-SNE output:
from ggplot import *
chart = ggplot(tsne_df, aes(x='x', y='y', color='genres')) + geom_point(size=70, alpha=0.5)
chart

The following is the chart for the preceding code:

From the preceding diagram, we can see that audio recordings of similar genres are located close together. This means we can now classify a new song into one of the possible genres automatically, without manual inspection. However, if the predicted probability of an audio clip belonging to a genre is not very high, it can be routed for manual review so that misclassifications remain uncommon.
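A simple way to implement such a review queue is to threshold the predicted probabilities; the following is a minimal sketch, where the 0.7 cut-off is an assumed value you would tune to your own tolerance for misclassification:

probs = model.predict(x_test)            # shape: (number_of_test_songs, 10)
confidence = probs.max(axis=1)           # highest predicted genre probability per song
needs_review = confidence < 0.7          # assumed threshold; flag low-confidence songs for manual review
print('Songs flagged for manual review:', needs_review.sum())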
