Improving classification performance with mel frequency cepstral coefficients

We have already learned that the FFT points us in the right direction, but by itself it is not enough to arrive at a classifier that reliably organizes our scrambled directory of songs into individual genre directories. We need a somewhat more advanced version of it.

At this point, we have to do some more research. Other people might have faced similar challenges in the past and already found solutions that could help us. And indeed, there is even a yearly conference organized by the International Society for Music Information Retrieval (ISMIR) that is devoted to these kinds of questions. Apparently, Automatic Music Genre Classification (AMGC) is an established subfield of music information retrieval, and glancing over some of the AMGC papers, we can see that there is a good amount of work that might help us.

One technique that seems to be applied successfully in many cases is called mel frequency cepstral (MFC) coefficients. The MFC encodes the power spectrum of a sound, which is the power of each frequency the sound contains. It is calculated as the Fourier transform of the logarithm of the signal's spectrum. If that sounds too complicated, simply remember that the name cepstrum originates from spectrum, with the first four characters reversed. MFC has been used successfully in speech and speaker recognition. Let's see whether it also works for us.

We are fortunate in that someone else already needed exactly this and published an implementation of it as part of the python_speech_features module. We can install it easily with pip; afterward, its mfcc() function calculates the MFC coefficients for us.
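On the command line, the installation is typically a one-liner (assuming a standard pip setup):

pip install python_speech_features

With that in place, computing the coefficients for one of our jazz samples looks as follows: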

>>> from pathlib import Path
>>> import numpy as np
>>> import scipy.io.wavfile
>>> from python_speech_features import mfcc
>>> fn = Path(GENRE_DIR) / 'jazz' / 'jazz.00000.wav'
>>> sample_rate, X = scipy.io.wavfile.read(fn)
>>> ceps = mfcc(X)
>>> print(ceps.shape)
(4135, 13)

ceps contains 13 coefficients (the default value of the numcep parameter of mfcc()) for each of the 4135 frames of the song. Taking all of this data would overwhelm our classifier. What we can do instead is average each coefficient over all the frames. Assuming that the start and end of each song are possibly less genre-specific than its middle part, we also ignore the first and last 10 percent:

>>> num_ceps = ceps.shape[0]
>>> np.mean(ceps[int(num_ceps*0.1):int(num_ceps*0.9)], axis=0)
array([ 16.43787597,   7.44767565, -13.48062285,  -7.49451887,
        -8.14466849,  -4.79407047,  -5.53101133,  -5.42776074,
        -8.69278344,  -6.41223865,  -3.01527269,  -2.75974429, -3.61836327])

Granted, the benchmark dataset we will be using contains only the first 30 seconds of each song, so cutting off the last 10 percent is not strictly necessary. We do it anyway, so that our code also works on other datasets, which are most likely not truncated.

As with the FFT features, we want to cache the MFCC features once they have been generated and read them back in later, instead of recreating them every time we train our classifier.

This leads to the following code:

def create_ceps(fn):
    sample_rate, X = scipy.io.wavfile.read(fn)
    # np.save() appends '.npy', so this writes, for example, 'jazz.00000.ceps.npy'
    np.save(Path(fn).with_suffix('.ceps'), mfcc(X))

for wav_fn in Path(GENRE_DIR).glob('**/*.wav'):
    create_ceps(wav_fn)

def read_ceps(genre_list, base_dir=GENRE_DIR):
    X = []
    y = []
    for label, genre in enumerate(genre_list):
        genre_dir = Path(base_dir) / genre
        for fn in genre_dir.glob("*.ceps.npy"):
            ceps = np.load(fn)
            num_ceps = len(ceps)
            # Average each coefficient over the middle 80 percent of the frames
            X.append(np.mean(ceps[int(num_ceps / 10):int(num_ceps * 9 / 10)],
                             axis=0))
            y.append(label)
    return np.array(X), np.array(y)
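The training and evaluation itself can reuse the same machinery as the FFT experiments. Purely as an illustration of how the pieces fit together, here is a minimal sketch rather than the chapter's full pipeline, feeding the cached features into a cross-validated logistic regression; the genre list is an assumption and should match the subdirectories of GENRE_DIR:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical genre list; adjust it to the genres you actually use
genre_list = ["classical", "jazz", "country", "pop", "rock", "metal"]

# Load the averaged MFCC features and their numeric labels
X, y = read_ceps(genre_list)

# A plain logistic regression is enough for 13 features per song
clf = LogisticRegression(max_iter=1000)

# 10-fold cross-validation gives a quick estimate of the accuracy
scores = cross_val_score(clf, X, y, cv=10)
print("Mean accuracy: %.3f" % scores.mean())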

We get the following promising results with a classifier that uses only 13 features per song:

The classification performance for all genres has improved. Classical and metal are at almost 1.0 AUC. And indeed, the confusion matrix in the following plot looks much better now. We can clearly see the diagonal, showing that the classifier manages to classify the genres correctly in most cases. This classifier is actually quite usable for solving our initial task:
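If you want to recreate such a confusion matrix yourself, a minimal sketch could look like the following; it reuses X, y, and genre_list from the sketch above and uses scikit-learn's ConfusionMatrixDisplay rather than the chapter's own plotting helper:

import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import cross_val_predict

# Cross-validated predictions, so every song is predicted by a model
# that did not see it during training
y_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=10)

# Plot the confusion matrix with the genre names on both axes
ConfusionMatrixDisplay.from_predictions(y, y_pred, display_labels=genre_list)
plt.show()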

If we want to improve on this, the confusion matrix quickly tells us what to focus on: the non-white spots in the off-diagonal cells. For instance, there is a darker spot where we mislabelled rock songs as jazz with considerable probability. To fix this, we would probably need to dive deeper into the songs and extract things such as drum patterns and similar genre-specific characteristics. Also, while glancing over the ISMIR papers, we read about Auditory Filterbank Temporal Envelope (AFTE) features, which seem to outperform MFCC features in certain situations. Maybe we should have a look at them as well?

The nice thing is that, equipped only with ROC curves and confusion matrices, we are free to pull in other experts' knowledge in the form of feature extractors without having to fully understand their inner workings. Our measurement tools will always tell us when the direction is right and when to change it. Of course, being machine learners who are eager to learn, we will always have the feeling that there is an exciting algorithm buried somewhere in the black box of our feature extractors, just waiting to be understood.
