List of Figures

Figure 2.1  A synthetic audio signal.  12
Figure 2.2  A stereo audio signal.  14
Figure 2.3  Short-term processing of an audio signal.  26
Figure 3.1  Plots of the magnitude of the spectrum of a signal consisting of three frequencies at 200, 500, and 1200 Hz.  38
Figure 3.2  A synthetic signal consisting of three frequencies is corrupted by additive noise.  40
Figure 3.3  The spectrogram of a speech signal.  41
Figure 3.4  Spectrograms of a synthetic, frequency-modulated signal for three short-term frame lengths.  42
Figure 3.5  Spectrum representations of (a) an analog signal, (b) a sampled version when the sampling frequency exceeds the Nyquist rate, and (c) a sampled version with insufficient sampling frequency. In the last case, the shifted versions of the analog spectrum are overlapping, hence the aliasing effect.  43
Figure 3.6  Spectral representations of the same three-tone (200, 500, and 3000 Hz) signal for two different sampling frequencies (8 kHz and 4 kHz).  44
Figure 3.7  Frequency response of a pre-emphasis filter for a = −0.95.  51
Figure 3.8  An example of the application of a lowpass filter on a synthetic signal consisting of three tones.  53
Figure 3.9  Example of a simple speech denoising technique applied on a segment of the diarizationExample.wav file, found in the data folder of the library of the book.  55
Figure 4.1  Mid-term feature extraction: each mid-term segment is short-term processed and statistics are computed based on the extracted feature sequence.  63
Figure 4.2  Plotting the results of featureExtractionFile(), using plotFeaturesFile(), for the six feature statistics drawn from the 6th adopted audio feature.  68
Figure 4.3  Histograms of the standard deviation by mean ratio of the short-term energy for two classes: music and speech.  72
Figure 4.4  Example of a speech segment and the respective sequence of ZCR values.  74
Figure 4.5  Histograms of the standard deviation of the ZCR for music and speech classes.  75
Figure 4.6  Sequence of entropy values for an audio signal that contains the sounds of three gunshots. Low values appear at the onset of each gunshot.  77
Figure 4.7  Histograms of the minimum value of the entropy of energy for audio segments from the genres of jazz, classical, and electronic music.  78
Figure 4.8  Histograms of the maximum value of the sequence of values of the spectral centroid, for audio segments from three classes of environmental sounds: others1, others2, and others3.  81
Figure 4.9  Histograms of the maximum value of the sequences of the spectral spread feature, for audio segments from three music genres: classical, jazz, and electronic.  82
Figure 4.10  Histograms of the standard deviation of sequences of the spectral entropy feature, for audio segments from three classes: music, speech, and others1 (low-level environmental sounds).  83
Figure 4.11  Histograms of the mean value of the sequence of spectral flux values, for audio segments from two classes: music and speech.  85
Figure 4.12  Example of the spectral rolloff sequence of an audio signal that consists of four music excerpts. The first 5 s stem from a classical music track.  87
Figure 4.13  Frequency warping function for the computation of the MFCCs.  88
Figure 4.14  Histograms of the standard deviation of the 2nd MFCC for the classes of music and speech.  91
Figure 4.15  Chromagrams for a music and a speech segment.  92
Figure 4.16  Autocorrelation, normalized autocorrelation, and detected peak for a periodic signal.  94
Figure 4.17  Histograms of the maximum value of sequences of values of the harmonic ratio for two classes of sounds (speech and others1).  96
Figure 5.1  Generic diagram of the classifier training stage.  112
Figure 5.2  Diagram of the classification process.  113
Figure 5.3  Linearly separable classes in a two-dimensional feature space.  118
Figure 5.4  Decision tree for a classification task with three classes (ω1, ω2, ω3) and three features (x1, x2, x3).  122
Figure 5.5  Decision tree for a 4-class task with Gaussian feature distributions in the two-dimensional feature space.  123
Figure 5.6  Decision tree for a musical genre classification task with two feature statistics (minimum value of the entropy of energy and mean value of the spectral flux).  124
Figure 5.7  SVM training for different values of the C parameter.  128
Figure 5.8  Classification accuracy on the training and testing datasets for different values of C.  129
Figure 5.9  Implementation of the k-NN classification procedure.  132
Figure 5.10  Binary classification task with Gaussian feature distributions and two different decision thresholds.  137
Figure 5.11  Performance of the k-NN classifier on an 8-class task, for different values of the k parameter and for two validation methods (repeated hold-out and leave-one-out).  143
Figure 5.12  Estimated performance for the 3-class musical genre classification task, for different values of the k parameter and for two evaluation methods (repeated hold-out and leave-one-out).  145
Figure 6.1  Post-segmentation stage: the output of the first stage can be (a) a sequence of hard classification decisions, Ci, i = 1, …, Nmt; or (b) a sequence of sets of posterior probability estimates, Pi(j), i = 1, …, Nmt, j = 1, …, Nc.  155
Figure 6.2  Fixed-window segmentation.  156
Figure 6.3  Fixed-window segmentation: naive merging vs Viterbi-based smoothing.  159
Figure 6.4  Example of the silence detection approach implemented in silenceDetectorUtterance().  162
Figure 6.5  Speech-silence segmenter applied on a short-duration signal.  164
Figure 6.6  Fixed-window segmentation with an embedded 4-class classifier (silence, male speech, female speech, and music).  166
Figure 6.7  A sequence of segments in the dynamic programming grid.  168
Figure 6.8  Top: Signal change detection results from a TV program. Bottom: Ground truth.  171
Figure 6.9  A clustering example in the two-dimensional feature space.  173
Figure 6.10  Silhouette example: the average Silhouette measure is maximized when the number of clusters is 4.  176
Figure 6.11  Block diagram of the speaker diarization method implemented in speakerDiarization().  178
Figure 6.12  Visualization of the speaker diarization results obtained by the speakerDiarization() function (visualization is obtained by calling the segmentationPlotResults() function).  179
Figure 8.1  Self-similarity matrix for the track ‘True Faith’ by the band New Order.  215
Figure 8.2  Approximation of the second derivative, D2, of sequence B.  218
Figure 8.3  Visualization results for the three linear dimensionality reduction approaches, applied on the musicSmallData.mat dataset.  225
Figure 8.4  gridtop topology (5 × 5).  226
Figure 8.5  Visualization of selected nodes of the SOM of the data in the musicLargeData.mat dataset.  227