1

Introduction

Abstract

This chapter has an introductory purpose. A chapter outline is provided, along with general notes on the book’s exercises and the companion software. Before we proceed, it is important to note that, although in this book the term audio does not exclude the speech signal, we are not focusing on traditional speech-related problems that have been studied by the research community for decades, e.g., speech recognition and coding.

Keywords

Audio analysis

MATLAB

In recent years we have witnessed the increasing availability of audio content via numerous distribution channels, for both commercial and non-profit purposes. The resulting wealth of data has inevitably highlighted the need for systems that are capable of analyzing the audio content in order to extract useful knowledge that can be consumed by users or subsequently exploited by other processing systems.

Before we proceed, it is important to note that, although in this book the term ‘audio’ does not exclude the speech signal, we are not focusing on traditional speech-related problems that have been studied by the research community for decades, e.g. speech recognition and coding. It is our intention to provide analysis methods that can be used to study various audio modalities and their relationships in mixed audio streams. Consider, for example, the task of segmenting a radio broadcast into homogeneous parts that contain either speech, music, or silence. The development of a solution for such a task demands that we are familiar with various audio modalities and how they affect the performance of segmentation algorithms in audio streams. In other words, we are not interested in providing solutions that are well tailored to specific audio types (e.g. the speech signal) but are not applicable to other modalities.

As with several other types of media, the automatic analysis of audio signals has been gaining increasing interest during the past decade. Depending on the storage/distribution format, the respective audio content classes, the co-existence of other media types (e.g. moving image), the user requirements, the data volume, the application context, and numerous other parameters, a diversity of applications and research trends have emerged to deal with various audio analysis tasks. The following list includes both speech and non-speech tasks so as to provide a general idea of the trends in several popular areas of speech/audio processing:

• Speech recognition: this is the task of ‘translating’ a speech signal to text using computational tools. Speech recognition is the oldest domain of audio analysis, but a detailed study of it is beyond the scope of this book. We only present generic dynamic time warping and temporal modeling techniques that can also be applied to other audio signals.

• Speaker identification, verification and diarization: these speaker-related tasks focus on designing methods that discriminate between different speakers. Speaker identification and verification can be useful in the development of secure systems, while speaker diarization, which answers the question ‘who spoke when?’, can be used in conversation summarization systems.

• Music information retrieval (MIR): due to the huge increase in the amount of available digital music data during the past few years, there has been an increasing need for the automatic analysis of this type of data. MIR focuses on automatically extracting information from the music signal for purposes such as content tagging; intelligent indexing; retrieval; browsing of music tracks; recommendation of new tracks based on music content (possibly combined with user preferences and collaborative knowledge); segmentation of music tracks; generation of summaries; and extraction of automated music transcriptions.

• Audio event detection: this is the task of detecting audio events in audio streams. There can be numerous related applications, like audio-based surveillance, violence detection, and intrusion detection, to name but a few.

• Speech emotion recognition: this is the task of predicting the speaker’s emotional state (anger, sadness, etc.) using speech analysis techniques. Emotion recognition has been gaining increasing interest during the last decade. The audio stream is used either independently or in combination with visual cues (e.g. facial features). Emotion recognition is expected to play an important role in next-generation human-computer interaction systems, but it can also be used to enhance the functionality of other systems that perform retrieval and multimedia content characterization tasks.

• Multimodal analysis of movie content: this task aims to automatically recognize events and classes in movies based on audio, visual, and textual information. The audio cues can contain rich information regarding events like the existence of music, speech, sound effects (gunshots, human fights), emotions, etc. The resulting metadata can serve indexing and fast browsing purposes in the context of next-generation multimedia systems.

The purpose of this book is to serve as a standalone introduction to audio signal analysis by providing a sufficient theoretical background for many state-of-the-art techniques, along with a large number of reproducible MATLAB examples. It is not our intention to demand that the reader be familiar with concepts from a variety of disciplines, such as signal processing and machine learning, although, of course, such knowledge improves the reading experience. However, in each chapter, we focus on providing a smooth transition from introductory issues to more advanced ones, assuming that the reader is a beginner in the field. For example, we present the classification of audio segments but, instead of assuming that the reader has knowledge of the respective pattern recognition concepts, we provide an introduction to the subject, ensuring that we: (a) complement the description with MATLAB examples and (b) stay within the audio analysis domain (e.g. by discussing a binary classifier via a speech-music discrimination example). Furthermore, the first chapters of the book introduce basic signal processing concepts like sampling and frequency representations.

1.1 The MATLAB Audio Analysis Library

Further to the necessary theoretical background, we also provide a complete set of MATLAB files that constitute the MATLAB Audio Analysis Library of this book. Where we find it useful from a pedagogical perspective, parts of the code are listed in the book. However, in most cases, the complete MATLAB code is omitted. We prefer to describe how to ‘call specific functions,’ to report on what to expect, to present and discuss the results, and so on.

The accompanying library is an important companion to the book that is aimed at helping the reader to understand the related theory and experiment with their own audio analysis solutions. A list of the available MATLAB functions, along with brief descriptions, is given in the Appendix of this book.

1.2 Outline of Chapters

Chapter 2 provides information and techniques for the basic issues related to the creation, representation, playback, recording, and storing of audio signals in MATLAB. Although the focus of the chapter is on practical issues, we also describe the basic theory of content creation. At the end of the chapter, we describe the process of breaking an audio signal into short-term windows to enable audio analysis on a short-term basis. This is in preparation for the next two chapters, as frequency representations and feature extraction both require the short-term processing stage of the signal.
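The short-term windowing idea can be sketched in a few lines. The following is a minimal illustration under our own assumptions, not code from the companion library: the signal is a synthetic tone, and the 50 ms window and 25 ms step are arbitrary choices made for the example. All variable names are hypothetical.

```matlab
% Minimal sketch of short-term windowing (illustrative values, not library code).
fs = 16000;                     % sampling rate in Hz (assumed)
t  = (0:fs-1)' / fs;            % one second of time stamps
x  = sin(2 * pi * 440 * t);     % synthetic 440 Hz tone standing in for audio

winLen = round(0.050 * fs);     % 50 ms window -> 800 samples
hop    = round(0.025 * fs);     % 25 ms step   -> 400 samples (50% overlap)

% Number of complete windows that fit in the signal.
numFrames = floor((length(x) - winLen) / hop) + 1;

% Store one window per column.
frames = zeros(winLen, numFrames);
for i = 1:numFrames
    startIdx = (i - 1) * hop + 1;
    frames(:, i) = x(startIdx : startIdx + winLen - 1);
end
```

Each column of `frames` can then be processed independently, for instance by a frequency transform or a feature extractor. Incomplete windows at the end of the signal are simply discarded in this sketch.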

In Chapter 3 we present methods for representing audio signals in the frequency domain, mostly focusing on the discrete Fourier transform. In addition, we provide a basic description of filtering techniques by means of MATLAB examples.
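As a small taste of a frequency representation, the sketch below uses MATLAB's built-in `fft` to locate the dominant frequency of a synthetic tone. The sampling rate, signal length, and tone frequency are assumptions chosen so that the tone falls exactly on a frequency bin; real recordings will rarely be this tidy.

```matlab
% Minimal sketch: magnitude spectrum of a synthetic tone via the DFT.
fs = 8000;                      % sampling rate in Hz (assumed)
N  = 800;                       % number of samples -> fs/N = 10 Hz per bin
t  = (0:N-1)' / fs;
x  = sin(2 * pi * 440 * t);     % 440 Hz tone

X    = abs(fft(x));             % magnitude of the DFT coefficients
half = X(1:N/2 + 1);            % keep the non-negative frequencies only

[~, k] = max(half);             % index of the strongest bin
peakHz = (k - 1) * fs / N;      % convert the bin index to a frequency in Hz
```

Here `peakHz` recovers the frequency of the tone because 440 Hz is an integer multiple of the 10 Hz bin spacing.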

Chapter 4 presents a wide range of features from the time and frequency domains that have been widely used in various audio analysis approaches. Sufficient theoretical background is provided for each feature, along with MATLAB code from the companion library. In addition, the discrimination ability of each feature for particular types of sound is demonstrated.

After the reader has been introduced to a series of audio features and their discrimination capabilities for selected audio classes, Chapter 5 describes the task of classifying unknown audio segments of homogeneous content. For instance, in Chapter 4, the reader will learn that, in many cases, the standard deviation of short-time energy can discriminate between music and speech segments, whereas in Chapter 5 the reader will learn how to combine several feature statistics in the context of a classification procedure. To this end, we provide the necessary theoretical background for a series of standard classification techniques, including support vector machines, decision trees, and the k-nearest-neighbor method. The reader will also be introduced to generic performance measures and validation methods for the estimation of the performance of a classifier. The chapter concludes with the presentation of performance measurements for a series of typical audio classification tasks (e.g. music vs speech classification).
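To make the notion of a feature statistic concrete, the sketch below computes the short-time energy of two synthetic signals and then the standard deviation of that energy sequence. This is an illustration under our own assumptions, not the library's feature extractor: the frames are non-overlapping for brevity, and the gated tone is only a crude stand-in for a signal that contains pauses.

```matlab
% Minimal sketch: standard deviation of short-time energy as a feature statistic.
fs = 16000;                                   % sampling rate in Hz (assumed)
t  = (0:2*fs-1)' / fs;                        % two seconds of time stamps

steady = sin(2 * pi * 440 * t);               % tone with constant amplitude
gate   = double(mod(floor(t / 0.25), 2) == 0);% on/off switch every 250 ms
gated  = steady .* gate;                      % tone interrupted by silent gaps

winLen = round(0.050 * fs);                   % 50 ms frames (800 samples)

% Energy per non-overlapping frame: mean of the squared samples in each column.
shortTimeEnergy = @(x) mean(reshape(x(1:winLen * floor(numel(x) / winLen)), ...
                                    winLen, []) .^ 2, 1);

eSteady = shortTimeEnergy(steady);
eGated  = shortTimeEnergy(gated);

% One number per signal: how much the energy fluctuates over time.
stdSteady = std(eSteady);
stdGated  = std(eGated);
```

The energy of the steady tone is nearly constant across frames, so `stdSteady` is close to zero, whereas `stdGated` is clearly larger. A classifier would typically combine several such statistics rather than rely on a single one.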

Chapter 6 presents another processing stage of vital importance in audio analysis: the segmentation stage. The goal of this task is to split an uninterrupted audio signal into segments of homogeneous content. In this chapter we focus on two general segmentation methodologies: those that exploit prior knowledge of the audio types involved (and therefore embed some type of classification mechanism) and those that are unsupervised (or semi-supervised). We also study segmentation tasks of particular interest, e.g. silence removal and speaker diarization.

Chapter 7 also looks at classification techniques, but from a different point of view to Chapter 5. The focus of this chapter is on template matching and hidden Markov modeling, and its goal is to exploit the temporal evolution of the feature sequence, whereas the methods in Chapter 5 are based on statistical averages of the feature sequences on a mid-term and long-term basis.
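The difference between the two viewpoints can be illustrated with a toy template matching example. The sketch below implements the basic dynamic time warping recursion with an absolute-difference local cost and no path constraints; it is a simplified illustration with made-up one-dimensional sequences, not the method as implemented in the companion library.

```matlab
% Minimal sketch: basic dynamic time warping (DTW) between 1-D sequences.
template   = [0 1 2 3 0];
candidates = {[0 0 1 2 2 3 3 0], ...   % time-stretched copy of the template
              fliplr(template)};       % same values in reverse order

costs = zeros(1, numel(candidates));
for c = 1:numel(candidates)
    seq = candidates{c};
    n = numel(template);
    m = numel(seq);

    % D(i+1, j+1): cost of the best alignment of template(1:i) with seq(1:j).
    D = inf(n + 1, m + 1);
    D(1, 1) = 0;
    for i = 1:n
        for j = 1:m
            localCost = abs(template(i) - seq(j));
            D(i + 1, j + 1) = localCost + ...
                min([D(i, j + 1), D(i + 1, j), D(i, j)]);
        end
    end
    costs(c) = D(n + 1, m + 1);
end
```

The stretched copy aligns perfectly with the template, so its cost is zero even though its mean value differs. The reversed sequence has exactly the same mean as the template, yet its matching cost is positive because the order of the values is different. This is the kind of temporal information that averages over a segment cannot capture.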

Finally, Chapter 8 presents a series of music information retrieval applications, including music thumbnailing; tempo and music meter induction; and music content visualization. It presents methods that combine audio analysis techniques from earlier chapters, in the context of music information retrieval applications. However, in some cases new concepts are introduced.

The book also provides an extensive appendix that covers the following:

• The MATLAB functions and data of the software library.

• A short description of libraries and software packages available on the Web that are related to audio analysis and pattern recognition. This is not limited to MATLAB approaches, as Python and C/C++ packages are also included.

• A list of datasets available on the Web that can be used as training and evaluation data for several audio analysis tasks.

1.3 A Note on Exercises

At the end of each chapter, we provide a set of exercises. The purpose of the exercises is to cover the content of the respective chapter, while encouraging the reader to extend their knowledge of audio analysis tasks. The exercises have been graded based on their difficulty with a five-level grading system, as presented in Table 1.1.

Table 1.1

Difficulty Levels of the Exercises

Level | Description
D1 | Simple questions that require either trivial MATLAB coding or a short answer.
D2 | Exercises that test the reader’s understanding of the MATLAB code of the respective chapter.
D3 | More complex assignments that require either a combination of knowledge presented in the chapter or more advanced critical thinking. May require writing MATLAB code.
D4 | Larger assignments that require more MATLAB coding. Can be used as weekly or monthly projects in the context of an audio analysis course.
D5 | The most challenging level includes exercises that require extensive MATLAB coding, external bibliographic references, or external MATLAB libraries. Exercises of this level can be used as final projects.

We encourage readers who are new to the field to read all chapters from the beginning of the book and complete all the exercises. More advanced readers can skip certain parts of the book based on their experience. The book is suitable for undergraduate courses on audio, especially when a diverse audience is involved. The MATLAB code can also be used to set up laboratory exercises and projects.
