Yuming Fang1, Weisi Lin1 and Stefan Winkler2
1Nanyang Technological University, Singapore
2Advanced Digital Sciences Center (ADSC), Singapore
Quality evaluation for multimedia content is a fundamental and challenging problem in the field of multimedia processing, with a wide range of practical applications including system evaluation, implementation, optimization, testing, and monitoring. Generally, the quality of multimedia content is affected by many stages of the delivery chain, such as acquisition, processing, compression, transmission, decoding, and the output interface [1–3]. The perceived quality of impaired multimedia content further depends on various factors: the individual interests, quality expectations, and viewing experience of the user; the type and properties of the output interface; and so on [2–5].
Since the Human Visual System (HVS) and the Human Auditory System (HAS) are the ultimate receivers and interpreters of the content, subjective measurement represents the most accurate method and thus serves as the benchmark for objective quality assessment [2–4]. Subjective experiments require a number of subjects to watch and/or listen to the test material and rate its quality. The Mean Opinion Score (MOS) denotes the average rating over all subjects for each piece of multimedia content. A detailed discussion of subjective measurement can be found in Chapter 6. Although subjective experiments are accurate for the quality evaluation of multimedia content, they suffer from important drawbacks and limitations: they are time consuming, laborious, and expensive [2]. Therefore, many objective metrics have been proposed over the past decades to evaluate the quality of multimedia content. Objective metrics try to approximate human perception of multimedia quality. Compared with subjective experiments, objective metrics are advantageous in terms of repeatability and scalability.
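As a minimal illustration of how MOS is obtained, the score for one piece of test material is simply the arithmetic mean of the subjects' ratings. The sketch below uses hypothetical ratings on a 1–5 scale; outlier screening and normalization steps used in formal subjective tests are omitted:

```python
import numpy as np

def mean_opinion_score(ratings):
    """Average the subjective ratings (e.g., on a 1-5 scale) over all subjects."""
    return float(np.mean(ratings))

# Hypothetical ratings from 8 subjects for one test clip.
ratings = [4, 5, 3, 4, 4, 5, 4, 3]
mos = mean_opinion_score(ratings)
print(mos)  # 4.0
```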
Objective quality evaluation methods can be classified into two broad types: signal fidelity metrics and perceptual quality metrics [2]. Signal fidelity metrics evaluate the quality of the distorted signal by comparing it with the reference without considering the signal content type, while perceptual quality metrics take the signal properties into consideration together with the characteristics of the HVS (for image and video content) or HAS (for audio signals). Signal fidelity metrics include traditional objective quality assessment methods such as MAE (Mean Absolute Error), MSE (Mean Square Error), SNR (Signal-to-Noise Ratio), PSNR (Peak SNR), or one of their relatives [6]. These traditional objective metrics are widely accepted in the research community for several reasons: they are well defined, and their formulas are simple and easy to understand and implement. From a mathematical point of view, minimizing MSE is also well understood.
Although signal fidelity metrics are widely used to measure the quality of signals, they are generally poor predictors of perceived quality, particularly for non-additive noise distortions [7, 8]. They have only an approximate relationship with the quality perceived by human observers, since they are based mainly on byte-by-byte comparison without considering what each byte represents [3, 4]. Signal fidelity metrics essentially ignore the spatial and temporal relationships in the content. For these reasons, it is well accepted that signal fidelity metrics do not align well with human perception of multimedia content [2, 3, 6, 9–11].
To overcome the problems of signal fidelity metrics, a significant amount of effort has been spent trying to design more logical, economical, and user-oriented perceptual quality metrics [3, 5, 11–18]. In spite of the recent progress in related fields, objective evaluation of signal quality in line with human perceptions is still a long and difficult odyssey [3, 5, 12–16] due to the complex multidisciplinary nature of the problem (related to physiology, psychology, vision research, audio/speech research, and computer science), the limited understanding of human perceptions, and the diverse scope of applications and requirements. However, with proper modeling of major underlying physiological and psychological phenomena, it is now possible to develop better-quality metrics to replace signal fidelity metrics, starting with various specific practical situations.
This chapter is organized as follows. Section 3.2 provides an introduction to the quality metric taxonomy. In Section 3.3, the basic computational modules for perceptual quality metrics are given. Sections 3.4 and 3.5 introduce the existing quality metrics for images and video, respectively. Quality metrics for audio/speech are described in detail in Section 3.6. Section 3.7 presents joint audiovisual quality metrics. The final section concludes the chapter.
Existing quality metrics can be classified according to different criteria, as depicted in Fig. 3.1. There are basically two categories of perceptual metrics [5], relying on a perception-based approach or a signal-driven approach. For the first category [19–22], objective metrics are built upon relevant psychophysical properties and physiological knowledge of the HVS or HAS, while the signal-driven approach evaluates the quality of the signal from the aspect of signal extraction and analysis. Among the psychophysical properties and physiological knowledge used in perception-based approaches: the Contrast Sensitivity Function (CSF) models the HVS's sensitivity to signal contrast as a function of spatial frequency and temporal motion velocity, exhibiting an inverted-U (band-pass) shape with increasing spatial and temporal frequency; luminance adaptation refers to the just-noticeable luminance contrast as a function of background luminance; visual masking is the increase in the HVS's contrast threshold for visual content in the presence of other content, and can be divided into intra-channel masking, caused by the visual content itself, and inter-channel masking, caused by visual content with different frequencies and orientations [1, 2, 4, 5]. For audio/speech signals, the commonly used psychophysical properties and physiological knowledge of the HAS include the effects of the outer and middle ear, simultaneous masking, forward and backward temporal masking, etc. [3].
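To make the band-pass shape of the CSF concrete, the sketch below evaluates one classical analytic approximation of the spatial CSF, the Mannos–Sakrison fit; it is illustrative only and not the specific CSF model used by any metric cited here, and the sampled frequencies are arbitrary:

```python
import numpy as np

def csf_mannos_sakrison(f):
    """Contrast sensitivity as a function of spatial frequency f (cycles/degree).

    Classical analytic fit from Mannos & Sakrison (1974); it peaks at a few
    cycles per degree and falls off at both low and high frequencies,
    giving the band-pass (inverted-U) shape described in the text.
    """
    f = np.asarray(f, dtype=float)
    return 2.6 * (0.0192 + 0.114 * f) * np.exp(-((0.114 * f) ** 1.1))

freqs = np.array([1.0, 4.0, 8.0, 16.0, 32.0])
sens = csf_mannos_sakrison(freqs)
print(np.round(sens, 3))  # sensitivity peaks near 8 cpd, drops toward both ends
```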
Since the perception-based approaches involve high computational complexity and there are difficulties in bridging the gap between vision research and the requirements of engineering modeling, more recent research efforts have been directed at signal-driven perceptual quality metrics. Compared with perception-based approaches, signal-driven ones do not need to model human perception characteristics. Instead, signal-driven approaches attempt to evaluate quality from aspects of signal extraction and analysis, such as statistical features [23], structural similarity [24], luminance/color distortion [25], and common artifacts [26, 27]. These metrics also consider the effects of human perception by content and distortion analysis, instead of fundamental bottom-up perception modeling.
Another classification of objective quality metrics is based on the availability of the original signal, which is considered to be distortion free or of perfect quality and may be used as a reference to evaluate the distorted signal. Based on this availability, quality metrics can be divided into three categories [1, 2]: Full-Reference (FR) metrics, which require the processed signal and the complete reference signal [28–36]; Reduced-Reference (RR) metrics, which require the processed signal and only part of the reference information [23, 37]; and No-Reference (NR) metrics, which require only the processed signal [38–41]. Traditional signal fidelity metrics are FR metrics. Most perceptual quality metrics are also of the FR type, including most perception-driven metrics, many signal-driven visual quality metrics, and most perceptual audio/speech quality metrics. Generally, FR quality metrics can measure signal quality more accurately than RR or NR metrics, since they have more information available.
FR quality metrics evaluate the quality of the processed signal with respect to the reference signal. The traditional signal fidelity metrics – such as MSE, SNR, and PSNR – are early FR metrics. They have been the dominant quantitative performance metrics in the field of signal processing for decades. Although they exhibit poor accuracy when dealing with perceptual signals, they are still the standard criterion and widely used. Signal fidelity metrics in quality assessment try to provide a quantitative score that describes the level of error/distortion of the processed signal by comparing it with the reference signal. Suppose that $X = \{x_i \mid i = 1, 2, \ldots, N\}$ is a finite-length, discrete original signal (reference signal) and $\hat{X} = \{\hat{x}_i \mid i = 1, 2, \ldots, N\}$ is the corresponding distorted signal, where $N$ is the signal length, and $x_i$ and $\hat{x}_i$ are the values of the $i$th samples in $X$ and $\hat{X}$, respectively. The MSE between the distorted and the reference signals is calculated as
$$\mathrm{MSE}(X, \hat{X}) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{x}_i)^2$$
where $\mathrm{MSE}(X, \hat{X})$ is used as the quality measurement of the distorted signal $\hat{X}$. A more general form of the distortion is the $l_p$ norm [11]:
$$E_p = \left( \sum_{i=1}^{N} |x_i - \hat{x}_i|^p \right)^{1/p}$$
The PSNR measure can be obtained from MSE as
$$\mathrm{PSNR} = 10 \log_{10} \frac{\mathrm{MAX}^2}{\mathrm{MSE}}$$
where $\mathrm{MAX}$ is the maximum possible signal intensity value (e.g., 255 for an 8-bit image).
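The signal fidelity measures just described follow directly from their definitions; a minimal sketch (with arbitrary sample signals) is:

```python
import numpy as np

def mse(x, x_hat):
    """Mean square error between reference x and distorted x_hat."""
    x, x_hat = np.asarray(x, float), np.asarray(x_hat, float)
    return float(np.mean((x - x_hat) ** 2))

def lp_distortion(x, x_hat, p=2):
    """l_p norm of the error signal (p = 2 gives the Euclidean error)."""
    x, x_hat = np.asarray(x, float), np.asarray(x_hat, float)
    return float(np.sum(np.abs(x - x_hat) ** p) ** (1.0 / p))

def psnr(x, x_hat, max_val=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical signals."""
    m = mse(x, x_hat)
    return float('inf') if m == 0 else 10.0 * np.log10(max_val ** 2 / m)

ref = np.array([100, 120, 140, 160], dtype=float)
dist = ref + np.array([1, -1, 1, -1], dtype=float)  # unit error on each sample
print(mse(ref, dist))              # 1.0
print(round(psnr(ref, dist), 2))   # 48.13
```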
Aside from the traditional signal fidelity metrics, many perceptual metrics are also FR metrics. As introduced previously, the perception-driven approach to quality assessment mainly measures the quality of images/video by modeling the characteristics of the HVS. In the simplest perception-driven approaches, the HVS is considered as a single spatial filter modeling the CSF [42–45]. More sophisticated perception-driven FR metrics try to incorporate local contrast, the spatiotemporal CSF, contrast/activity masking, and other advanced HVS functions to build quality metrics for images/video [19–22, 46–50]. Compared with perception-driven approaches, signal-driven approaches are relatively less sophisticated and thus computationally inexpensive. There are many FR signal-driven metrics for visual quality assessment [35, 36, 51–54]. Similarly, most perceptual audio/speech quality metrics are FR [3, 55–76]. Early audio quality metrics were designed for low-bit-rate speech and audio codecs. Perception-based models were used to optimize distortion for minimum audibility rather than for traditional signal fidelity, leading to an improvement in perceived quality [30]. Based on the characteristics of the HAS, various perceptual FR audio/speech quality metrics have been proposed [17, 63–76]. A detailed discussion of audio/speech quality metrics is provided in Section 3.6.
Generally, FR metrics require the complete reference signal, usually in unimpaired and uncompressed form. This requirement is quite a heavy restriction for practical applications. Furthermore, FR metrics generally impose a precise alignment of the reference and distorted signals, so that each sample in the distorted signal can be matched with its corresponding reference sample. For video or audio signals, temporal registration in particular can be very difficult to achieve in practice due to the information loss, content repeats, or variable delays introduced by the system. Aside from the issue of spatial and temporal alignment, FR metrics usually do not respond well to global shifts in certain features (such as brightness, contrast, or color), and require a corresponding calibration of the signals. Therefore, FR metrics are most suitable for offline signal quality measurements such as codec tuning or lab testing.
RR quality metrics require only some information about the reference (e.g., in the form of a number of features extracted from the reference signal) for quality assessment [77–108]. Normally, the more reference information is available to an RR metric, the more accurate the predictions it can make. In the extreme, when the data rate of the reference information is large enough to reconstruct the original signal, RR quality metrics converge to FR metrics.
Based on the underlying design philosophy, RR quality metrics can be classified into three approaches [77]: modeling the signal distortion, using characteristics or theories of the human perception system, and analysis of signal source. The last two types can be considered as general-purpose metrics, since the statistical and perceptual features are not related to any specific type of signal distortion.
The first type of RR quality metric, based on modeling signal distortion, is developed mainly for specific applications. These methods provide straightforward solutions when there is sufficient knowledge about how the content has been processed. For distortions introduced by standard image or video compression, RR quality metrics can target the typical artifacts of blurring and blockiness to measure the quality of the visual content [79, 81]. Various RR quality metrics have been proposed to measure the distortions occurring in standard compression systems such as MPEG-2-coded video [82], JPEG images [83], H.264/AVC-coded video [84], etc. The drawback of this kind of RR quality metric is its limited generalization capability, since it is always designed for certain specific kinds of signal distortion.
RR quality metrics based on the human perceptual system measure content quality by extracting perceptual features, possibly employing computational models of human perception. In [86], an RR quality metric is proposed that extracts perceptual features from JPEG and JPEG2000 images and achieves good evaluation performance. Many RR quality metrics for video are designed based on various characteristics of the HVS, such as color perception theory [87], the CSF [88–90], structural information perception [91], texture masking [92, 93], etc. Aside from features extracted in the spatial domain, some RR quality metrics are built on features extracted via the contourlet transform [90], wavelet transform [94], Fourier transform [95], etc.
The third type of RR quality metric measures content quality based on models of the signal source. Since the reference signal is not available in a deterministic sense, these models are often based on statistical knowledge and capture certain statistical properties of natural scenes [77, 85]. Distortions disturb these statistical properties in characteristic, unnatural ways, which can be measured by statistical models of natural scenes. Many RR quality metrics are designed based on differences in feature distributions of color [99], motion [100], etc. Metrics of this type can also be designed based on features in a transform domain, such as the divisive normalization transform [104], wavelet transform [105, 106], Discrete Cosine Transform (DCT) [102, 107, 108], etc.
RR approaches make it possible to avoid some of the assumptions and pitfalls of pure no-reference metrics while keeping the amount of reference information manageable. Like FR metrics, RR metrics have alignment requirements, but these are typically less stringent, as only the extracted features need to be aligned. Generally, RR metrics are well suited for monitoring in-service content at different points in the distribution system.
Compared with FR and RR quality metrics, NR quality metrics do not require any reference information [109–156]. Thus, they are highly desirable in many practical applications where reference signals are not available.
A number of methods have been proposed to predict the MSE caused by specific compression schemes such as MPEG-2 [112, 114], JPEG [116], or H.264 [117, 118, 120]. These methods use information from the bit stream directly, except for [112], which uses the decoded pixel information. The DCT coefficients in these techniques are usually modeled by Laplacian, Gaussian, or Cauchy distributions. The main drawback of these techniques is that they measure the distortion of each 8×8 block without considering the differences from neighboring blocks [110]. Another problem is that their performance decreases at lower bit rates, as more coefficients are quantized to zero. Other NR visual quality metrics have tried to estimate the MSE caused by packet loss errors [121, 124, 139], the difference between the processed signal and a smoothed version of it [126], the variation within the smooth regions of the signal [129], etc.
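The underlying idea of such bit-stream-based MSE prediction can be sketched with a small Monte-Carlo experiment: model AC DCT coefficients as Laplacian-distributed and estimate the distortion introduced by uniform quantization with step size q. The scale parameter and step sizes below are illustrative assumptions, not values from the cited studies:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed Laplacian model for AC DCT coefficients (scale chosen arbitrarily).
scale = 4.0
coeffs = rng.laplace(0.0, scale, size=100_000)

def quantization_mse(coeffs, q):
    """Empirical MSE of a uniform mid-tread quantizer with step size q."""
    quantized = q * np.round(coeffs / q)
    return float(np.mean((coeffs - quantized) ** 2))

# Coarser quantization (lower bit rate) produces larger distortion.
for q in (2.0, 8.0, 32.0):
    print(q, round(quantization_mse(coeffs, q), 3))
```

For fine steps the error is roughly uniform and the MSE approaches the classical q²/12 value; for very coarse steps most coefficients are quantized to zero, which is exactly the regime where these predictors lose accuracy, as noted above.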
Generally, NR quality metrics assume that the statistics of the processed signals differ from those of the original and extract features from the processed signals to evaluate model compliance [85, 110]. NR quality metrics can be designed based on features from various domains, such as the spatial domain [27, 132], Fourier domain [24], DCT domain [134, 135], or polynomial transforms [136]. Additionally, many NR quality metrics are based on features of the visual content, such as sharpness [137], edge extent [138, 140], blurring [143, 144], phase coherence [145], ringing [146, 147], naturalness [148], or color [149]. In some NR quality metrics, blockiness, blurring, or ringing features are combined with other features, such as bit-stream features [150, 151] or edge gradients [152]. Compared with still images, video and audio signals additionally require temporal features to be considered. Various NR quality metrics for video content have been proposed to measure flicker [153] or frame freezes [154, 156]. There are also NR metrics for speech quality evaluation [109, 113, 115, 119, 122, 123], mainly designed based on analysis of the audio spectrum.
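As a toy example of an NR content feature, the sketch below computes a generic gradient-based sharpness score; it is not one of the cited metrics, but it shows the basic principle that blurring attenuates high-frequency structure and therefore lowers the feature value, with no reference required:

```python
import numpy as np

def sharpness_score(img):
    """Simple NR sharpness feature: mean gradient magnitude of the image."""
    img = np.asarray(img, float)
    gy, gx = np.gradient(img)
    return float(np.mean(np.hypot(gx, gy)))

rng = np.random.default_rng(1)
sharp = rng.uniform(0, 255, (64, 64))      # high-detail synthetic image
# Crude 2x2 box blur of the same content (circular shifts keep the size).
blurred = (sharp + np.roll(sharp, 1, 0) + np.roll(sharp, 1, 1)
           + np.roll(np.roll(sharp, 1, 0), 1, 1)) / 4.0
print(sharpness_score(sharp) > sharpness_score(blurred))  # True
```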
NR metrics analyze the distorted signal without the need for an explicit reference signal. This makes them much more flexible than FR or RR metrics, as it can be difficult or impossible to access the reference in some cases (e.g., video coming straight out of a camera). They are also completely free from alignment issues. The main challenge of NR metrics lies in distinguishing distortions from content, a distinction humans are usually able to make from experience. NR metrics always have to make assumptions about the signal content and/or the distortions of interest. This carries a risk of confusing actual content with distortions (e.g., a chessboard could be interpreted as blocking artifacts under certain conditions). Additionally, most NR quality metrics are designed for specific and limited types of distortion. They can face difficulties in modern communication systems, where distortions may be a combination of compression, adaptation, network delay, packet loss, and various types of filtering. Since they do not require reference signals, NR metrics are well suited for monitoring in-service content at different points in the distribution system.
Since traditional signal fidelity metrics assess the quality of the distorted signal by simply comparing it with the reference and were introduced in Section 3.2.1, here we provide only the basic computational modules for perceptual quality metrics. Generally, these include signal decomposition (e.g., decomposing an image or video into different color, spatial, and temporal channels), detection of common features (like contrast and motion) and artifacts (like blockiness and blurring), just-noticeable distortion (i.e., the maximum change in visual content that cannot be detected by the majority of viewers), Visual Attention (VA) (i.e., the HVS's selectivity in responding to the most interesting activities in the visual field), etc. First, many of these modules are based on related physiological and psychological knowledge. Second, most are independent research topics in themselves, like just-noticeable distortion and VA modeling, with applications beyond perceptual quality metrics (image/video coding [157], watermarking [158], error resilience [159], and computer graphics [160], to name just a few). Third, these modules can serve as simple perceptual quality metrics themselves in specific situations (e.g., blockiness and blurring measures).
Most perception-driven quality metrics use signal decomposition for feature extraction. Signal feature extraction and common artifact detection are at the core of many signal-driven quality metrics; the perceptual effect of common artifacts far exceeds the extent of their representation in MSE or PSNR. Just-noticeable distortion and VA models have been used either independently or jointly to evaluate the visibility and perceived extent of visual content differences. Therefore, all these techniques help to address the three basic problems (as mentioned at the beginning of this chapter) to be overcome relative to traditional signal fidelity metrics, since they enable the differentiation of various content changes for perceptual quality-evaluation purposes.
For images and video, signal decomposition refers to the decomposition of visual content into different channels (spatial, frequency, and temporal) for further processing. It is well known that the HVS processes achromatic and chromatic components separately, has different pathways for visual content with different motion, and has specialized cells in the visual cortex for distinct orientations [161]. Psychophysical studies also show that visual content is processed differently in the HVS according to frequency [162] and orientation [163, 164]. Thus, decomposing an image or video frame into different color, spatial, and temporal channels allows content changes to be treated unequally in each channel so as to emulate the HVS response, which addresses the third problem of traditional signal fidelity metrics mentioned at the beginning of this chapter.
Currently, there are various signal decomposition methods for color [165–168]. Two widely accepted color spaces in quality assessment are the opponent-color (black/white, red/green, blue/yellow) space [22, 166] based on physiological evidence of opponent cells in the parvocellular pathway and CIELAB space [167] based on human perceptions of color differences. With compressed visual content, YCbCr space is more convenient for feature extraction due to its wide use in image/video compression standards [53, 165, 168]. Other color spaces have also been used, such as YOZ [21]. In many metrics, only the luminance component of the visual content is used for efficiency [49, 169–171], since it is generally more important for human visual perception than chrominance components, especially in quality evaluation of compressed images (it is worthwhile pointing out that most coding decisions in current image/video compression algorithms are made based on luminance manipulation).
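Since YCbCr is the space most metrics for compressed content work in, a minimal conversion sketch using the ITU-R BT.601 coefficients (the set used in JPEG and most video standards) is given below; it also shows why luminance-only metrics are convenient, as Y is a single plane:

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Convert an RGB image (float, 0-255 range) to YCbCr using the
    ITU-R BT.601 coefficients common in image/video compression standards."""
    rgb = np.asarray(rgb, float)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  =  0.299 * r + 0.587 * g + 0.114 * b          # luminance
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128.0  # blue-difference chroma
    cr =  0.5 * r - 0.418688 * g - 0.081312 * b + 128.0  # red-difference chroma
    return np.stack([y, cb, cr], axis=-1)

# A pure-gray pixel has neutral chroma (Cb = Cr = 128) and Y equal to its level.
gray = np.array([[[200.0, 200.0, 200.0]]])
ycc = rgb_to_ycbcr(gray)
print(np.round(ycc[0, 0], 1))  # [200. 128. 128.]
```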
Temporal decomposition is implemented with a sustained (low-pass) filter and transient (band-pass) filters [46, 172] to simulate two different visual pathways. Based on the fact that receptive fields in the primary visual cortex resemble Gabor patterns [173], which can be characterized by a particular spatial frequency and orientation, many types of spatial filter can be used to decompose each temporal channel, including Gabor filters, cortex filters [174], wavelets, the Gaussian pyramid [175], and steerable pyramid filters [46, 176].
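A minimal sketch of one such spatial filter follows: a real-valued Gabor kernel tuned to a given frequency and orientation, of the kind a decomposition stage would apply at several frequencies and orientations. The parameter values are illustrative choices, not those of any cited metric:

```python
import numpy as np

def gabor_kernel(size, freq, theta, sigma):
    """Real-valued Gabor kernel tuned to spatial frequency `freq`
    (cycles/pixel) and orientation `theta` (radians) -- a common model of
    simple-cell receptive fields in the primary visual cortex."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotate coordinates so the sinusoidal carrier runs along orientation theta.
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))  # Gaussian window
    carrier = np.cos(2.0 * np.pi * freq * xr)             # oriented sinusoid
    kernel = envelope * carrier
    return kernel - kernel.mean()   # zero mean: flat regions give no response

k = gabor_kernel(size=15, freq=0.2, theta=0.0, sigma=3.0)
print(k.shape)              # (15, 15)
print(abs(k.sum()) < 1e-9)  # True (zero DC response)
```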
For audio/speech signals, signal decomposition is implemented based on the properties of the peripheral auditory system, such as the perception of loudness, frequency, masking, etc. [177]. In the Perceptual Evaluation of Audio Quality (PEAQ) ITU standard [178], two psychoacoustic models are adopted to transform the time-domain input signals into a basilar membrane representation for further processing: the FFT (Fast Fourier Transform)-based ear model and the filter-bank-based ear model [3].

In the FFT-based ear model, the input signal is first transformed into the frequency domain, and the FFT amplitude is used for further processing. The effects of the outer and middle ear on the audio signal are then modeled for the frequency components based on Terhardt's approach [38]. After that, the frequency components are grouped into critical frequency bands as perceived by the HAS. An internal ear-noise model is used to obtain the pitch patterns of the audio signal [3]. These pitch patterns are smeared over the frequencies by a spreading function modeling simultaneous masking. Finally, the forward masking characteristics of temporal masking are approximated by a simple first-order low-pass filter.

In the filter-bank-based ear model, which is mainly based on the model in [179], the audio signal is processed in the time domain [3]. Compared with the FFT-based ear model, it adopts a finer time resolution, which makes the modeling of backward masking possible and thus maintains the fine temporal structure of the signal. First, the input audio signal is decomposed into band-pass signals by a filter bank of equally spaced critical bands. Similar to the FFT-based ear model, the effects of the outer and middle ear on the audio signal are modeled. Then the characteristics of simultaneous masking, backward masking, internal ear noise, and forward masking are modeled for the final feature extraction [3].
Similar signal decomposition methods based on psychoacoustic models have been implemented in many studies [67, 75]. During the transformation, the input audio signals are decomposed into different band-pass signals by modeling various characteristics of the HAS, such as the effects of the outer and middle ear [38], simultaneous masking (frequency spreading) [180], forward and backward temporal masking [3, 122], etc. Following PEAQ, other new psychoacoustic models have been proposed that incorporate recent findings into the design of perceptual audio quality metrics [17, 70].
The process of feature and artifact detection is common for visual quality evaluation in various scenarios. For example, meaningful visual information is conveyed by feature contrast such as luminance, color, orientation, texture, motion, etc. There is little or no information in a largely uniform image. The HVS perceives much more from signal contrast than from absolute signal strength, since there are specialized cells to process this information [181]. This is also the reason why contrast is central to CSF, luminance adaptation, contrast masking, visual attention, and so on.
For audio/speech quality metrics, various features are extracted for artifact detection in transform domains, such as modulation, loudness, excitation, and slow-gain-variation features [3, 178]. For NR speech quality assessment, perceptual linear prediction coefficients are used for quality evaluation [115, 128]. In [155], vocal tract and unnaturalness features are extracted from speech signals for quality evaluation. In PEAQ, the cognitive model processes the parameters from the psychoacoustic model to obtain Model Output Variables (MOVs) and maps them to a single Overall Difference Grade (ODG) score [3]. The MOVs are extracted based on various parameters including loudness, amplitude modulation, adaptation, masking, etc., and they also model concepts such as linear distortion, bandwidth, modulation difference, noise loudness, etc. The MOVs are used as input to a neural network and mapped to a distortion index. The ODG is then calculated from the distortion index to estimate the quality of the audio signal.
Certain structural artifacts occur in prevalent signal compression and delivery processes and result in annoying effects for the viewer. The common structural artifacts caused by coding include blockiness, blurring, edge damage, and ringing [171], whose perceptual effect is ignored in traditional signal fidelity metrics such as MSE and PSNR. In fact, even uncompressed images/video usually contain blurring artifacts, due to the imperfect Point Spread Function (PSF) of the imaging system, defocus, and object motion during signal capture [182]. In video quality evaluation, the effects of motion and jerkiness have been investigated [156, 183]. Similarly, coding distortions in audio/speech signals have been well investigated [30, 63–65], and some studies have addressed the quality evaluation of noise-suppressed audio/speech [125].
Another type of quality metrics is designed specifically to measure the impact of network losses on perceptual quality. This development is the result of increasing multimedia service delivery over IP networks, such as Internet streaming or IPTV. Since information loss directly affects the encoded bit stream, such metrics are often designed based on parameters extracted from the transport stream and the bit stream with no or little decoding. This has the added advantage of much lower data rates and thus lower bandwidth and processing requirements compared with metrics looking at the fully decoded video/audio. Using such metrics, it is thus possible to measure the quality of many video/audio streams or channels in parallel. At the same time, these metrics have to be adapted to specific codecs and network protocols. Due to the different types of features and artifacts, so-called “hybrid” metrics use a combination of different features or approaches for quality assessment [5]. Some studies explore the joint impact of packet loss rate and MPEG-2 bit rate on video quality [184], the influence of bit-stream parameters (such as motion vector length or number of slice losses) on the visibility of packet losses in MPEG-2 and H.264 videos [139], the joint impact of the low-bit-rate codec and packet loss on audio/speech quality [185], etc.
In some quality metrics, multiple features or quality evaluation approaches are combined. In [152], several structural features, such as blocking, blurring, edge-based image activity, gradient-based image activity, and intensity masking, are linearly combined for quality evaluation, with the feature weights determined by a multi-objective optimization method. In [169], the input visual scene is decomposed into predicted and disorderly portions based on an internal generative mechanism in the human brain; structural similarity and PSNR are adopted for quality evaluation of the two portions respectively, and the overall score is obtained by combining the two results with an adaptive nonlinear procedure. In [186], phase congruency and gradient magnitude play complementary roles in the quality assessment of images; after the local quality map is calculated, phase congruency is used again as a weighting function to derive the overall score. The study in [187] presents a visual quality metric based on two strategies in the HVS: a detection-based strategy for high-quality images containing near-threshold distortions and an appearance-based strategy for low-quality images containing clearly supra-threshold distortions. Different measurement methods are designed for these two quality levels, and the overall score is obtained by combining the results adaptively [187]. In [70], the linear and nonlinear distortions in the perceptual transform (excitation) domain are combined linearly for speech quality evaluation. In [76], the audio quality evaluation results from spectral and spatial features are multiplied to obtain the overall quality of audio signals.
Recently, a new way of fusing different features or quality evaluation approaches has emerged based on machine learning techniques [111, 125, 188, 189]. In [111], machine learning is adopted for the feature pooling process in visual quality assessment to address the limitations of existing pooling methods such as linear combination. A similar pooling process based on support vector regression is introduced for speech quality assessment in [125]. Rather than using machine learning for feature pooling, the multi-method fusion quality metric in [189] nonlinearly combines the scores of existing metrics, with the weights obtained from a training process. In some NR quality metrics, machine learning techniques are also adopted to learn the mapping from feature space to quality scores [190].
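A toy sketch of learned feature pooling follows, with ordinary least squares standing in for the support vector regression or neural networks used in the cited work. All data are synthetic and the feature names are hypothetical; the point is only the two-step pattern of fitting a pooling function on training data and then scoring unseen signals:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic training set: per-signal artifact features (e.g., blockiness,
# blur, ringing scores) and corresponding subjective MOS values.
features = rng.uniform(0, 1, size=(50, 3))
true_weights = np.array([-2.0, -1.5, -0.5])          # assumed ground truth
mos = 4.5 + features @ true_weights + rng.normal(0, 0.05, 50)

# Learn a linear pooling of features into a quality score by least squares;
# learned-pooling metrics replace this step with SVR or a neural network.
design = np.hstack([np.ones((50, 1)), features])     # bias column + features
weights, *_ = np.linalg.lstsq(design, mos, rcond=None)

# Score a previously unseen signal (bias = 1 plus its three feature values).
new_signal = np.array([1.0, 0.2, 0.1, 0.3])
print(round(float(new_signal @ weights), 2))         # predicted quality score
```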
In [191], band-pass-filtered and low-pass-filtered images are used to evaluate the local image contrast. Following this methodology, image contrast is calculated as the ratio of the combined analytic oriented filter response to the low-pass filtered image in the wavelet domain [192] or the ratio of high-pass response in the Haar wavelet space [193]. Luminance contrast is estimated as the ratio of the noticeable pixel change to the average luminance in a neighborhood [53]. The contrast can also be computed as a local difference between the reference video frame and the processed one with the Gaussian pyramid decomposition [47], or the comparison between DCT amplitudes and the amplitude of the DC coefficient of the corresponding block [21]. The k-means clustering algorithm can be used to group image blocks for the calculation of color and texture contrast [194], where the largest cluster is considered as the image background. The contrast is then computed as the Euclidean distance from the means of the corresponding background cluster. Motion contrast can be obtained by relative motion, which is represented by object motion against the background [194].
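The band-pass/low-pass ratio idea above can be sketched in code. This is a simplified, hedged illustration in which a naive box blur stands in for a proper low-pass filter; it is not the filter bank of [191–193], and the function names are illustrative.

```python
# Band-limited local contrast: ratio of the band-pass response
# (image minus its low-pass version) to the low-pass image.

def box_blur(img, radius=1):
    """Naive box blur with edge clamping; img is a list of row lists."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc, n = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += img[yy][xx]
                    n += 1
            out[y][x] = acc / n
    return out

def local_contrast(img, radius=1, eps=1e-6):
    """Per-pixel ratio of band-pass response to the low-pass image."""
    low = box_blur(img, radius)
    return [[(img[y][x] - low[y][x]) / (low[y][x] + eps)
             for x in range(len(img[0]))] for y in range(len(img))]
```

A flat region yields zero contrast everywhere, while contrast grows with local luminance variation relative to the background, in the Weber-like spirit of the measures cited above.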
Blockiness is a prevailing degradation caused by block-based DCT coding, especially at low bit rates, due to the different quantization step sizes used in neighboring blocks and the lack of consideration for inter-block correlation. Given an image I with width W and height H, divided into N × N blocks, the blockiness can be computed by accumulating the absolute luminance differences across the horizontal and vertical block boundaries [27].
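A minimal sketch of this boundary-difference computation follows; the averaging used here is illustrative, and [27] applies further weighting and normalization.

```python
# Average absolute luminance jump across N x N block boundaries.

def blockiness(img, N=8):
    """img: list of row lists (luminance); returns mean boundary jump."""
    h, w = len(img), len(img[0])
    diffs, count = 0.0, 0
    # vertical block boundaries (between columns N-1 and N, etc.)
    for y in range(h):
        for x in range(N, w, N):
            diffs += abs(img[y][x] - img[y][x - 1])
            count += 1
    # horizontal block boundaries (between rows N-1 and N, etc.)
    for y in range(N, h, N):
        for x in range(w):
            diffs += abs(img[y][x] - img[y - 1][x])
            count += 1
    return diffs / count if count else 0.0
```

An image with a sharp step exactly at an 8-pixel boundary scores high, while a flat image scores zero; real metrics additionally discount jumps caused by genuine object edges, as noted below.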
The method in [27] works only for a regular block structure with a fixed block size. It fails for modern video codecs (e.g., HEVC (High Efficiency Video Coding, H.265)) because of the variable block sizes they use. Other, similar calculation methods for blockiness can be found in [25, 195]. During the blockiness calculation, object edges at block boundaries can be excluded [29]. Luminance adaptation and texture masking have recently been considered for blockiness evaluation [135]. Another method for gauging blockiness is based on harmonic analysis [196], which can be used when block boundary positions are not known beforehand (e.g., when the video has been cropped, retaken by a camera, or coded with variable block sizes).
Blurring can be evaluated effectively around edges in images/video frames, since it is most noticeable there, and such detection is efficient because only a small fraction of image pixels lie on edges. With an available reference signal, the extent of blurring can be estimated via the contrast decrease on edges [53]. Various blind methods without a reference signal have been proposed for measuring blurring/sharpness, such as edge spread detection [26, 40], kurtosis [146], frequency domain analysis [143, 197], PSF estimation [198], width/amplitude of lines and edges [39], and local contrast via 2D analytic filters [199].
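A hedged 1-D sketch of edge-spread detection, in the spirit of [26, 40]: locate an edge as the maximum-gradient sample along a scan line, then measure the edge width as the extent of the monotonic luminance transition around it. Wider edges indicate stronger blurring. The walk-out logic here is illustrative, not the exact procedure of those references.

```python
# Estimate the width (in pixels) of the strongest edge in a scan line.

def edge_width(line):
    grads = [line[i + 1] - line[i] for i in range(len(line) - 1)]
    k = max(range(len(grads)), key=lambda i: abs(grads[i]))  # edge center
    sign = 1 if grads[k] >= 0 else -1
    # walk left while the luminance keeps changing in the same direction
    left = k
    while left > 0 and sign * (line[left] - line[left - 1]) > 0:
        left -= 1
    # walk right likewise
    right = k + 1
    while right < len(line) - 1 and sign * (line[right + 1] - line[right]) > 0:
        right += 1
    return right - left
```

A sharp step yields width 1, while a blurred ramp over the same luminance range yields a proportionally larger width.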
For visual quality evaluation of coded video, the major temporal distortion is jerkiness, which is mainly caused by frame dropping [200] and is very annoying to viewers, who prefer continuous and smooth temporal transitions. For decoded video without availability of the coding parameters, frame freeze can be simply detected by frame differences [201]; when the frame rate is available, the jerkiness effect can be evaluated using the frame rate [156, 183] or, more comprehensively, both the frame rate and temporal activity such as motion [202]. In [203], inter-frame correlation analysis is used to estimate the location, number, and duration of lost frames. In [154, 200], lost frames are detected by inter-frame dissimilarity to measure fluidity; these studies conclude that, for the same level of frame loss, scattered fluidity breaks introduce less quality degradation than aggregated ones. The impact of the time interval between occurrences of significant visual artifacts has also been investigated [204].
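Frame-freeze detection by frame differencing, as mentioned above, can be sketched very simply: a frame is flagged as frozen when its mean absolute difference from the previous frame falls below a small threshold. The threshold value here is illustrative, not taken from [201].

```python
# Flag frames that are (near-)identical to their predecessor.

def detect_freezes(frames, threshold=0.5):
    """frames: list of 2D luminance arrays; returns indices of frozen frames."""
    frozen = []
    for t in range(1, len(frames)):
        prev, cur = frames[t - 1], frames[t]
        mad = sum(abs(cur[y][x] - prev[y][x])
                  for y in range(len(cur)) for x in range(len(cur[0])))
        mad /= len(cur) * len(cur[0])  # mean absolute difference
        if mad < threshold:
            frozen.append(t)
    return frozen
```

In practice the threshold must be set above the sensor-noise floor, since even a truly frozen display pipeline can leave small pixel differences in captured video.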
As mentioned previously, not every signal change is noticeable. JND refers to a visibility or audibility threshold below which a change cannot be detected by the majority of viewers [201, 205–212]. Obviously, if a difference is below the JND value, it can be ignored in quality evaluation.
For images and video, DCT-based JND is the most investigated topic among all sub-band-based JND functions, since DCT has been used in all existing image/video compression standards such as JPEG, H.261/3/4, MPEG-1/2/4, and SVC. A general form of the DCT-sub-band luminance JND function is introduced in [52, 201, 207]. The widely used JND function developed by Ahumada and Peterson [211] for the baseline threshold fits spatial CSF curves with a parabola equation, which is a function of spatial frequencies and background luminance, and then compensates for the fact that the psychophysical experiments for determining CSF were conducted with a single signal at a time, and with spatial frequencies along just one direction. The luminance adaptation factor has been refined to represent the threshold variation versus background luminance [205], making it more consistent with the findings of subjective viewing of digital images [206, 212]. The intra-band masking effect was investigated in [52, 208]. Inter-band masking effects can be assigned as low, medium, or high masking after classifying DCT blocks into smooth, edge, and texture ones [205, 207], according to energy distribution among sub-bands. For temporal CSF effects, the velocity perceived by the retina for an image block needs to be estimated [209]. A method for incorporating the effect of the velocity for temporal CSF in JND is given in [210].
The JND can also be defined in other frequency decompositions (e.g., the Laplacian pyramid [210], the Discrete Wavelet Transform (DWT) [213]). The DWT is a popular alternative transform and, more importantly, resembles the HVS in its multiple sub-channel structure and frequency-varying resolution; nevertheless, DWT-based JND has received significantly less research attention than DCT-based JND. Chrominance masking [214] still needs more convincing investigation in all sub-band domains.
There are situations where JND estimated from pixels is more convenient and efficient to use (e.g., motion search [168], video replenishment [215], filtering of motion-compensated residuals [165], and edge enhancement [48, 216]), since the operations are usually performed on pixels rather than sub-bands. For quality evaluation of images and video, pixel-domain JND models avoid unnecessary sub-band decomposition. Most pixel-based JND functions developed so far have used luminance adaptation and texture masking. A general pixel-based JND model can be found in [216]. The temporal effect was addressed in [217] by multiplying the spatial effect with an elevation parameter increasing with inter-frame changes. The major shortcoming of pixel-based JND modeling lies in the difficulty of incorporating CSF explicitly, except for the case with conversion from a sub-band domain [218].
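A hedged sketch of a pixel-domain JND of the general kind described above: the visibility threshold at each pixel is taken as the dominant of a luminance-adaptation term and a texture-masking term. The functional forms and all constants below are purely illustrative, not the calibrated values of [216].

```python
# Illustrative pixel-domain JND: max of luminance adaptation and
# texture masking (constants are placeholders, not from [216]).

def pixel_jnd(bg_luminance, local_gradient):
    """bg_luminance in [0, 255]; local_gradient: local activity measure."""
    # Luminance adaptation: thresholds rise in very dark regions,
    # where Weber's law breaks down, and grow slowly in bright ones.
    if bg_luminance <= 127:
        f_lum = 17.0 * (1.0 - (bg_luminance / 127.0) ** 0.5) + 3.0
    else:
        f_lum = 3.0 / 128.0 * (bg_luminance - 127.0) + 3.0
    # Texture masking: threshold grows with local spatial activity.
    f_text = 0.1 * local_gradient
    return max(f_lum, f_text)
```

Distortions below the returned threshold at a pixel would be treated as invisible and excluded from the quality score, which is exactly how such models plug into the evaluation pipeline.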
For audio/speech signals, there are some types of JND based on different components of the signals – such as amplitude, frequency, etc. [219]. For the amplitude of audio/speech signals, the JND for the average listener is about 1 dB [130, 220], while the frequency JND for the average listener is approximately 1 Hz for frequencies below 500 Hz and about f/500 for frequencies f above 500 Hz [221]. The ability to discriminate audio/speech signals in the temporal dimension is another important characteristic of the acuity of the HAS. The duration of the gap between two successive audio/speech signals must be at least 4–6 ms in length to be detected correctly [222]. The ability of the HAS to detect changes over time in the amplitude and frequency of audio/speech signals is dependent on the rate of change and amount of change in amplitude and frequency [131].
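The amplitude and frequency JND figures above can be written directly as small helper functions (values as cited from [130, 220, 221]; simplifications for the average listener).

```python
# Audio/speech JND helpers based on the figures cited in the text.

def frequency_jnd_hz(f):
    """Smallest detectable frequency change at frequency f (Hz) [221]."""
    return 1.0 if f < 500.0 else f / 500.0

def amplitude_jnd_db():
    """Smallest detectable level change, roughly 1 dB [130, 220]."""
    return 1.0
```

For example, a 2 Hz shift is detectable at 1 kHz but a 1 Hz shift is not, reflecting the roughly constant relative frequency resolution of the HAS above 500 Hz.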
Not every part of a multimedia presentation receives the same attention (the second problem of signal fidelity metrics mentioned in the introduction). This is because human perception selects a part of the signal for detailed analysis and then responds. Visual Attention (VA) refers to the selective awareness of and responsiveness to visual stimuli [223], a capability shaped by human evolution.
There are two types of cue that direct attention to a particular point in an image [224]: bottom-up cues that are determined by external stimuli, and top-down cues that are caused by a voluntary shift in attention (e.g., when the subject is given prior information/instruction to direct attention to a specific location/object). The VA process can be regarded as two stages [225]: in the pre-attention stage, all information is processed across the entire visual field; in the attention stage, the features may be bound together (feature integration [226], especially for a bottom-up process), or the dominant feature is selected [227] (for a top-down process).
Most existing computational VA models are bottom-up (i.e., based on contrast evaluation of various low-level features in images, in order to determine which locations stand out from their surroundings). As to the top-down (or task-oriented) attention, there is still a need for more focused research, although some initial work has been done [228, 229].
An influential bottom-up VA computational model was proposed by Itti et al. [230] for still images. An image is first low-pass filtered and down-sampled progressively from scale 0 (the original image size) to scale 8 (1:256 along each dimension). This is to facilitate the calculation of feature contrast, which is defined as

C(e, q) = |F(e) ⊖ Fl(q)|

where F represents the map for one of the image features (intensity, color, or orientation); e ∈ {2, 3, 4}, and F(e) denotes the feature map at scale e; q = e + δ, with δ ∈ {3, 4}, and Fl(q) is the interpolation to the finer scale e from the coarser scale q; ⊖ denotes point-by-point subtraction after this interpolation. In essence, C(e, q) evaluates pixel-by-pixel contrast for a feature, since F(e) represents the local information, while Fl(q) approximates the surroundings.
With one intensity channel, two color channels, and four orientation channels (0°, 45°, 90°, 135°; detected by Gabor filters), 42 feature maps are computed: 6 for intensity, 12 for color, and 24 for orientation. After cross-scale combination and normalization, the winner-takes-all strategy identifies the most interesting location on the map. There are various other approaches for visual attention modeling [231–234].
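The center-surround contrast at the heart of the Itti et al. model can be sketched for a single feature map as follows. This is a simplified illustration: box averaging replaces the Gaussian pyramid, nearest-neighbor interpolation replaces the interpolation used in [230], and the function names are illustrative.

```python
# Center-surround contrast |F(e) - interp(F(q))| for one feature map.

def downsample(img, factor):
    """Average-pool img (list of row lists) by an integer factor."""
    h, w = len(img) // factor, len(img[0]) // factor
    return [[sum(img[y * factor + dy][x * factor + dx]
                 for dy in range(factor) for dx in range(factor)) / factor ** 2
             for x in range(w)] for y in range(h)]

def upsample(img, factor):
    """Nearest-neighbor interpolation back to the finer scale."""
    return [[img[y // factor][x // factor]
             for x in range(len(img[0]) * factor)]
            for y in range(len(img) * factor)]

def center_surround(feature, delta=1):
    """Pixel-wise contrast of the fine scale against the coarser scale."""
    coarse = upsample(downsample(feature, 2 ** delta), 2 ** delta)
    return [[abs(feature[y][x] - coarse[y][x])
             for x in range(len(feature[0]))] for y in range(len(feature))]
```

A single bright pixel on a dark background produces a strong response at its own location, which is the "stand out from the surroundings" behavior the model is designed to capture.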
The VA map along the temporal dimension (over multiple consecutive video frames) can also be estimated. In the scheme proposed in [194] for video, different features (such as color, texture, motion, human skin/face) were detected and integrated for the continuous (rather than the winner-takes-all) salience map. In [235], auditory attention was also considered and integrated with visual factors. This was done by evaluating sound loudness and its sudden change, and a Support Vector Machine (SVM) was employed to classify each audio segment into speech, music, silence, and other sounds; the ratio of speech/music to other sounds was measured for saliency detection. Recently, Fang et al. proposed saliency detection models for images and videos based on DCT coefficients (with motion vectors for video) in the compressed domain [233, 236]. These models can be combined with quality metrics obtained from DCT coefficients and motion vectors for visual quality evaluation. A detailed overview and discussion of visual attention models can be found in a recent survey paper [232].
Contrast sensitivity reaches its maximum at the fovea and decreases toward the peripheral retina. The JND model represents the visibility threshold when the attention is there. In other words, JND and VA account for the local and global responses of the HVS in appreciating an image, respectively. The overall visual sensitivity at a location in the image could be JND modulated by the VA map [194]. Alternatively, the overall visual sensitivity may be derived by modifying the JND at every location according to its eccentricity away from the foveal points, with the foveation model in [237].
VA modeling is generally easier for video than still images. If an observer has enough time to view an image, many points of the image will be attended to eventually. The perception of video is different: every video frame is displayed to an observer for a very short time interval. Furthermore, camera and/or object motion may guide the viewer's eye movements and attention.
Compared with visual content, there is much less research on auditory attention modeling. Currently, there are several studies proposing auditory attention models for audio signals [238–240]. Motivated by the formation of auditory streams, the study [239] designs a conceptual framework of auditory attention, which is implemented as a computational model composed of a network of neural oscillators. Inspired by the successful visual saliency detection model proposed by Itti et al. [230], some other auditory attention models are proposed using feature contrast – such as frequency contrast, temporal contrast, etc. [238, 240].
The early image quality metrics are traditional signal fidelity metrics, which include MAE, MSE, SNR, PSNR, etc. As discussed previously, these metrics cannot predict image distortions as they are perceived. To address the drawback of traditional signal fidelity metrics, various perceptual-based image quality metrics have been proposed in the past decades [7, 11, 12, 24, 33–36]. These are classified and introduced in the following.
Early perceptual image quality metrics were developed based on simple and systematic modeling of relevant psychophysical or physiological properties. Mannos and Sakrison [10] proposed a visual fidelity measure based on CSF for images. Faugeras [42] introduced a simple model of human color vision based on experimental evidence for image evaluation. Another early FR and multichannel model is the Visible Differences Predictor (VDP) of Daly [19], where the HVS model accounts for sensitivity variations due to luminance adaptation, spatial CSF, and contrast masking. The cortex transform is performed for signal decomposition, and different orientations are distinguished. Most existing schemes in this category follow a similar methodology, with differences in the color space adopted, the type of spatiotemporal decomposition, or the error pooling methods. In the JNDmetrix model [20], the Gaussian pyramid [175] was used for decomposition, with luminance and chrominance components in the image. Liu et al. [48] proposed a JND model to measure the visual quality of images. The perceptual effect can be derived by considering inter-channel masking [46]. Other similar algorithms using CSF and visual masking are described in [25, 241].
Recently, various perceptual image quality metrics have been proposed using signal modeling or processing of the visual signals under consideration, incorporating certain specific knowledge (such as the specific distortion [21]). This approach is relatively less sophisticated and therefore less computationally expensive. In [53], the image distortion is measured by the DCT coefficient differences weighted by JND. Similarly, the well-cited SSIM (Structural SIMilarity) was proposed by Wang and Bovik based on the sensitivity of the HVS to image structure [36, 54, 242]. SSIM can be calculated as

Q = [σab / (σa σb)] · [2σa σb / (σa² + σb²)] · [2ā b̄ / (ā² + b̄²)]    (3.7)

where a and b represent the original and test images; ā and b̄ are their corresponding means; σa and σb are the corresponding standard deviations; and σab is the cross covariance. The three terms in equation (3.7) measure the loss of correlation, contrast distortion, and luminance distortion, respectively. The dynamic range of the SSIM value Q is [−1, 1], with the best value of 1 achieved when a = b.
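The three-term index of equation (3.7) can be computed globally over two signals as follows. This is a simplified sketch: practical SSIM [36] adds small stabilizing constants to avoid division by zero and evaluates the index over a sliding local window rather than the whole image.

```python
# Global three-term quality index: correlation x contrast x luminance.

def quality_index(a, b):
    """a, b: flat lists of pixel values from the original and test images."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    va = sum((x - ma) ** 2 for x in a) / n
    vb = sum((x - mb) ** 2 for x in b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    sa, sb = va ** 0.5, vb ** 0.5
    correlation = cov / (sa * sb)                   # loss of correlation
    contrast = 2 * sa * sb / (va + vb)              # contrast distortion
    luminance = 2 * ma * mb / (ma ** 2 + mb ** 2)   # luminance distortion
    return correlation * contrast * luminance
```

Identical signals score exactly 1; a uniformly scaled copy keeps perfect correlation but is penalized by the contrast and luminance terms, illustrating why this index responds to structural change rather than pointwise error alone.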
Studies show that SSIM bears a certain relationship with MSE and PSNR [55, 243]. In [243], PSNR and SSIM are compared by their analytical formulas. The analysis shows that there is a simple logarithmic link between them for several common degradations, including Gaussian blur, additive Gaussian noise, JPEG and JPEG2000 compression [243]. PSNR and SSIM can be considered as closely related quality metrics, with differences in the degree of sensitivity to some image degradations.
Another method for feature detection with consideration of structural information is Singular Value Decomposition (SVD) [111]. With more theoretical background, the Visual Information Fidelity (VIF) [35] (an extension of the study [34]) is proposed based on the assumption that the Random Field (RF) from a sub-band of the test image, D, can be expressed as

D = GU + V
where U denotes the RF from the corresponding sub-band of the reference image, G is a deterministic scale gain field, and V is a stationary additive zero-mean Gaussian noise RF. The proposed model takes into account additive noise and blur distortion; it is argued that most distortion types prevalent in real-world systems can be roughly described locally by a combination of these two. The resultant metric measures the amount of information that can be extracted about the reference image from the test. In other words, the amount of information lost from a reference image as a result of distortion gives the loss of visual quality.
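Under this gain-plus-noise model, the local scale gain and noise variance can be estimated per block by simple linear regression of the distorted samples against the reference samples. This is only a sketch of the local parameter estimation used in VIF-style metrics, not the full information-theoretic computation of [35].

```python
# Per-block estimation of gain g and noise variance under D = gU + V.

def estimate_gain_noise(u, d):
    """u, d: reference / distorted samples from one sub-band block."""
    n = len(u)
    mu, md = sum(u) / n, sum(d) / n
    var_u = sum((x - mu) ** 2 for x in u) / n
    cov = sum((x - mu) * (y - md) for x, y in zip(u, d)) / n
    g = cov / var_u                                  # deterministic scale gain
    var_d = sum((y - md) ** 2 for y in d) / n
    sigma_v2 = var_d - g * cov                       # additive noise variance
    return g, sigma_v2
```

A block that is purely attenuated (pure blur-like scaling) yields the scaling factor as g with near-zero noise variance, while additive distortion shows up in sigma_v2; the metric then quantifies how much reference information survives these two effects.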
Another image quality metric with good theoretical foundations is the Visual Signal-to-Noise Ratio (VSNR) [33], which operates in two stages. In the first stage, the contrast threshold for distortion detection in the presence of the image is computed via wavelet-based models of visual masking and visual summation, in order to determine whether the distortion in the test image is visible. If the distortion is below the threshold of detection, the test image is deemed to be of perfect visual fidelity (VSNR = ∞). If the distortion is above the threshold, a second stage is applied, which operates based on the property of perceived contrast, and the mid-level visual property of global precedence. These two properties are modeled as Euclidean distances in distortion-contrast space of a multiscale wavelet decomposition, and VSNR is computed based on a simple linear sum of these distances.
Larson and Chandler [187] proposed a perceptual image quality metric called the “most apparent distortion” based on two separate strategies. Local luminance and contrast masking are adopted to estimate detection-based perceived distortion in high-quality images, while changes in the local statistics of spatial-frequency components are used to estimate appearance-based perceived distortion in low-quality images [187]. Recently, some new image quality metrics have been proposed using new concepts or methods [111, 169, 188, 189, 244]. Wu et al. [169] adopted the concept of Internal Generative Mechanism (IGM) theory to divide image regions into two parts, a predicted portion and a disorderly portion, which are measured by the structural similarity and PSNR metrics, respectively. Liu et al. [244] used gradient similarity to measure the change in contrast and structure. An emerging class of image quality metrics is based on machine learning techniques [111, 188, 189].
Some RR metrics for images are designed based on the properties of the HVS. In [88], several factors of the HVS – including CSF, psychophysical sub-band decomposition, and masking effect modeling – are adopted to design an RR quality metric for images. The study in [91] proposes an RR quality metric for wireless imaging based on the observation that the HVS is trained to extract structural information from the viewing area. In [94], an RR quality metric is designed based on the wavelet transform, which is used for extracting features to simulate the psychological mechanisms of the HVS. The study in [95] adopts the phase and magnitude of the 2D discrete Fourier transform to build an RR quality metric, motivated by the fact that the sensitivity of the HVS is frequency dependent. Recently, Zhai et al. [245] used the free-energy principle from cognitive processing to develop a psychovisual RR quality metric for images.
In RR metrics for images, various features can be extracted to measure the visual quality. The image distortion of some RR metrics is calculated based on features extracted from the spatial domain – such as color correlograms [99], image statistics in the gradient domain [101], structural information [77, 86], etc. Other RR metrics are proposed using features extracted in the transform domain – such as wavelet coefficients [98, 105, 106], coefficients from divisive normalization transform [104], DCT coefficients [102, 108], etc.
Some RR metrics are designed for specific distortion types. A hybrid image quality metric is designed by fusing several existing techniques to measure five specific artifacts in [79]. Other RR metrics are proposed to measure the distortion from JPEG compression [83], distributed source coding [96], etc.
NR quality metrics for images are proposed based on various features or specific distortion types. Many studies compute edge-extent features of images to build their NR quality metrics [136, 138, 140]. The natural scene statistics of DCT coefficients are used to measure the visual quality of images in [116]. In that metric, a Laplace probability density function is adopted to model the distribution of DCT coefficients. DCT coefficients are also used to measure blur artifacts [143], blockiness artifacts [134, 135], and so on. Similarly, some NR quality metrics for images use features extracted by the Fourier transform to measure different types of artifact – such as blur [145], blockiness [24], etc. Besides, NR quality metrics can be designed based on other features – such as sharpness [137], ringing [146], naturalness [148], and color [149]. In some NR quality metrics, blockiness, blurring, or ringing features are combined with other features, such as the bit-stream feature [151], edge gradient [152], and so on. The noise-estimation-based NR image quality metrics calculate MSE based on the difference between the processed signal and its smoothed version [126], or the variation within certain smooth regions in visual signals [129].
Compared with visual quality metrics for 2D images, quality metrics for 3D images must additionally consider depth perception. The HVS uses a multitude of depth cues, which can be classified into oculomotor cues coming from the eye muscles, and visual cues from the scene content itself [163, 246, 247]. The oculomotor cues include accommodation and vergence [246]. Accommodation refers to the variation of the lens shape and thickness, which allows the eyes to focus on an object at a certain distance, while vergence refers to the muscular rotation of the eyeballs, which is used to converge both eyes on the same object. There are two types of visual cue, namely monocular and binocular [246]. Monocular visual cues include relative size, familiar size, texture gradients, perspective, occlusion, atmospheric blur, lighting, shading, shadows, motion parallax, etc. The most important binocular visual cue is the retinal disparity between points of the same objects viewed from slightly different angles by the two eyes, which is exploited in stereoscopic 3D systems such as 3DTV.
Although 3D image quality evaluation is a challenging problem due to the complexities of depth perception, a number of 3D image quality metrics have been proposed [163]. Most existing 3D image quality metrics evaluate the distortion of 3D images by combining the evaluation results of a 2D image pair and additional factors – such as depth perception, visual comfort, and other visual experiences. In [248], 2D image quality metrics are combined with disparity information to predict the visual quality of 3D compressed images with blurring distortion. Similarly, [249] integrates the disparity information with 2D quality metrics for quality evaluation for 3D images. In [250], a 3D image quality metric is designed based on absolute disparity information. Existing studies also explore the visual quality assessment for 3D images based on characteristics of the HVS – such as contrast sensitivity [251], viewing experience [252], binocular visual characteristics [253], etc.
Furthermore, there are several NR metrics proposed for 3D image quality evaluation. In [254], an NR 3D image quality metric is built for JPEG-coded stereoscopic images based on segmented local features of artifacts and disparity. Another NR quality metric for 3D image quality assessment is based on the nonlinear additive model, ocular dominance model, and saliency-based parallax compensation [141].
The history and development of video quality metrics shares many similarities with image quality metrics, with the additional consideration of temporal effects.
As stated previously, there are two types of perceptual visual quality metrics: vision-based and signal-driven [2, 255–269]. Vision-based approaches include for example [22, 255], where HVS-based visual quality metrics for coded video sequences are proposed based on contrast sensitivity and contrast masking. The study in [47] proposes a JND-based metric to measure the visual quality for video sequences. In [263], a MOtion-based Video Integrity Evaluation (MOVIE) metric is proposed based on characteristics of the Middle Temporal (MT) visual area of the human visual cortex for video quality evaluation. In [266], an FR quality metric is proposed to improve the video evaluation performance of the quality metrics in [264, 265]. Vision-based approaches typically measure the distortion of the processed video signal in the spatial domain [49], DCT domain [21], or wavelet domain [50, 262].
Signal-driven video quality metrics are based primarily on the analysis of specific features or artifacts in video sequences. In [256], Wang et al. propose a video structural similarity index based on SSIM to predict the visual quality of video sequences. In that study, an SSIM-based video quality metric evaluates the visual quality of video sequences at three levels: the local region level, the frame level, and the sequence level. A similar metric for visual quality evaluation is proposed in [257]. In [258], another SSIM-based video quality metric is designed based on a statistical model of human visual speed perception described in [259]. In [260], an FR video quality metric is proposed based on singular value decomposition. The video quality metric software tools of [51] provide standardized methods to measure the perceived quality of video systems. With the general model in that study, the main distortion types include blurring, blockiness, jerky/unnatural motion, noise in luminance and chrominance channels, and error blocks. In [261], a video quality metric is proposed based on the correlation between subjective (MOS) and objective (MSE) results. In [270], a low-complexity video quality metric is proposed based on temporal quality variations. Some existing metrics make use of both classes (vision-based and signal-driven). The metric proposed in [271] combines model-based and signal-driven methods based on the extent of blockiness in decoded video. A model-based metric was applied to blockiness-dominant areas in [29], with the help of a signal-driven measure.
Recently, High-Definition Television (HDTV) video has become widespread, driven by large demand from users and the development of high-speed broadband network services. Compared with Standard-Definition Television (SDTV) video, HDTV content needs higher-resolution display screens. Although the viewing distance of HDTV systems is closer in terms of image height compared with SDTV systems, approximately the same number of pixels per degree of viewing angle exists in both systems due to the higher spatial resolution of HDTV [267]. However, the larger total horizontal viewing angle for HDTV (about 30°) may influence quality decisions compared with that for SDTV (about 12°) [267]. Additionally, with high-resolution display screens for HDTV, human eyes roam the picture in order to track specific objects and their motion, which causes the visual distortion outside the immediate area of attention to be perceived less than with SDTV [267]. With emerging HDTV applications, some objective quality metrics have been proposed to evaluate the visual quality of HDTV specifically. The study in [267] conducts experiments to assess whether the NTIA general video quality metric [51] can be used to measure the visual quality of HDTV video. In [268], spatiotemporal features are extracted from visual signals to estimate the perceived quality degradation caused by compression coding. In [269], an FR objective quality metric is proposed based on a fuzzy measure to evaluate coding distortion in HDTV content. Recently, some studies have investigated the visual quality of Ultra-High Definition (UHD) video sequences by subjective experiments [272–274]. The study in [274] conducts subjective experiments to analyze the performance of popular objective quality metrics (PSNR, VSNR, SSIM, MS-SSIM, VIF, and VQM) on 4K UHD video sequences; experimental results show the content-dependent nature of most objective metrics, except VIF [274].
The subjective experiments in [273] demonstrate that the HEVC-encoded YUV420 4K-UHD video at a bit rate of 18 Mb/s has good visual quality in the usage of legacy DTV broadcasting systems with single-channel bandwidths of 6 MHz. The study in [272] presents a set of 15 4K UHD video sequences for the requirements of visual quality assessment in the research community.
Video is a much more suitable application for RR metrics than images because of the streaming nature of the content and the much higher data rates involved. Typically, low-level spatiotemporal features from the original video are extracted as reference. Features from the reference video can then be compared with those from the processed video.
In the work performed by Wolf and Pinson [80], both spatial and temporal luminance gradients are computed to represent the contrast, motion, amount, and orientation of activity. Temporal gradients due to motion facilitate detecting and quantifying related impairments (e.g., jerkiness) using the time history of temporal features. The metric performs well in the VQEG FR-TV Phase II Test [275].
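Spatial and temporal luminance-gradient features of this kind can be sketched simply; the version below is a hedged illustration close in spirit to the spatial/temporal activity features described above, not the exact filters or pooling of [80].

```python
# Spatial activity from within-frame differences; temporal activity
# from frame-to-frame differences.

def spatial_activity(frame):
    """Mean absolute horizontal/vertical luminance gradient of one frame."""
    h, w = len(frame), len(frame[0])
    acc, n = 0.0, 0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                acc += abs(frame[y][x + 1] - frame[y][x]); n += 1
            if y + 1 < h:
                acc += abs(frame[y + 1][x] - frame[y][x]); n += 1
    return acc / n if n else 0.0

def temporal_activity(prev, cur):
    """Mean absolute luminance change between two consecutive frames."""
    h, w = len(cur), len(cur[0])
    return sum(abs(cur[y][x] - prev[y][x])
               for y in range(h) for x in range(w)) / (h * w)
```

An RR metric transmits such scalar features for the reference video and compares them with the same features computed on the processed video, so only a few values per frame need to cross the reduced-reference channel.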
Some RR metrics for video are proposed based on specific features or properties of the HVS in the spatial domain. In [87], an RR quality metric is designed based on a psychovisual color space from high-level human visual behavior for color video. The RR quality metric for video in [89] takes advantage of contrast sensitivity. The study in [92] incorporates the texture-masking property of the HVS. In [93], an RR quality metric is proposed based on features of SVD and the HVS for wireless applications. An RR quality metric is proposed in [100], based on temporal motion smoothness. In [103], RR video quality metrics are proposed based on SSIM features.
Other RR metrics for video work directly in the encoded domain. In [81], blurring and blockiness from video compression are measured by a discriminative analysis of harmonic strength extracted from edge-detected images. In [80], RR quality metrics are proposed based on spatial and temporal features to measure the distortion occurring in standard video compression and communication systems. DCT coefficients are used to extract features for the perceptual quality evaluation of MPEG2-coded video in [82]. In [84], an RR quality metric is proposed based on multivariate data analysis to measure the artifacts of H.264/AVC video sequences. The RR quality metric in [107] also extracts features from DCT coefficients to measure the quality of distorted video. In [97], the differences between entropies of the wavelet coefficients of the reference and distorted video are calculated to measure the distortion of video signals.
For NR video quality measurement, many studies build their metrics based on direct estimation of MSE or PSNR caused by specific block-based compression standards such as MPEG2, H.264, etc. [112, 114, 117, 118, 120]. In [112], the PSNR is calculated from the estimated quantization error caused by compression for visual quality evaluation. The study in [114] estimates PSNR based on DCT coefficients of MPEG2 video for visual quality evaluation. The transform coefficients are modeled by different distributions for visual quality evaluation such as a Gaussian model [118], Laplace model [117], and Cauchy distribution [120]. Some NR quality metrics have tried to measure the MSE caused by packet-loss errors [121, 124]. Bit-stream-based approaches predict the quality of video from the compressed video stream with packet losses [121]. The NR quality metric in [124] is designed to detect packet loss caused by specific compression of H.264 and motion-JPEG 2000, respectively. The noise-estimation-based NR quality metrics calculate the MSE based on the variation within certain smooth regions in visual signals [129]. Other NR quality metrics incorporate the characteristics of HVS for measuring the quality of visual content [139].
Besides the direct estimation of the MSE, feature-based NR metrics have been proposed for video quality assessment. The features can be extracted from different domains: in [27, 132], blockiness artifacts are measured using features extracted from the spatial domain, while [134] evaluates video quality based on features in the DCT domain. Various types of features are used to calculate the distortion in video signals. In [152], edge features are extracted from visual signals to build NR quality metrics. In the NR quality metrics of [143, 144], a blurring feature is extracted from DCT coefficients. Some studies build NR video quality metrics by combining blockiness, blurring, and ringing features [150, 151]. In addition, specific NR metrics have been proposed to measure flicker [153] or frame freezes [154, 156]. In [281], an NR quality metric for HDTV is proposed to evaluate blockiness and blur distortions.
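A spatial-domain blockiness feature of the kind used in [27, 132] can be sketched as follows: compare the average luminance jump across the 8x8 coding-block grid with the average jump elsewhere. The specific ratio and names below are illustrative assumptions, not the published algorithms.

```python
import numpy as np

def blockiness(frame: np.ndarray, block: int = 8) -> float:
    """NR blockiness score: ratio of the mean absolute luminance jump
    across block-boundary columns to the mean jump at interior columns.
    Values well above 1 suggest visible blocking artifacts."""
    f = frame.astype(np.float64)
    diffs = np.abs(f[:, 1:] - f[:, :-1])       # horizontal gradients
    cols = np.arange(diffs.shape[1])
    at_boundary = (cols % block) == block - 1  # gradients straddling a block edge
    boundary = diffs[:, at_boundary].mean()
    interior = diffs[:, ~at_boundary].mean()
    return boundary / (interior + 1e-12)

rng = np.random.default_rng(1)
blocks = rng.integers(0, 256, size=(8, 8))
blocky = np.repeat(np.repeat(blocks, 8, axis=0), 8, axis=1)  # piecewise-constant 8x8 blocks
smooth = np.tile(np.arange(64.0), (64, 1))                   # uniform horizontal ramp
print(blockiness(blocky))  # large: all jumps sit on the block grid
print(blockiness(smooth))  # close to 1: jumps are spread evenly
```

Published metrics refine this idea with vertical gradients, masking models, and grid-offset detection, but the boundary-versus-interior contrast is the core signal.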
Recently, some studies have investigated quality metrics for emerging 3D video processing applications. Experimental results in [276, 277] show that 2D quality metrics can be used to evaluate the quality of 3D video content. The study in [190] discusses the importance of visual attention in 3DTV quality assessment. In [278], an FR stereo-video quality metric is proposed based on a monoscopic quality component and a stereoscopic quality component. A 3D video quality metric based on the spatiotemporal structural information extracted from adjacent frames is proposed in [279]. Some studies also use characteristics of the HVS, including the CSF, visual masking, and depth perception, to build perceptual 3D video quality metrics [28, 280]. Besides FR quality metrics, RR and NR quality metrics for 3D video have also been investigated in [78] and [133], respectively. However, 3D video quality measurement remains an open research area because of the complexities of depth perception [246, 282].
Just as for images and video, traditional objective signal measures for audio/speech quality assessment are built on basic mathematical measurements such as SNR and MSE. They do not take the psychoacoustic features of the HAS into consideration and thus cannot match the performance of perceptual audio/speech quality assessment methods. Their shortcomings are especially evident for non-linear and non-stationary audio/speech codecs [3]. To overcome these drawbacks, various perceptual objective quality evaluation algorithms have been proposed based on characteristics of the HAS such as the perception of loudness and frequency, and masking [3, 62, 177]. The amplitude of an audio signal corresponds to the amplitude of the air-pressure variation of the sound wave; loudness is the listener's perception of this pressure level. The frequency of audio signals is measured in cycles per second (Hz), and humans can perceive frequencies in the range of roughly 20 Hz to 20 kHz. Generally, the sensitivity of the HAS is frequency dependent. Auditory masking occurs when the perception of one audio signal is affected by another; masking in the frequency domain is known as simultaneous masking, while masking in the time domain is known as temporal masking.
Currently, most existing FR models (also called intrusive models) for audio/speech signals adopt perceptual models to transform both the reference and the distorted signal for feature extraction [63-66, 71, 178]. The quality of the distorted signal is estimated from the distance between the features of the reference and distorted signals in the transform domain. NR models (also called non-intrusive models) estimate the quality of distorted speech signals without reference signals. Currently, there is no NR model for general audio signals. Existing NR models for speech calculate distortion based on signal production, signal likelihood, perceived noise loudness, etc. [62, 155, 283].
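The generic FR (intrusive) pipeline can be sketched as follows. A framewise log-magnitude spectrum stands in for the psychoacoustic internal representation used by the actual models in [63-66, 71, 178]; all names and parameters here are illustrative.

```python
import numpy as np

def log_spectrum(signal: np.ndarray, frame_len: int = 256) -> np.ndarray:
    """Crude 'internal representation': framewise log-magnitude spectra.
    Real FR models use psychoacoustic transforms (critical bands,
    masking, loudness); the plain FFT here is only a placeholder."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    mag = np.abs(np.fft.rfft(frames, axis=1))
    return np.log10(mag + 1e-12)

def fr_distortion(reference: np.ndarray, degraded: np.ndarray) -> float:
    """FR skeleton: transform both signals, then score the degraded one
    by the mean distance between the two feature representations."""
    return float(np.mean(np.abs(log_spectrum(reference) - log_spectrum(degraded))))

rng = np.random.default_rng(0)
t = np.arange(2048)
ref = np.sin(2 * np.pi * 0.01 * t)
degraded = ref + 0.1 * rng.standard_normal(t.size)
print(fr_distortion(ref, ref))       # identical signals -> 0.0
print(fr_distortion(ref, degraded))  # added noise -> positive distortion
```

Everything that distinguishes one standardized metric from another lives in the transform and in the final distance-to-quality mapping; the two-branch compare-in-feature-space structure is common to all of them.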
We are not aware of any RR metrics for audio/speech quality assessment in the literature.
Studies of FR audio/speech quality metrics began with the need to evaluate low-bit-rate speech and audio codecs [62]. Since the 1970s, many studies have adopted perception-based models in speech/audio codecs to shape the coding distortion for minimum audibility, rather than minimum MSE, for improved perceived quality [30]. In [63], a noise-to-mask ratio measure was designed based on a perceptual masking model, comparing the level of the coding noise with a masking threshold derived from the reference signal. Similar waveform-difference measures include [64, 65]. The problem with these methods is that the estimated quality can be unreasonable for distorted signals whose waveform changes substantially, since such changes produce large waveform differences even when the perceived quality is acceptable [62]. To overcome this problem, researchers have extracted signal features in a transform domain consistent with a hypothetical representation of the signal in the brain or peripheral auditory system. One successful approach is the auditory spectrum distance model [66], which is widely used in ITU standards [65, 71, 178]. In that model [66], features of peripheral hearing in the time and frequency domains are extracted for quality evaluation based on psychoacoustic theory. The study in [67] adopts a model of the HAS to calculate an internal representation of audio signals in the psychophysical domain. In these models, the time signals are first mapped into the time-frequency domain, and then smeared and compressed to obtain two time-frequency loudness density functions [62]. These density functions are passed to a cognitive model that interprets their differences, possibly with substantial additional processing [62]. Generally, the cognitive model is trained on a large training database and should be validated on separate test data.
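The noise-to-mask ratio idea of [63] reduces to a simple per-band comparison once the masking thresholds are available. The sketch below assumes per-band noise energies and masking thresholds are already computed; the psychoacoustic model that derives the thresholds from the reference signal is omitted, and the example numbers are made up.

```python
import math

def noise_to_mask_ratio_db(noise_energy, mask_threshold):
    """Per-band NMR in dB: coding-noise energy relative to the masking
    threshold. NMR <= 0 dB in every band means the coding noise should
    be inaudible; positive bands indicate audible distortion."""
    return [10.0 * math.log10(n / m) for n, m in zip(noise_energy, mask_threshold)]

# Hypothetical band energies: noise below the mask in band 0,
# exactly at the mask in band 1, and above it in band 2.
nmr = noise_to_mask_ratio_db([1.0, 4.0, 9.0], [10.0, 4.0, 3.0])
print(nmr)
```

A codec that shapes its quantization noise to keep every band's NMR below 0 dB spends bits only where the ear would notice their absence, which is exactly the "minimum audibility rather than minimum MSE" principle described above.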
Thanks to their grounding in psychoacoustic theory, these perceptual quality metrics show promising prediction performance for many aspects of psychoacoustic data [66, 67]. Other studies improve on existing metrics by using more detailed or advanced HAS models [17, 68-70, 73].
For audio quality assessment, the study in [284] calculates the probability of detecting noise as a function of time for coded audio signals. The study in [55] develops a model of the human ear for the perceptual coding of audio signals. A frequency-response equalization process is used in [179] for the quality assessment of audio signals. The study in [56] proposes an advanced quality metric based on a wide range of perceptual transformations. Other studies predict the perceived quality of audio signals based on estimates of the frontal and surround spatial fidelity of multichannel audio [76], new distortion parameters and a cognitive model [57], and a multichannel expert system [58]. Recently, several perceptual objective metrics for audio signals have been proposed using an energy equalization approach [59, 60] and a mean structural similarity measure [18].
Speech quality assessment has an even longer history, with many metrics [61, 64, 66, 68, 69, 73-75, 285]. An early perceptual speech quality metric was proposed by Karjalainen based on the features of peripheral hearing in time and frequency known from psychoacoustic theory [66]. Later, a simpler approach known as the Perceptual Speech Quality Measure (PSQM) was standardized as ITU-T P.861 [65]. PSQM improved on earlier models in its silent-interval processing, giving less emphasis to noise in silent periods than during speech, and in its use of asymmetry weighting [62]. The drawback of PSQM and other early models is that they were trained on subjective tests of generic speech codecs, so their performance is poor on some types of telephone network [62]. To address this problem, objective metrics for speech signals were proposed for specific telephone network conditions [74, 285, 286]. Several more recent FR speech quality metrics have been based on Bayesian modeling [31], an adaptive feedback canceller [32], etc.
NR speech quality evaluation is more challenging due to the lack of a reference signal. However, NR models are much more useful in practical applications such as wireless communications, voice over IP, and other in-service networks that require speech quality monitoring, where the reference signal is unavailable. Currently, there is no NR quality metric for general audio signals, but several studies have proposed NR quality metrics for speech signals based on specific features.
Several NR speech quality metrics are designed around specific distortions introduced by standard codecs or transmission networks. An early NR speech quality metric is built on the spectrogram of the perceived signal for wireless communication [113]. The metric in [115] adopts Gaussian Mixture Models (GMMs) to create an artificial reference against which the degraded speech is compared, whereas in [119] speech quality is predicted by Bayesian inference and Minimum Mean Square Error (MMSE) estimation based on a trained set of GMMs. In [131], a perceptually motivated speech quality metric is presented based on a temporal envelope representation of speech. The study in [123] proposes a low-complexity NR speech quality metric based on features extracted from commonly used speech coding parameters (e.g., spectral dynamics). Features are extracted both globally and locally to design the NR speech quality metric in [109]. Machine learning techniques are adopted to predict the quality of distorted speech signals in [128].
Other studies develop NR quality metrics for noise-suppressed speech signals. An NR speech quality metric based on Kullback-Leibler distances is proposed in [127] for noise-suppressed speech. In [125], an NR metric for noise-suppressed speech is built on mel-filtered energies and support vector regression.
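The use of Kullback-Leibler distances in metrics like [127] can be illustrated with a minimal sketch: fit a univariate Gaussian to a feature sequence from the test signal and to clean-speech statistics, and use the divergence between them as a distortion score. The actual metric uses richer feature distributions; the names and Gaussian assumption below are illustrative.

```python
import math

def gaussian_kl(mu_p: float, var_p: float, mu_q: float, var_q: float) -> float:
    """KL divergence KL(p || q) between two univariate Gaussians."""
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def kl_quality_score(feat_test, feat_clean) -> float:
    """Fit a Gaussian to each feature sequence; larger divergence from
    the clean-speech statistics predicts lower quality."""
    def fit(x):
        mu = sum(x) / len(x)
        var = sum((v - mu) ** 2 for v in x) / len(x)
        return mu, var
    mu_t, var_t = fit(feat_test)
    mu_c, var_c = fit(feat_clean)
    return gaussian_kl(mu_t, var_t, mu_c, var_c)
```

Because only the clean-speech statistics are needed at test time (not a clean copy of the same utterance), the approach remains reference-free.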
Generally, we watch video with an accompanying soundtrack. Therefore, comprehensive audiovisual quality metrics are required to analyze both modalities of multimedia content together. Audiovisual quality comprises two factors: synchronization between the two media signals (i.e., lip sync) and the interaction between audio and video quality [5, 44]. Various studies have addressed audio/video synchronization. In lip-sync experiments, viewers perceive audio and video to be in sync up to about 80 ms of delay [287]. Tolerance is consistently higher when video is ahead of audio than vice versa, probably because this is the more natural occurrence in the real world, where light travels faster than sound. Similar results were reported in experiments with non-speech clips showing a drummer [288]. The interaction between audio and video signals is another factor influencing the overall quality of multimedia content, as shown by studies from neuroscience [289]. In [289], Lipscomb claims that at least two implicit judgments are made during the perceptual processing of the video experience: an association judgment and a mapping of accent structures. The experimental results also indicate that, for this interaction effect, the importance of synchronization decreases as the audiovisual content becomes more complex [289].
Since most existing audiovisual quality metrics combine separate audio and video quality evaluations, the study in [44] analyzes the mutual influence between audio quality, video quality, and audiovisual quality. From the experimental analysis, the study draws several general conclusions. Firstly, both audio and video quality contribute to the overall audiovisual quality, and their product correlates most highly with the overall quality. Secondly, the overall quality is in general dominated by the video quality, whereas audio quality is more important when the bit rates of both the coded audio and video are low, or when the video quality exceeds a certain threshold. As audio quality decreases, its influence on the overall quality increases. Additionally, in applications where audio is clearly more important than video content (such as teleconferencing, news, and music video), audio quality dominates the overall quality. Finally, audiovisual quality is also influenced by other factors, including the motion information and complexity of the video content [44].
In [290], subjective experiments were carried out on audio, video, and audiovisual quality, with results demonstrating that both audio and video quality contribute significantly to perceived audiovisual quality. The study also shows that audiovisual quality can be predicted with high accuracy by a linear or bilinear combination of the audio and video quality scores. Accordingly, many studies evaluate the quality of audio/video signals by a linear combination of audio and video quality [291, 292].
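A bilinear fusion of the kind reported to work well in [290] can be sketched as follows. The cross term captures the audio-video interaction; the weights below are illustrative placeholders that would in practice be fitted to subjective MOS data, not values from any published model.

```python
def audiovisual_quality(q_a: float, q_v: float,
                        a0: float = 0.0, a1: float = 0.2,
                        a2: float = 0.3, a3: float = 0.1) -> float:
    """Bilinear fusion of audio (q_a) and video (q_v) quality scores,
    e.g. on a 1-5 MOS scale. a2 > a1 reflects the general dominance of
    video quality; the cross term a3*q_a*q_v models their interaction.
    All weights here are illustrative, not fitted values."""
    return a0 + a1 * q_a + a2 * q_v + a3 * (q_a * q_v)
```

Setting a1 = a2 = 0 recovers the pure multiplicative model that [44] found to correlate best with overall quality, while a3 = 0 gives the plain linear combination used in [291, 292].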
Studies on audiovisual quality metrics have focused mainly on low-bit-rate applications such as mobile communications, where the audio stream can take up a significant part of the total bit rate [293, 294]. The audiovisual model proposed in [295] incorporates audio/video synchronization in addition to the fusion of audio and video quality. Some studies focus on audiovisual quality evaluation for video-conferencing applications [291, 292, 296]. The study in [297] presents a basic audiovisual quality metric based on subjective experiments on multimedia signals with simulated artifacts. Although the test data used in these studies differs widely in content range and distortion types, the models obtain good prediction performance. In [142], an NR audiovisual quality metric with good prediction performance is proposed. The study in [298] presents a graph-based perceptual audiovisual quality metric that accounts for the contributions of the individual modalities (audio and video) as well as the contribution of their relation. Some studies propose audiovisual quality metrics based on semantic analysis [299, 300].
Although some studies have investigated audiovisual quality metrics, progress on joint audiovisual quality assessment has been slow. The interaction between audio and video perception is complicated, and the perception of audiovisual content still lacks deep investigation. Many existing metrics are based on a linear fusion of audio and video quality, but most studies choose the fusion parameters empirically, without theoretical support and with little if any integration of the two modalities in the metric computation. Nevertheless, audiovisual quality assessment deserves further investigation due to its wide applicability in signal coding, signal transmission, etc.
Currently, traditional signal fidelity metrics are still widely used to evaluate the quality of multimedia content. However, perceptual quality metrics have shown promise, and a large number of perceptual quality assessment metrics have been proposed for various types of content, as introduced in this chapter. Over the past ten years, some perceptual quality metrics, such as SSIM, have gained popularity and are used in various signal-processing applications. In the past, most effort focused on designing FR metrics for audio or video, since it is difficult to obtain good evaluation performance with RR or NR quality metrics. However, effective NR metrics are much desired, with more and more multimedia content (such as image, video, and music files) being distributed over the Internet today. Widespread Internet transmission and new compression standards bring new challenges for multimedia quality evaluation, such as new types of transmission loss and compression distortion. Additionally, emerging 3D systems and displays require new quality metrics; depth perception in particular should be investigated further for 3D quality evaluation. Other important topics include quality assessment for super-resolution and High Dynamic Range (HDR) images/video. All these emerging content types and their corresponding processing methods bring many challenges for multimedia quality evaluation.
CSF: Contrast Sensitivity Function
DCT: Discrete Cosine Transform
DWT: Discrete Wavelet Transform
FFT: Fast Fourier Transform
FR: Full Reference
GMM: Gaussian Mixture Model
HAS: Human Auditory System
HDR: High Dynamic Range
HDTV: High-Definition Television
HEVC: High-Efficiency Video Coding
HVS: Human Visual System
IGM: Internal Generative Mechanism
JND: Just-Noticeable Difference
MAE: Mean Absolute Error
MMSE: Minimum Mean Square Error
MOS: Mean Opinion Score
MOV: Model Output Variable
MOVIE: Motion-Based Video Integrity Evaluation
MSE: Mean Square Error
MT: Middle Temporal
NR: No Reference
ODG: Overall Difference Grade
PEAQ: Perceptual Evaluation of Audio Quality
PSF: Point Spread Function
PSNR: Peak SNR
PSQM: Perceptual Speech Quality Measure
RR: Reduced Reference
SDTV: Standard-Definition Television
SNR: Signal-to-Noise Ratio
SSIM: Structural Similarity
SVD: Singular Value Decomposition
UHD: Ultra-High Definition
VA: Visual Attention
VDP: Visible Differences Predictor
VIF: Visual Information Fidelity
VSNR: Visual Signal-to-Noise Ratio