3 Review of Existing Objective QoE Methodologies

Yuming Fang1, Weisi Lin1 and Stefan Winkler2

1Nanyang Technological University, Singapore

2Advanced Digital Sciences Center (ADSC), Singapore

3.1 Overview

Quality evaluation for multimedia content is a fundamental and challenging problem in the field of multimedia processing, as well as in various practical applications such as process evaluation, implementation, optimization, testing, and monitoring. Generally, the quality of multimedia content is affected by various factors such as acquisition, processing, compression, transmission, output interface, decoding, and other systems [1–3]. The perceived quality of impaired multimedia content depends on various factors: the individual interests, quality expectations, and viewing experience of the user; output interface type and properties; and so on [2–5].

Since the Human Visual System (HVS) and the Human Auditory System (HAS) are the ultimate receiver and interpreter of the content, subjective measurement represents the most accurate method and thus serves as the benchmark for objective quality assessment [2–4]. Subjective experiments require a number of subjects to watch and/or listen to the test material and rate its quality. The Mean Opinion Score (MOS) is used for the average rating over all subjects for each piece of multimedia content. A detailed discussion of subjective measurements can be found in Chapter 6. Although subjective experiments are accurate for the quality evaluation of multimedia content, they suffer from certain important drawbacks and limitations – they are time consuming, laborious, expensive, and so on [2]. Therefore, many objective metrics have been proposed to evaluate the quality of multimedia content in past decades. Objective metrics try to approximate human perceptions of multimedia quality. Compared with subjective viewing results, objective metrics are advantageous in terms of repeatability and scalability.

Objective quality evaluation methods can be classified into two broad types: signal fidelity metrics and perceptual quality metrics [2]. Signal fidelity metrics evaluate the quality of the distorted signal by comparing it with the reference without considering the signal content type, while perceptual quality metrics take the signal properties into consideration together with the characteristics of the HVS (for image and video content) or HAS (for audio signals). Signal fidelity metrics include traditional objective quality assessment methods such as MAE (Mean Absolute Error), MSE (Mean Square Error), SNR (Signal-to-Noise Ratio), PSNR (Peak SNR), or one of their relatives [6]. These traditional objective metrics are widely accepted in the research community for several reasons: they are well defined, and their formulas are simple and easy to understand and implement. From a mathematical point of view, minimizing MSE is also well understood.

Although signal fidelity metrics are widely used to measure the quality of signals, they generally are poor predictors of perceived quality with non-additive noise distortions [7, 8]. They only have an approximate relationship with the quality perceived by human observers, since they are mainly based on byte-by-byte comparison without considering what each byte represents [3, 4]. Signal fidelity metrics essentially ignore the spatial and temporal relationship in the content. It is well accepted that signal fidelity metrics do not align well with human perceptions of multimedia content for the following reasons [2, 3, 6, 9–11]:

  1. Not every change in multimedia content is noticeable.
  2. Not every region in multimedia content receives the same attention level.
  3. Not every change yields the same extent of perceptual effect with the same magnitude of change.

To overcome the problems of signal fidelity metrics, a significant amount of effort has been spent trying to design more logical, economical, and user-oriented perceptual quality metrics [3, 5, 11–18]. In spite of the recent progress in related fields, objective evaluation of signal quality in line with human perceptions is still a long and difficult odyssey [3, 5, 12–16] due to the complex multidisciplinary nature of the problem (related to physiology, psychology, vision research, audio/speech research, and computer science), the limited understanding of human perceptions, and the diverse scope of applications and requirements. However, with proper modeling of major underlying physiological and psychological phenomena, it is now possible to develop better-quality metrics to replace signal fidelity metrics, starting with various specific practical situations.

This chapter is organized as follows. Section 3.2 provides an introduction to the quality metric taxonomy. In Section 3.3, the basic computational modules for perceptual quality metrics are given. Sections 3.4 and 3.5 introduce the existing quality metrics for images and video, respectively. Quality metrics for audio/speech are described in detail in Section 3.6. Section 3.7 presents joint audiovisual quality metrics. The final section concludes.

3.2 Quality Metric Taxonomy

Existing quality metrics can be classified according to different criteria, as depicted in Fig. 3.1. There are basically two categories of perceptual metrics [5], relying on a perception-based approach or a signal-driven approach. For the first category [19–22], objective metrics are built upon relevant psychophysical properties and physiological knowledge of the HVS or HAS, while the signal-driven approach evaluates the quality of the signal from the aspect of signal extraction and analysis. Among the psychophysical properties and physiological knowledge used in perception-based approaches, the Contrast Sensitivity Function (CSF) models the HVS's sensitivity to signal contrast as a function of spatial frequency and temporal motion velocity, and exhibits a parabola-like curve over increasing spatial and temporal frequencies; luminance adaptation refers to the just-noticeable luminance contrast as a function of background luminance; visual masking is the increase in the HVS contrast threshold for visual content in the presence of other content, and can be divided into intra-channel masking caused by the visual content itself and inter-channel masking caused by visual content of different frequencies and orientations [1, 2, 4, 5]. For audio/speech signals, the commonly used psychophysical properties and physiological knowledge of the HAS include the effects of the outer and middle ear, simultaneous masking, forward and backward temporal masking, etc. [3].


Figure 3.1 The quality metric taxonomy

Since the perception-based approaches involve high computational complexity and there are difficulties in bridging the gap between vision research and the requirements of engineering modeling, more recent research efforts have been directed at signal-driven perceptual quality metrics. Compared with perception-based approaches, signal-driven ones do not need to model human perception characteristics. Instead, signal-driven approaches attempt to evaluate quality from aspects of signal extraction and analysis, such as statistical features [23], structural similarity [24], luminance/color distortion [25], and common artifacts [26, 27]. These metrics also consider the effects of human perception by content and distortion analysis, instead of fundamental bottom-up perception modeling.

Another classification of objective quality metrics is based on the availability of the original signal, which is considered to be distortion free or of perfect quality and may be used as a reference to evaluate the distorted signal. Based on the availability of the original signal, quality metrics can be divided into three categories [1, 2]: Full-Reference (FR) metrics, which require the processed signal and the complete reference signal [28–36]; Reduced-Reference (RR) metrics, which require the processed signal and only part of the reference signal [23, 37]; and No-Reference (NR) metrics, which require only the processed signal [38–41]. Traditional signal fidelity metrics for quality evaluation are FR metrics. Most perceptual quality metrics are also of the FR type, including most perception-driven quality metrics, many signal-driven visual quality metrics, and most perceptual audio/speech quality metrics. Generally, FR quality metrics can measure the quality of signals more accurately than RR or NR metrics, since they have more information available.

3.2.1 Full-Reference Quality Metrics

FR quality metrics evaluate the quality of the processed signal with respect to the reference signal. The traditional signal fidelity metrics – such as MSE, SNR, and PSNR – are early FR metrics. They have been the dominant quantitative performance metrics in the field of signal processing for decades. Although they exhibit poor accuracy when dealing with perceptual signals, they are still the standard criterion and widely used. Signal fidelity metrics in quality assessment try to provide a quantitative score that describes the level of error/distortion for the processed signal by comparing it with the reference signal. Suppose that $X = \{x_i \mid i = 1, 2, \ldots, N\}$ is a finite-length, discrete original signal (reference signal) and $\hat{X} = \{\hat{x}_i \mid i = 1, 2, \ldots, N\}$ is the corresponding distorted signal, where $N$ is the signal length, and $x_i$ and $\hat{x}_i$ are the values of the $i$th samples in $X$ and $\hat{X}$, respectively. The MSE between the distorted and the reference signals is calculated as

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( x_i - \hat{x}_i \right)^2$$

where MSE is used as the quality measurement of the distorted signal $\hat{X}$. A more general form of the distortion is the $l_p$ norm [11]:

$$d_p = \left( \sum_{i=1}^{N} \left| x_i - \hat{x}_i \right|^p \right)^{1/p}$$

The PSNR measure can be obtained from MSE as

$$\mathrm{PSNR} = 10 \log_{10} \frac{\mathrm{MAX}^2}{\mathrm{MSE}}$$

where MAX is the maximum possible signal intensity value.
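
For illustration, the three fidelity measures above can be written in a few lines of NumPy; this is a minimal sketch, with function and variable names of our own choosing.

```python
import numpy as np

def mse(x, x_hat):
    """Mean squared error between reference x and distorted x_hat."""
    x, x_hat = np.asarray(x, dtype=float), np.asarray(x_hat, dtype=float)
    return np.mean((x - x_hat) ** 2)

def lp_distortion(x, x_hat, p=2):
    """l_p-norm distortion; p = 2 corresponds to the squared-error case."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(x_hat, dtype=float))
    return np.sum(diff ** p) ** (1.0 / p)

def psnr(x, x_hat, max_val=255.0):
    """Peak signal-to-noise ratio in dB; max_val is the maximum possible intensity."""
    m = mse(x, x_hat)
    return np.inf if m == 0 else 10.0 * np.log10(max_val ** 2 / m)
```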

Aside from traditional signal fidelity metrics, many perceptual metrics are also FR metrics. As introduced previously, the perception-driven approach to quality assessment mainly measures the quality of images/video by modeling the characteristics of the HVS. In the simplest perception-driven approaches, the HVS is considered as a single spatial filter modeling the CSF [42–45]. Many more sophisticated perception-driven FR metrics try to incorporate local contrast, spatiotemporal CSF, contrast/activity masking, and other advanced HVS functions to build quality metrics for images/video [19–22, 46–50]. Compared with perception-driven approaches, signal-driven approaches to quality assessment are relatively less sophisticated and thus computationally inexpensive. Currently, there are many FR metrics among signal-driven approaches to visual quality assessment [35, 36, 51–54]. Similarly, most perceptual audio/speech quality metrics are FR [3, 55–76]. Early audio quality metrics were designed for low-bit-rate speech and audio codecs. Perceptual models were used to optimize distortion for minimum audibility rather than traditional signal fidelity, leading to an improvement in perceived quality [30]. Based on the characteristics of the HAS, various perceptual FR audio/speech quality metrics have been proposed [17, 63–76]. A detailed discussion of audio/speech quality metrics will be provided in Section 3.6.

Generally, FR metrics require the complete reference signal, usually in unimpaired and uncompressed form. This requirement is quite a heavy restriction for practical applications. Furthermore, FR metrics generally impose a precise alignment of the reference and distorted signals, so that each sample in the distorted signal can be matched with its corresponding reference sample. For video or audio signals, temporal registration in particular can be very difficult to achieve in practice due to the information loss, content repeats, or variable delays introduced by the system. Aside from the issue of spatial and temporal alignment, FR metrics usually do not respond well to global shifts in certain features (such as brightness, contrast, or color), and require a corresponding calibration of the signals. Therefore, FR metrics are most suitable for offline signal quality measurements such as codec tuning or lab testing.

3.2.2 Reduced-Reference Quality Metrics

RR quality metrics only require some information about the reference (e.g., in the form of a number of features extracted from the reference signal) for quality assessment tasks [77–108]. Normally, the more reference information is available to an RR metric, the more accurate the predictions it can make. In the extreme, when the rate is large enough to reconstruct the original signal, RR quality metrics converge to FR metrics.

Based on the underlying design philosophy, RR quality metrics can be classified into three approaches [77]: modeling the signal distortion, using characteristics or theories of the human perception system, and analysis of signal source. The last two types can be considered as general-purpose metrics, since the statistical and perceptual features are not related to any specific type of signal distortion.

The first type of RR quality metric, based on modeling signal distortion, is developed mainly for specific applications. Straightforward solutions can be provided by these methods when there is sufficient knowledge about the processing of the content. For signal distortion from standard image or video compression, RR quality metrics can target the typical distortion artifacts of blurring and blockiness to measure the quality of the related visual content [79, 81]. Various RR quality metrics have been proposed to measure the distortions occurring in standard compression systems such as MPEG-2-coded video [82], JPEG images [83], H.264/AVC-coded video [84], etc. The drawback of this kind of RR quality metric is its limited generalization capability, since such metrics are designed for specific kinds of signal distortion.

RR quality metrics based on the human perceptual system measure content quality by extracting perceptual features, where computational models of human perception may be employed. In [86], an RR quality metric is proposed that extracts perceptual features from JPEG and JPEG2000 images and achieves good evaluation performance. Many RR quality metrics are designed for video quality evaluation based on various characteristics of the HVS, such as color perception theory [87], the CSF [88–90], structural information perception [91], texture masking [92, 93], etc. Aside from features extracted in the spatial domain, there are also RR quality metrics built on features extracted with the contourlet transform [90], wavelet transform [94], Fourier transform [95], etc.

The third type of RR quality metric measures content quality based on models of the signal source. Since the reference signal is not available in a deterministic sense, these models are often based on statistical knowledge and capture certain statistical properties of natural scenes [77, 85]. Distortions disturb the statistical properties of natural scenes in unnatural ways, which can be measured by statistical models of natural scenes. Many RR quality metrics are designed based on differences between feature distributions of color [99], motion [100], etc. Metrics of this type can also be designed based on features in the transform domain, such as the divisive normalization transform [104], wavelet transform [105, 106], Discrete Cosine Transform (DCT) [102, 107, 108], etc.

RR approaches make it possible to avoid some of the assumptions and pitfalls of pure no-reference metrics while keeping the amount of reference information manageable. Similar to FR metrics, RR metrics also have alignment requirements. However, they are typically less stringent than full-reference metrics, as only the extracted features need to be aligned. Generally, RR metrics are better suited for monitoring in-service content at different points in the distribution system.

3.2.3 No-Reference Quality Metrics

Compared with FR and RR quality metrics, NR quality metrics do not require any reference information [109–156]. Thus, they are highly desirable in many practical applications where reference signals are not available.

A number of methods have been proposed to predict the MSE caused by certain specific compression schemes such as MPEG-2 [112, 114], JPEG [116], or H.264 [117, 118, 120]. These methods use information from the bit stream directly, except the study [112], which adopts the decoded pixel information. The DCT coefficients in these techniques are usually modeled by Laplacian, Gaussian, or Cauchy distributions. The main drawback of these techniques is that they measure the distortion for each 8×8 block without considering the differences from neighboring blocks [110]. Another problem is that their performance decreases at lower bit rates, because more coefficients are quantized to zero. Some NR visual quality metrics have tried to measure the MSE caused by packet loss errors [121, 124, 139], the difference between the processed signal and a smoothed version of it [126], the variation within smooth regions in the signal [129], etc.

Generally, NR quality metrics assume that the statistics of the processed signals are different from those of the original and extract features from the processed signals to evaluate model compliance [85, 110]. NR quality metrics can be designed based on features of various domains, such as the spatial domain [27, 132], Fourier domain [24], DCT domain [134, 135], or polynomial transform [136]. Additionally, many NR quality metrics are based on various features from visual content, such as sharpness [137], edge extent [138, 140], blurring [143, 144], phase coherence [145], ringing [146, 147], naturalness [148], or color [149]. In some NR quality metrics, blockiness, blurring, or ringing features are combined with other features, such as bit-stream features [150, 151], edge gradient [152], and so on. Compared with image content, temporal features have to be considered for quality assessment of video and audio signals. Various NR quality metrics for video content are proposed to measure flicker [153] or frame freezes [154, 156]. There are also some NR metrics proposed for speech quality evaluation [109, 113, 115, 119, 122, 123], which are mainly designed based on analysis of the audio spectrum.

NR metrics analyze the distorted signal without the need for an explicit reference signal. This makes them much more flexible than FR or RR metrics, as it can be difficult or impossible to get access to the reference in some cases (e.g., video coming out of a camera). They are also completely free from alignment issues. The main challenge of NR metrics lies in distinguishing distortions from content, a distinction humans are usually able to make from experience. NR metrics always have to make assumptions about the signal content and/or the distortions of interest. This comes with a risk of confusing the actual content with distortions (e.g., a chessboard could be mistaken for blocking artifacts under certain conditions). Additionally, most NR quality metrics are designed for specific and limited types of distortion. They can face difficulties in modern communication systems, where distortions may be a combination of compression, adaptation, network delay, packet loss, and various types of processing and filtering. NR metrics are well suited for monitoring in-service content at different points in the distribution system, since they do not require reference signals.

3.3 Basic Computational Modules for Perceptual Quality Metrics

Since traditional signal fidelity metrics simply compare the distorted signal with the reference one and have already been introduced in Section 3.2.1, here we only describe the basic computational modules for perceptual quality metrics. Generally, these include signal decomposition (e.g., decomposing an image or video into different color, spatial, and temporal channels), detection of common features (like contrast and motion) and artifacts (like blockiness and blurring), just-noticeable distortion (i.e., the maximum change in visual content that cannot be detected by the majority of viewers), Visual Attention (VA) (i.e., the HVS's selectivity in responding to the most interesting activities in the visual field), etc. First, many of these are based on related physiological and psychological knowledge. Second, most are independent research topics themselves, like just-noticeable distortion and VA modeling, and have other applications (image/video coding [157], watermarking [158], error resilience [159], computer graphics [160], to name just a few), in addition to perceptual quality metrics. Third, these modules can serve as simple perceptual quality metrics themselves in specific situations (e.g., blockiness and blurring).

3.3.1 Signal Decomposition

Most perception-driven quality metrics use signal decomposition for feature extraction. Signal feature extraction and common artifact detection are at the core of many signal-driven quality metrics; the perceptual effect of common artifacts far exceeds the extent of their representation in MSE or PSNR. Just-noticeable distortion and VA models have been used either independently or jointly to evaluate the visibility and perceived extent of visual content differences. Therefore, all these techniques help to address the three basic problems (as mentioned at the beginning of this chapter) to be overcome relative to traditional signal fidelity metrics, since they enable the differentiation of various content changes for perceptual quality-evaluation purposes.

For images and video, the process of signal decomposition refers to the decomposition of visual content into different channels (spatial, frequency, and temporal) for further processing. It is well known that the HVS has separate processing for achromatic and chromatic components, different pathways for visual content with different motion, and special cells in the visual cortex for distinctive orientations [161]. Existing psychophysical studies also show that visual content is processed differently in the HVS by frequency [162] and orientation [163, 164]. Thus, decomposing an image or video frame into different color, spatial, and temporal channels allows content changes to be evaluated with unequal treatment of each channel, emulating the HVS response; this addresses the third problem of traditional signal fidelity metrics mentioned at the beginning of this chapter.

Currently, there are various signal decomposition methods for color [165–168]. Two widely accepted color spaces in quality assessment are the opponent-color (black/white, red/green, blue/yellow) space [22, 166] based on physiological evidence of opponent cells in the parvocellular pathway and CIELAB space [167] based on human perceptions of color differences. With compressed visual content, YCbCr space is more convenient for feature extraction due to its wide use in image/video compression standards [53, 165, 168]. Other color spaces have also been used, such as YOZ [21]. In many metrics, only the luminance component of the visual content is used for efficiency [49, 169–171], since it is generally more important for human visual perception than chrominance components, especially in quality evaluation of compressed images (it is worthwhile pointing out that most coding decisions in current image/video compression algorithms are made based on luminance manipulation).

Temporal decomposition is implemented by a sustained (low-pass) filter and transient (band-pass) filters [46, 172] to simulate two different visual pathways. Based on the fact that receptive fields in the primary visual cortex resemble Gabor patterns [173] that can be characterized by a particular spatial frequency and orientation, many types of spatial filter can be used to decompose each temporal channel, including Gabor filters, cortex filters [174], wavelets, the Gaussian pyramid [175], and steerable pyramid filters [46, 176].

For audio/speech signals, signal decomposition is implemented based on the properties of the peripheral auditory system – such as the perception of loudness, frequency, masking, etc. [177]. In the Perceptual Evaluation of Audio Quality (PEAQ) ITU standard [178], two psychoacoustic models are adopted to transform the time-domain input signals into a basilar membrane representation for further processing: the FFT (Fast Fourier Transform)-based ear model and the filter-bank-based ear model [3]. In the FFT-based ear model, the input signal is first transformed into the frequency domain, and the amplitude of the FFT is used for further processing. The filtering effect of the outer and middle ear on the audio signal is then modeled for the frequency components based on Terhardt's approach [38]. After that, the frequency components are grouped into critical frequency bands as perceived by the HAS. An internal ear-noise model is used to obtain the pitch patterns for audio signals [3]. These pitch patterns are smeared out over frequency by a spreading function modeling simultaneous masking. Finally, the forward masking characteristics of temporal masking effects are approximated by a simple first-order low-pass filter. In the filter-bank-based ear model, the audio signal is processed in the time domain [3]. Compared with the FFT-based ear model, the filter-bank-based ear model adopts a finer time resolution, which makes the modeling of backward masking possible and thus maintains the fine temporal structure of the signal. PEAQ is mainly based on the model in [179]. First, the input audio signal is decomposed into band-pass signals by a filter bank of equally spaced critical bands. Similar to the FFT-based ear model, the effects of the outer and middle ear on audio signals are modeled. Then the characteristics of simultaneous masking, backward masking, internal ear noise, and forward masking are modeled for the final feature extraction for audio signals [3].
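
To make the processing chain above concrete, the following is a highly simplified sketch of an FFT-based ear-model front end for a single audio frame. The weighting curve, band edges, spreading kernel, and smoothing step are illustrative placeholders, not the values standardized in PEAQ.

```python
import numpy as np

def fft_ear_model_frame(frame, fs=48000, n_bands=40):
    """Very simplified sketch of the FFT-based ear-model steps described above."""
    frame = np.asarray(frame, dtype=float)
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)

    # 1. Outer/middle-ear weighting (placeholder bell curve peaking near 3 kHz).
    weight = np.exp(-((np.log10(np.maximum(freqs, 1.0)) - np.log10(3000.0)) ** 2))
    excitation = (spectrum * weight) ** 2

    # 2. Group FFT bins into critical-band-like bands on a log-frequency scale.
    edges = np.logspace(np.log10(80.0), np.log10(fs / 2.0), n_bands + 1)
    bands = np.array([excitation[(freqs >= lo) & (freqs < hi)].sum()
                      for lo, hi in zip(edges[:-1], edges[1:])])

    # 3. Spread energy across neighbouring bands (crude simultaneous masking).
    bands = np.convolve(bands, np.array([0.1, 0.8, 0.1]), mode="same")

    # 4. Forward masking would be modeled by smoothing these band energies over
    #    successive frames with a first-order low-pass filter.
    return bands
```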

Similar signal decomposition methods based on psychoacoustic models have been implemented in many studies [67, 75]. During the transformation, the input audio signals are decomposed into different band-pass signals by modeling various characteristics in the HAS, such as characteristics of effect of the outer and middle ear [38], simultaneous masking (frequency spreading) [180], forward and backward temporal masking effects [3, 122], etc. Following PEAQ, some other new psychoacoustic models have been proposed by incorporating recent findings into the design of perceptual audio quality metrics [17, 70].

3.3.2 Feature and Artifact Detection

The process of feature and artifact detection is common for visual quality evaluation in various scenarios. For example, meaningful visual information is conveyed by feature contrast such as luminance, color, orientation, texture, motion, etc. There is little or no information in a largely uniform image. The HVS perceives much more from signal contrast than from absolute signal strength, since there are specialized cells to process this information [181]. This is also the reason why contrast is central to CSF, luminance adaptation, contrast masking, visual attention, and so on.

For audio/speech quality metrics, various features are extracted for artifact detection in transform domains, such as modulation, loudness, excitation, and slow-gain variation features [3, 178]. For NR metrics of speech quality assessment, perceptual linear prediction coefficients are used for quality evaluation [115, 128]. In [155], vocal tract and unnaturalness features are extracted from speech signals for quality evaluation. In PEAQ, the cognitive model processes the parameters from the psychoacoustic model to obtain Model Output Variables (MOVs) and maps them to a single Overall Difference Grade (ODG) score [3]. The MOVs are extracted based on various parameters including loudness, amplitude modulation, adaptation, masking, etc., and they also model concepts such as linear distortion, bandwidth, modulation difference, noise loudness, etc. The MOVs are used as input to a neural network and mapped to a distortion index. Then the ODG is calculated from the distortion index to estimate the quality of the audio signal.

There are certain structural artifacts occurring in the prevalent signal compression and delivery process which result in annoying effects for the viewer. The common structural artifacts caused by coding include blockiness, blurring, edge damage, and ringing [171], whose perceptual effect is ignored in traditional signal fidelity metrics such as MSE and PSNR. In fact, even uncompressed images/video often include blurring artifacts due to the imperfect PSF (Point Spread Function) of the imaging system, out-of-focus capture, and object motion during signal acquisition [182]. In video quality evaluation, the effects of motion and jerkiness have been investigated [156, 183]. Similarly, coding distortions in audio/speech signals have been well investigated [30, 63–65]. Some studies have investigated the quality evaluation of noise-suppressed audio/speech [125].

Another type of quality metrics is designed specifically to measure the impact of network losses on perceptual quality. This development is the result of increasing multimedia service delivery over IP networks, such as Internet streaming or IPTV. Since information loss directly affects the encoded bit stream, such metrics are often designed based on parameters extracted from the transport stream and the bit stream with no or little decoding. This has the added advantage of much lower data rates and thus lower bandwidth and processing requirements compared with metrics looking at the fully decoded video/audio. Using such metrics, it is thus possible to measure the quality of many video/audio streams or channels in parallel. At the same time, these metrics have to be adapted to specific codecs and network protocols. Due to the different types of features and artifacts, so-called “hybrid” metrics use a combination of different features or approaches for quality assessment [5]. Some studies explore the joint impact of packet loss rate and MPEG-2 bit rate on video quality [184], the influence of bit-stream parameters (such as motion vector length or number of slice losses) on the visibility of packet losses in MPEG-2 and H.264 videos [139], the joint impact of the low-bit-rate codec and packet loss on audio/speech quality [185], etc.

In some quality metrics, multiple features or quality evaluation approaches are combined. In [152], several structural features such as blocking, blurring, edge-based image activity, gradient-based image activity, and intensity masking are linearly combined for quality evaluation. The feature weights are determined by a multi-objective optimization method [152]. In [169], the input visual scene is decomposed into predicted and disorderly portions for quality assessment based on an internal generative mechanism in the human brain. Structural similarity and PSNR metrics are adopted for quality evaluation in these two portions respectively, and the overall score is obtained by combining the two results with an adaptive nonlinear procedure [169]. In [186], phase congruency and gradient magnitude play two complementary roles in the quality assessment of images. After calculating the local quality map, the phase congruency is adopted again as a weighting function to derive the overall score [186]. The study in [187] presents a visual quality metric based on two strategies in the HVS: a detection-based strategy for high-quality images containing near-threshold distortions and an appearance-based strategy for low-quality images containing clearly supra-threshold distortions. Different measurement methods are designed for these two quality levels, and the overall quality evaluation score is obtained by combining the results adaptively [187]. In [70], the linear and nonlinear distortions in the perceptual transform (excitation) domain are combined linearly for speech quality evaluation. In [76], the audio quality evaluation results from spectral and spatial features are multiplied to obtain the overall quality of audio signals.

Recently, a new fusion method for different features or quality evaluation approaches has emerged based on machine learning techniques [111, 125, 188, 189]. In [111], machine learning is adopted for the feature pooling process in visual quality assessment to address the limitations of existing pooling methods such as linear combination. A similar pooling process by support vector regression is introduced for speech quality assessment in [125]. Rather than using machine learning techniques for feature pooling, a multi-method fusion quality metric is introduced based on the nonlinear combination of scores from existing methods with suitable weights from a training process in [189]. In some NR quality metrics, machine learning techniques are also adopted to learn the mapping from feature space to quality scores [190].
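
As a simple illustration of learning-based fusion, the sketch below trains a support vector regressor to map a vector of per-image metric scores to a subjective score. The data here are synthetic and the setup is generic; it is not the specific configuration used in [125] or [189].

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Hypothetical data: each row holds scores from three existing metrics for one
# image (e.g., PSNR, SSIM, VIF, rescaled to [0, 1]); y holds subjective MOS values.
X = rng.uniform(size=(200, 3))
y = 1.0 + 4.0 * (0.2 * X[:, 0] + 0.5 * X[:, 1] + 0.3 * X[:, 2]) \
    + 0.1 * rng.standard_normal(200)

# Train the fusion model on part of the data and predict MOS for the rest.
fusion = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X[:150], y[:150])
predicted_mos = fusion.predict(X[150:])
print(predicted_mos[:5])
```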

3.3.2.1 Contrast

In [191], band-pass-filtered and low-pass-filtered images are used to evaluate the local image contrast. Following this methodology, image contrast is calculated as the ratio of the combined analytic oriented filter response to the low-pass filtered image in the wavelet domain [192] or the ratio of high-pass response in the Haar wavelet space [193]. Luminance contrast is estimated as the ratio of the noticeable pixel change to the average luminance in a neighborhood [53]. The contrast can also be computed as a local difference between the reference video frame and the processed one with the Gaussian pyramid decomposition [47], or the comparison between DCT amplitudes and the amplitude of the DC coefficient of the corresponding block [21]. The k-means clustering algorithm can be used to group image blocks for the calculation of color and texture contrast [194], where the largest cluster is considered as the image background. The contrast is then computed as the Euclidean distance from the means of the corresponding background cluster. Motion contrast can be obtained by relative motion, which is represented by object motion against the background [194].
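
A common way to realize the band-pass/low-pass contrast ratio mentioned above is a difference-of-Gaussians formulation; the sketch below is illustrative, and the filter scales are arbitrary choices rather than those of any cited method.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast(img, sigma_fine=1.0, sigma_coarse=4.0, eps=1e-6):
    """Local contrast as |band-pass response| divided by the low-pass (local mean) image."""
    img = img.astype(float)
    fine = gaussian_filter(img, sigma_fine)       # retains mid/high spatial frequencies
    coarse = gaussian_filter(img, sigma_coarse)   # local mean luminance (low-pass)
    band_pass = fine - coarse                     # difference-of-Gaussians band-pass
    return np.abs(band_pass) / (coarse + eps)     # contrast relative to local luminance
```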

3.3.2.2 Blockiness

Blockiness is a prevailing degradation caused by the block-based DCT coding technique, especially under low-bit-rate conditions, due to the different quantization sizes used in neighboring blocks and the lack of consideration for inter-block correlation. Given an image I with width W and height H, which is divided into N × N blocks, the horizontal and vertical differences at block boundaries can be computed as [27]

$$B_h = \frac{1}{H\left(\lceil W/N \rceil - 1\right)} \sum_{i=1}^{H} \sum_{j=1}^{\lceil W/N \rceil - 1} \left| I(i, jN+1) - I(i, jN) \right|, \qquad B_v = \frac{1}{W\left(\lceil H/N \rceil - 1\right)} \sum_{j=1}^{W} \sum_{i=1}^{\lceil H/N \rceil - 1} \left| I(iN+1, j) - I(iN, j) \right|$$

The method in [27] works only for a regular block structure with a fixed, known block size. It does not work with modern video codecs (e.g., HEVC (High Efficiency Video Coding, H.265)) due to the variable block sizes used. Other, similar calculation methods for blockiness can be found in [25, 195]. During the blockiness calculation, object edges at block boundaries can be excluded [29]. Luminance adaptation and texture masking have recently been considered for blockiness evaluation [135]. Another method for gauging blockiness is based on harmonic analysis [196], which can be used when block boundary positions are unknown beforehand (e.g., with video being cropped, retaken by a camera, or coded with variable block sizes).
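
As an illustration of the boundary-difference idea, the sketch below compares the average luminance jump across assumed 8×8 block boundaries with the average jump inside blocks; it is a simplified variant, not the exact formulation of [27].

```python
import numpy as np

def blockiness(img, block=8):
    """Ratio of boundary-crossing luminance jumps to within-block jumps (grayscale image)."""
    img = img.astype(float)
    diff = np.abs(np.diff(img, axis=1))              # horizontal neighbour differences
    cols = np.arange(diff.shape[1])
    at_boundary = (cols % block) == (block - 1)      # differences straddling block borders
    boundary_jump = diff[:, at_boundary].mean()
    interior_jump = diff[:, ~at_boundary].mean()
    return boundary_jump / (interior_jump + 1e-6)    # values well above 1 suggest visible blockiness
```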

3.3.2.3 Blurring

Blurring can be evaluated effectively around edges in images/video frames, since it is most noticeable there; such detection is also efficient, because only a small fraction of image pixels lie on edges. With an available reference signal, the extent of blurring can be estimated via the contrast decrease on edges [53]. Various blind methods without a reference signal have been proposed for measuring blurring/sharpness, such as edge spread detection [26, 40], kurtosis [146], frequency domain analysis [143, 197], PSF estimation [198], the width/amplitude of lines and edges [39], and local contrast via 2D analytic filters [199].
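
The following sketch illustrates the edge-spread idea for blind blur estimation: for each strong vertical edge pixel, the width of the monotonic luminance transition around it is measured and averaged. The gradient threshold and search window are illustrative values.

```python
import numpy as np
from scipy.ndimage import sobel

def edge_width_blur(img, grad_thresh=30.0, max_half_width=10):
    """Average horizontal width of strong vertical edges; wider transitions imply more blur."""
    img = img.astype(float)
    grad = sobel(img, axis=1)                         # signed horizontal gradient
    rows, cols = np.where(np.abs(grad) > grad_thresh)
    widths = []
    for r, c in zip(rows, cols):
        s = np.sign(grad[r, c])                       # edge direction (rising or falling)
        left = c
        while left > 0 and c - left < max_half_width and s * (img[r, left] - img[r, left - 1]) > 0:
            left -= 1                                 # extend while the ramp continues leftwards
        right = c
        while right < img.shape[1] - 1 and right - c < max_half_width and s * (img[r, right + 1] - img[r, right]) > 0:
            right += 1                                # extend while the ramp continues rightwards
        widths.append(right - left)
    return float(np.mean(widths)) if widths else 0.0
```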

3.3.2.4 Motion and Jerkiness

For visual quality evaluation of coded video, the major temporal distortion is jerkiness, which is mainly caused by frame dropping [200] and is very annoying to viewers, who prefer continuous and smooth temporal transitions. For decoded video without availability of the coding parameters, frame freezes can be simply detected by frame differences [201]; when the frame rate is known, the jerkiness effect can be evaluated using the frame rate [156, 183] or, more comprehensively, both the frame rate and temporal activity such as motion [202]. In [203], inter-frame correlation analysis is used to estimate the location, number, and duration of lost frames. In [154, 200], lost frames are detected by inter-frame dissimilarity to measure fluidity; these studies conclude that, for the same level of frame loss, scattered fluidity breaks introduce less quality degradation than aggregated ones. The impact of the time interval between occurrences of significant visual artifacts has also been investigated [204].
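
A minimal sketch of frame-freeze detection by frame differences is given below: a frame is flagged as a repeat when the mean absolute difference to its predecessor falls below a small threshold, and consecutive repeats are grouped into freeze events. The threshold is illustrative and would need tuning to the content and noise level.

```python
import numpy as np

def detect_freezes(frames, mad_thresh=0.5):
    """Return (start_frame, length) pairs for runs of repeated (frozen) frames."""
    frames = np.asarray(frames, dtype=float)               # shape: (num_frames, H, W)
    mad = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))
    frozen = mad < mad_thresh                               # True where frame t+1 repeats frame t
    events, start = [], None
    for i, f in enumerate(frozen):
        if f and start is None:
            start = i + 1                                   # first repeated frame index
        elif not f and start is not None:
            events.append((start, i + 1 - start))
            start = None
    if start is not None:
        events.append((start, len(frozen) + 1 - start))
    return events
```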

3.3.3 Just-Noticeable Difference (JND) Modeling

As mentioned previously, not every signal change is noticeable. JND refers to a visibility or audibility threshold below which a change cannot be detected by the majority of viewers [201, 205–212]. Obviously, if a difference is below the JND value, it can be ignored in quality evaluation.

For images and video, DCT-based JND is the most investigated topic among all sub-band-based JND functions, since DCT has been used in all existing image/video compression standards such as JPEG, H.261/3/4, MPEG-1/2/4, and SVC. A general form of the DCT-sub-band luminance JND function is introduced in [52, 201, 207]. The widely used JND function developed by Ahumada and Peterson [211] for the base-line threshold fits spatial CSF curves with a parabola equation, which is a function of spatial frequencies and background luminance, and then compensates for the fact that the psychophysical experiments for determining CSF were conducted with a single signal at a time, and with spatial frequencies along just one direction. The luminance adaptation factor has been determined to represent the variation versus background luminance [205], to be more consistent with the findings of subjective viewing of digital images [206, 212]. The intra-band masking effect was investigated in [52, 208]. Inter-band masking effects can be assigned as low, medium, or high masking after classifying DCT blocks into smooth, edge, and texture ones [205, 207], according to energy distribution among sub-bands. For temporal CSF effects, the velocity perceived by the retina for an image block needs to be estimated [209]. A method for incorporating the effect of the velocity for temporal CSF in JND is given in [210].

The JND can also be defined in other frequency bands (e.g., Laplacian pyramid image decomposition [210], Discrete Wavelet Transform (DWT) [213]). In comparison with DCT-based JND, significantly more research is needed for DWT-based JND. DWT is a popular alternative transform, and more importantly, is similar to the HVS in its multiple sub-channel structure and frequency-varying resolution. Chrominance masking [214] still needs more convincing investigation for all sub-band domains.

There are situations where JND estimated from pixels is more convenient and efficient to use (e.g., motion search [168], video replenishment [215], filtering of motion-compensated residuals [165], and edge enhancement [48, 216]), since the operations are usually performed on pixels rather than sub-bands. For quality evaluation of images and video, pixel-domain JND models avoid unnecessary sub-band decomposition. Most pixel-based JND functions developed so far have used luminance adaptation and texture masking. A general pixel-based JND model can be found in [216]. The temporal effect was addressed in [217] by multiplying the spatial effect with an elevation parameter increasing with inter-frame changes. The major shortcoming of pixel-based JND modeling lies in the difficulty of incorporating CSF explicitly, except for the case with conversion from a sub-band domain [218].
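
The sketch below illustrates the general form of a pixel-domain JND map as the maximum of a luminance-adaptation term and a texture-masking term; the functional shapes and constants are illustrative rather than those of any specific published model.

```python
import numpy as np
from scipy.ndimage import uniform_filter, sobel

def pixel_jnd(img):
    """Illustrative pixel-domain JND map: max(luminance adaptation, texture masking)."""
    img = img.astype(float)
    bg = uniform_filter(img, size=5)                              # local background luminance
    activity = np.hypot(sobel(img, axis=0), sobel(img, axis=1))   # local gradient magnitude
    # Luminance adaptation: higher thresholds in dark areas, slowly rising in bright areas.
    lum = np.where(bg <= 127,
                   17.0 * (1.0 - np.sqrt(bg / 127.0)) + 3.0,
                   (bg - 127.0) * 3.0 / 128.0 + 3.0)
    # Texture masking: thresholds grow with local spatial activity.
    tex = 0.1 * activity + 1.0
    return np.maximum(lum, tex)
```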

For audio/speech signals, there are some types of JND based on different components of the signals – such as amplitude, frequency, etc. [219]. For the amplitude of audio/speech signals, the JND for the average listener is about 1 dB [130, 220], while the frequency JND for the average listener is approximately 1 Hz for frequencies below 500 Hz and about f/500 for frequencies f above 500 Hz [221]. The ability to discriminate audio/speech signals in the temporal dimension is another important characteristic of the acuity of the HAS. The duration of the gap between two successive audio/speech signals must be at least 4–6 ms in length to be detected correctly [222]. The ability of the HAS to detect changes over time in the amplitude and frequency of audio/speech signals is dependent on the rate of change and amount of change in amplitude and frequency [131].
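
These listener thresholds can be written directly as small helper functions; the sketch below simply encodes the approximate values quoted above.

```python
def frequency_jnd_hz(f):
    """Frequency JND for the average listener: about 1 Hz below 500 Hz, about f/500 above."""
    return 1.0 if f < 500.0 else f / 500.0

def amplitude_jnd_db():
    """Amplitude JND for the average listener: roughly 1 dB."""
    return 1.0
```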

3.3.4 Attention Modeling

Not every part of a multimedia presentation receives the same attention (the second problem of signal fidelity metrics mentioned in the introduction). This is due to the fact that human perception selects a part of the signal for detailed analysis and then responds. VA refers to the selective awareness/responsiveness to visual stimuli [223], as a consequence of human evolution.

There are two types of cue that direct attention to a particular point in an image [224]: bottom-up cues that are determined by external stimuli, and top-down cues that are caused by a voluntary shift in attention (e.g., when the subject is given prior information/instruction to direct attention to a specific location/object). The VA process can be regarded as two stages [225]: in the pre-attention stage, all information is processed across the entire visual field; in the attention stage, the features may be bound together (feature integration [226], especially for a bottom-up process), or the dominant feature is selected [227] (for a top-down process).

Most existing computational VA models are bottom-up (i.e., based on contrast evaluation of various low-level features in images, in order to determine which locations stand out from their surroundings). As to the top-down (or task-oriented) attention, there is still a need for more focused research, although some initial work has been done [228, 229].

An influential bottom-up VA computational model was proposed by Itti et al. [230] for still images. An image is first low-pass filtered and down-sampled progressively from scale 0 (the original image size) to scale 8 (1:256 along each dimension). This is to facilitate the calculation of feature contrast, which is defined as

$$F(e, q) = \left| F(e) \ominus F_l(q) \right|$$

where F represents the map for one of the image features as follows: intensity, color, and orientation; e ∈ {2, 3, 4} and F(e) denote the feature map at scale e; q = e + δ, with δ ∈ {3, 4}, and Fl(q) is the interpolation to the finer scale e from the coarser scale q; ⊖ denotes the point-by-point (across-scale) difference. In essence, F(e, q) evaluates pixel-by-pixel contrast for a feature, since F(e) represents the local information, while Fl(q) approximates the surroundings.

With one intensity channel, two color channels, and four orientation channels (0°, 45°, 90°, 135°; detected by Gabor filters), 42 feature maps are computed: 6 for intensity, 12 for color, and 24 for orientation. After cross-scale combination and normalization, the winner-takes-all strategy identifies the most interesting location on the map. There are various other approaches for visual attention modeling [231–234].
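
A compact sketch of the center–surround contrast computation for the intensity feature is given below; it builds a Gaussian pyramid, up-samples the coarser (surround) scale to the finer (center) scale, and takes the absolute difference, yielding the six intensity maps for e ∈ {2, 3, 4} and δ ∈ {3, 4}. The filter parameters are illustrative choices.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def gaussian_pyramid(img, levels=9):
    """Scales 0..levels-1; each level is low-pass filtered and down-sampled by 2."""
    pyr = [img.astype(float)]
    for _ in range(levels - 1):
        pyr.append(gaussian_filter(pyr[-1], sigma=1.0)[::2, ::2])
    return pyr

def center_surround(pyr, e, delta):
    """|F(e) - interpolate(F(e + delta))|: center scale minus the interpolated surround."""
    center, surround = pyr[e], pyr[e + delta]
    factors = (center.shape[0] / surround.shape[0], center.shape[1] / surround.shape[1])
    surround_up = zoom(surround, factors, order=1)   # interpolate surround to the center scale
    return np.abs(center - surround_up)

def intensity_contrast_maps(img):
    """The six intensity contrast maps (intensity feature only) of the Itti et al. scheme."""
    pyr = gaussian_pyramid(img, levels=9)
    return [center_surround(pyr, e, d) for e in (2, 3, 4) for d in (3, 4)]
```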

The VA map along the temporal dimension (over multiple consecutive video frames) can also be estimated. In the scheme proposed in [194] for video, different features (such as color, texture, motion, human skin/face) were detected and integrated for the continuous (rather than the winner-takes-all) salience map. In [235], auditory attention was also considered and integrated with visual factors. This was done by evaluating sound loudness and its sudden change, and a Support Vector Machine (SVM) was employed to classify each audio segment into speech, music, silence, and other sounds; the ratio of speech/music to other sounds was measured for saliency detection. Recently, Fang et al. proposed saliency detection models for images and videos based on DCT coefficients (with motion vectors for video) in the compressed domain [233, 236]. These models can be combined with quality metrics obtained from DCT coefficients and motion vectors for visual quality evaluation. A detailed overview and discussion of visual attention models can be found in a recent survey paper [232].

Contrast sensitivity reaches its maximum at the fovea and decreases toward the peripheral retina. The JND model represents the visibility threshold when the attention is there. In other words, JND and VA account for the local and global responses of the HVS in appreciating an image, respectively. The overall visual sensitivity at a location in the image could be JND modulated by the VA map [194]. Alternatively, the overall visual sensitivity may be derived by modifying the JND at every location according to its eccentricity away from the foveal points, with the foveation model in [237].
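
One simple way to combine the two maps is to raise the effective JND threshold in regions of low saliency; the sketch below is an illustrative weighting with a free modulation parameter, not the specific combination used in [194] or [237].

```python
import numpy as np

def sensitivity_map(jnd_map, saliency_map, modulation=0.5):
    """Modulate local JND thresholds by a normalized visual-attention (saliency) map."""
    s = (saliency_map - saliency_map.min()) / (np.ptp(saliency_map) + 1e-12)
    return jnd_map * (1.0 + modulation * (1.0 - s))   # low saliency -> higher visibility threshold
```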

VA modeling is generally easier for video than still images. If an observer has enough time to view an image, many points of the image will be attended to eventually. The perception of video is different: every video frame is displayed to an observer for a very short time interval. Furthermore, camera and/or object motion may guide the viewer's eye movements and attention.

Compared with visual content, there is much less research on auditory attention modeling. Currently, there are several studies proposing auditory attention models for audio signals [238–240]. Motivated by the formation of auditory streams, the study [239] designs a conceptual framework of auditory attention, which is implemented as a computational model composed of a network of neural oscillators. Inspired by the successful visual saliency detection model proposed by Itti et al. [230], some other auditory attention models are proposed using feature contrast – such as frequency contrast, temporal contrast, etc. [238, 240].

3.4 Quality Metrics for Images

The early image quality metrics are traditional signal fidelity metrics, which include MAE, MSE, SNR, PSNR, etc. As discussed previously, these metrics cannot predict image distortions as they are perceived. To address the drawback of traditional signal fidelity metrics, various perceptual-based image quality metrics have been proposed in the past decades [7, 11, 12, 24, 33–36]. These are classified and introduced in the following.

3.4.1 2D Image Quality Metrics

3.4.1.1 FR Metrics

Early perceptual image quality metrics were developed based on simple and systematic modeling of relevant psychophysical or physiological properties. Mannos and Sakrison [10] proposed a visual fidelity measure based on CSF for images. Faugeras [42] introduced a simple model of human color vision based on experimental evidence for image evaluation. Another early FR and multichannel model is the Visible Differences Predictor (VDP) of Daly [19], where the HVS model accounts for sensitivity variations due to luminance adaptation, spatial CSF, and contrast masking. The cortex transform is performed for signal decomposition, and different orientations are distinguished. Most existing schemes in this category follow a similar methodology, with differences in the color space adopted, the type of spatiotemporal decomposition, or the error pooling methods. In the JNDmetrix model [20], the Gaussian pyramid [175] was used for decomposition, with luminance and chrominance components in the image. Liu et al. [48] proposed a JND model to measure the visual quality of images. The perceptual effect can be derived by considering inter-channel masking [46]. Other similar algorithms using CSF and visual masking are described in [25, 241].

Recently, various perceptual image quality metrics have been proposed using signal modeling or processing of visual signals under consideration, which incorporate certain specific knowledge (such as the specific distortion [21]). This approach is relatively less sophisticated and therefore less computationally expensive. In [53], the image distortion is measured by the DCT coefficient differences weighted by JND. Similarly, the well-cited SSIM (Structural SIMilarity) was proposed by Wang and Bovik based on the sensitivity of the HVS to image structure [36, 54, 242]. SSIM can be calculated as

$$Q = \frac{\sigma_{ab}}{\sigma_a \sigma_b} \cdot \frac{2\sigma_a \sigma_b}{\sigma_a^2 + \sigma_b^2} \cdot \frac{2\mu_a \mu_b}{\mu_a^2 + \mu_b^2} \qquad (3.7)$$

where a and b represent the original and test images; μa and μb are their corresponding means, σa and σb are the corresponding standard deviations; σab is the cross covariance. The three terms in equation (3.7) measure the loss of correlation, contrast distortion, and luminance distortion, respectively. The dynamic range of the SSIM value Q is [−1, 1], with the best value of 1 when a = b.
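
The sketch below computes the global form of the index in equation (3.7); practical SSIM implementations evaluate the same three terms in local windows with small stabilizing constants and average the resulting quality map.

```python
import numpy as np

def universal_quality_index(a, b):
    """Global correlation * contrast * luminance product over two grayscale images."""
    a, b = a.astype(float).ravel(), b.astype(float).ravel()
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov_ab = np.mean((a - mu_a) * (b - mu_b))
    correlation = cov_ab / (np.sqrt(var_a * var_b) + 1e-12)           # loss of correlation
    contrast = 2 * np.sqrt(var_a * var_b) / (var_a + var_b + 1e-12)   # contrast distortion
    luminance = 2 * mu_a * mu_b / (mu_a ** 2 + mu_b ** 2 + 1e-12)     # luminance distortion
    return correlation * contrast * luminance
```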

Studies show that SSIM bears a certain relationship with MSE and PSNR [55, 243]. In [243], PSNR and SSIM are compared by their analytical formulas. The analysis shows that there is a simple logarithmic link between them for several common degradations, including Gaussian blur, additive Gaussian noise, JPEG and JPEG2000 compression [243]. PSNR and SSIM can be considered as closely related quality metrics, with differences in the degree of sensitivity to some image degradations.

Another method for feature detection with consideration of structural information is Singular Value Decomposition (SVD) [111]. With more theoretical background, the Visual Information Fidelity (VIF) [35] (an extension of the study [34]) is proposed based on the assumption that the Random Field (RF) from a sub-band of the test image, D, can be expressed as

$$D = G \cdot U + V$$

where U denotes the RF from the corresponding sub-band of the reference image, G is a deterministic scale gain field, and V is a stationary additive zero-mean Gaussian noise RF. The proposed model takes into account additive noise and blur distortion; it is argued that most distortion types prevalent in real-world systems can be roughly described locally by a combination of these two. The resultant metric measures the amount of information that can be extracted about the reference image from the test. In other words, the amount of information lost from a reference image as a result of distortion gives the loss of visual quality.

Another image quality metric with good theoretical foundations is the Visual Signal-to-Noise Ratio (VSNR) [33], which operates in two stages. In the first stage, the contrast threshold for distortion detection in the presence of the image is computed via wavelet-based models of visual masking and visual summation, in order to determine whether the distortion in the test image is visible. If the distortion is below the threshold of detection, the test image is deemed to be of perfect visual fidelity (VSNR = ∞). If the distortion is above the threshold, a second stage is applied, which operates based on the property of perceived contrast, and the mid-level visual property of global precedence. These two properties are modeled as Euclidean distances in distortion-contrast space of a multiscale wavelet decomposition, and VSNR is computed based on a simple linear sum of these distances.

Larson and Chandler [187] proposed a perceptual image quality metric called the "most apparent distortion" based on two separate strategies. Local luminance and contrast masking are adopted to estimate detection-based perceived distortion in high-quality images, while changes in the local statistics of spatial-frequency components are used to estimate appearance-based perceived distortion in low-quality images [187]. Recently, some new image quality metrics have been proposed using new concepts or methods [111, 169, 188, 189, 244]. Wu et al. [169] adopted the concept of Internal Generative Mechanism (IGM) theory to divide an image into a predicted portion and a disorderly portion, which are measured by structural similarity and PSNR, respectively. Liu et al. [244] used gradient similarity to measure the change in contrast and structure. A recent emerging approach to image quality metrics is based on machine learning techniques [111, 188, 189].

3.4.1.2 RR Metrics

Some RR metrics for images are designed based on the properties of the HVS. In [88], several factors of the HVS – including CSF, psychophysical sub-band decomposition, and masking effect modeling – are adopted to design an RR quality metric for images. The study in [91] proposes an RR quality metric for wireless imaging based on the observation that HVS is trained to extract structural information from the viewing area. In [94], an RR quality metric is designed based on the wavelet transform, which is used for extracting features to simulate the psychological mechanisms of HVS. The study in [95] adopts the phase and magnitude of the 2D discrete Fourier transform to build an RR quality metric, which is motivated by the fact that the sensitivity of the HVS is frequency dependent. Recently, Zhai et al. [245] used the free-energy principle from cognitive processing to develop a psychovisual RR quality metric for images.

In RR metrics for images, various features can be extracted to measure the visual quality. The image distortion of some RR metrics is calculated based on features extracted from the spatial domain – such as color correlograms [99], image statistics in the gradient domain [101], structural information [77, 86], etc. Other RR metrics are proposed using features extracted in the transform domain – such as wavelet coefficients [98, 105, 106], coefficients from divisive normalization transform [104], DCT coefficients [102, 108], etc.

Some RR metrics are designed for specific distortion types. A hybrid image quality metric is designed by fusing several existing techniques to measure five specific artifacts in [79]. Other RR metrics are proposed to measure the distortion from JPEG compression [83], distributed source coding [96], etc.

3.4.1.3 NR Metrics

NR quality metrics for images are proposed based on various features or specific distortion types. Many studies compute edge-extent features of images to build their NR quality metrics [136, 138, 140]. The natural scene statistics of DCT coefficients are used to measure the visual quality of images in [116]; in that metric, a Laplace probability density function is adopted to model the distribution of DCT coefficients. DCT coefficients are also used to measure blur artifacts [143], blockiness artifacts [134, 135], and so on. Similarly, some NR quality metrics for images use features extracted by the Fourier transform to measure different types of artifact, such as blur [145], blockiness [24], etc. Besides, NR quality metrics can be designed based on other features, such as sharpness [137], ringing [146], naturalness [148], and color [149]. In some NR quality metrics, blockiness, blurring, or ringing features are combined with other features, such as bit-stream features [151], edge gradient [152], and so on. Noise-estimation-based NR image quality metrics calculate MSE from the difference between the processed signal and a smoothed version of it [126], or from the variation within smooth regions in visual signals [129].

3.4.2 3D Image Quality Metrics

Compared with visual quality metrics for 2D images, quality metrics for 3D images have to consider additionally the depth perception. The HVS uses a multitude of depth cues, which can be classified into oculomotor cues coming from the eye muscles, and visual cues from the scene content itself [163, 246, 247]. The oculomotor cues include the factors of accommodation and vergence [246]. Accommodation refers to the variation of the lens shape and thickness, which allows the eyes to focus on an object at a certain distance, while vergence refers to the muscular rotation of the eyeballs, which is used to converge both eyes on the same object. There are two types of visual cue, namely monocular and binocular [246]. Monocular visual cues include relative size, familiar size, texture gradients, perspective, occlusion, atmospheric blur, lighting, shading, and shadows, motion parallax, etc. The most important binocular visual cue is the retinal disparity between points of the same objects viewed from slightly different angles by the eyes, which is used in stereoscopic 3D systems such as 3DTV.

Although 3D image quality evaluation is a challenging problem due to the complexities of depth perception, a number of 3D image quality metrics have been proposed [163]. Most existing 3D image quality metrics evaluate the distortion of 3D images by combining the evaluation results of a 2D image pair and additional factors – such as depth perception, visual comfort, and other visual experiences. In [248], 2D image quality metrics are combined with disparity information to predict the visual quality of 3D compressed images with blurring distortion. Similarly, [249] integrates the disparity information with 2D quality metrics for quality evaluation for 3D images. In [250], a 3D image quality metric is designed based on absolute disparity information. Existing studies also explore the visual quality assessment for 3D images based on characteristics of the HVS – such as contrast sensitivity [251], viewing experience [252], binocular visual characteristics [253], etc.

Furthermore, there are several NR metrics proposed for 3D image quality evaluation. In [254], an NR 3D image quality metric is built for JPEG-coded stereoscopic images based on segmented local features of artifacts and disparity. Another NR quality metric for 3D image quality assessment is based on the nonlinear additive model, ocular dominance model, and saliency-based parallax compensation [141].

3.5 Quality Metrics for Video

The history and development of video quality metrics shares many similarities with image quality metrics, with the additional consideration of temporal effects.

3.5.1 2D Video Quality Metrics

3.5.1.1 FR Metrics

As stated previously, there are two types of perceptual visual quality metrics: vision-based and signal-driven [2, 255–269]. Vision-based approaches include, for example, [22, 255], where HVS-based visual quality metrics for coded video sequences are proposed based on contrast sensitivity and contrast masking. The study in [47] proposes a JND-based metric to measure the visual quality of video sequences. In [263], a MOtion-based Video Integrity Evaluation (MOVIE) metric is proposed for video quality evaluation based on characteristics of the Middle Temporal (MT) visual area of the human visual cortex. In [266], an FR quality metric is proposed to improve on the video evaluation performance of the quality metrics in [264, 265]. Vision-based approaches typically measure the distortion of the processed video signal in the spatial domain [49], DCT domain [21], or wavelet domain [50, 262].

Signal-driven video quality metrics are based primarily on the analysis of specific features or artifacts in video sequences. In [256], Wang et al. propose a video structural similarity index based on SSIM to predict the visual quality of video sequences. In that study, an SSIM-based video quality metric evaluates the visual quality of video sequences at three levels: the local region level, the frame level, and the sequence level. A similar metric for visual quality evaluation is proposed in [257]. In [258], another SSIM-based video quality metric is designed based on a statistical model of human visual speed perception described in [259]. In [260], an FR video quality metric is proposed based on singular value decomposition. The video quality metric software tools in [51] provide standardized methods to measure the perceived quality of video systems; with the general model in that study, the main distortion types covered include blurring, blockiness, jerky/unnatural motion, noise in the luminance and chrominance channels, and error blocks. In [261], a video quality metric is proposed based on the correlation between subjective (MOS) and objective (MSE) results. In [270], a low-complexity video quality metric is proposed based on temporal quality variations. Some existing metrics make use of both classes (vision-based and signal-driven). The metric proposed in [271] combines model-based and signal-driven methods based on the extent of blockiness in decoded video. A model-based metric was applied to blockiness-dominant areas in [29], with the help of a signal-driven measure.
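A minimal sketch of the frame-level part of such an SSIM-based video metric is shown below; it simply averages per-frame SSIM over the sequence and omits the local-region weighting and sequence-level pooling used in [256], so it should be read as an illustration rather than that metric.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def video_ssim(ref_frames, dist_frames):
    """Average per-frame SSIM over a sequence of grayscale frames (uint8 assumed).
    A simplification: no region-level weighting or motion-based frame weighting."""
    scores = [ssim(r, d, data_range=255) for r, d in zip(ref_frames, dist_frames)]
    return float(np.mean(scores))
```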

Recently, High-Definition Television (HDTV) video has become widespread, driven by user demand and the development of high-speed broadband network services. Compared with Standard-Definition Television (SDTV), HDTV content needs higher-resolution display screens. Although the viewing distance of HDTV systems is closer in terms of image height than that of SDTV systems, both systems offer approximately the same number of pixels per degree of viewing angle because of the higher spatial resolution of HDTV [267]. However, the larger total horizontal viewing angle of HDTV (about 30°) may influence quality decisions compared with that of SDTV (about 12°) [267]. Additionally, with high-resolution displays for HDTV, the eyes roam the picture to track specific objects and their motion, so visual distortion outside the immediate area of attention is perceived less than in SDTV [267]. With emerging HDTV applications, some objective quality metrics have been proposed to evaluate the visual quality of HDTV specifically. The study in [267] conducts experiments to assess whether the NTIA general video quality metric [51] can be used to measure the visual quality of HDTV video. In [268], spatiotemporal features are extracted from visual signals to estimate the perceived quality degradation caused by compression coding. In [269], an FR objective quality metric based on a fuzzy measure is proposed to evaluate coding distortion in HDTV content. Recently, some studies have investigated the visual quality of Ultra-High-Definition (UHD) video sequences by subjective experiments [272–274]. The study in [274] conducts subjective experiments to analyze the performance of popular objective quality metrics (PSNR, VSNR, SSIM, MS-SSIM, VIF, and VQM) on 4K UHD video sequences; the experimental results show the content-dependent nature of most of these metrics, with the exception of VIF [274]. The subjective experiments in [273] demonstrate that HEVC-encoded YUV420 4K-UHD video at a bit rate of 18 Mb/s has good visual quality for use in legacy DTV broadcasting systems with single-channel bandwidths of 6 MHz. The study in [272] presents a set of 15 4K UHD video sequences to meet the needs of visual quality assessment research.

3.5.1.2 RR Metrics

Video is a much more suitable application for RR metrics than images because of the streaming nature of the content and the much higher data rates involved. Typically, low-level spatiotemporal features from the original video are extracted as reference. Features from the reference video can then be compared with those from the processed video.

In the work performed by Wolf and Pinson [80], both spatial and temporal luminance gradients are computed to represent the contrast, motion, amount, and orientation of activity. Temporal gradients due to motion facilitate detecting and quantifying related impairments (e.g., jerkiness) using the time history of temporal features. The metric performs well in the VQEG FR-TV Phase II Test [275].
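The following sketch computes simple spatial- and temporal-activity features of the kind used in such gradient-based RR metrics; it assumes that the standard deviation of luminance gradients and of frame differences is an adequate stand-in, and it is not the model of [80].

```python
import numpy as np

def spatial_activity(frame):
    """Std. dev. of the spatial luminance gradient magnitude of one frame."""
    gy, gx = np.gradient(frame.astype(float))
    return float(np.std(np.hypot(gx, gy)))

def temporal_activity(prev_frame, frame):
    """Std. dev. of the frame difference, a simple temporal-gradient feature."""
    return float(np.std(frame.astype(float) - prev_frame.astype(float)))

# RR usage: these low-rate features are extracted from the reference video,
# transmitted as side information, and compared with the same features
# computed from the processed video at the receiver.
```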

Some RR metrics for video are based on specific features or properties of the HVS in the spatial domain. In [87], an RR quality metric for color video is designed based on a psychovisual color space derived from high-level human visual behavior. The RR video quality metric in [89] takes advantage of contrast sensitivity, while the study in [92] incorporates the texture-masking property of the HVS. In [93], an RR quality metric is proposed based on SVD and HVS features for wireless applications. An RR quality metric based on temporal motion smoothness is proposed in [100]. In [103], RR video quality metrics are proposed based on SSIM features.

Other RR metrics for video work directly in the encoded domain. In [81], blurring and blockiness from video compression are measured by a discriminative analysis of harmonic strength extracted from edge-detected images. In [80], RR quality metrics are proposed based on spatial and temporal features to measure the distortion occurring in standard video compression and communication systems. DCT coefficients are used to extract features for the perceptual quality evaluation of MPEG2-coded video in [82]. In [84], an RR quality metric is proposed based on multivariate data analysis to measure the artifacts of H.264/AVC video sequences. The RR quality metric in [107] also extracts features from DCT coefficients to measure the quality of distorted video. In [97], the differences between entropies of the wavelet coefficients of the reference and distorted video are calculated to measure the distortion of video signals.

3.5.1.3 NR Metrics

For NR video quality measurement, many studies build their metrics on a direct estimation of the MSE or PSNR degradation caused by specific block-based compression standards such as MPEG2, H.264, etc. [112, 114, 117, 118, 120]. In [112], the PSNR is calculated from the estimated quantization error introduced by compression. The study in [114] estimates PSNR based on DCT coefficients of MPEG2 video. The transform coefficients are modeled by different distributions, such as Gaussian [118], Laplacian [117], and Cauchy [120] models. Some NR quality metrics try to measure the MSE caused by packet-loss errors [121, 124]. Bit-stream-based approaches predict the quality of video from the compressed video stream with packet losses [121]. The NR quality metric in [124] is designed to detect packet loss under specific compression schemes, namely H.264 and Motion JPEG 2000. Noise-estimation-based NR quality metrics calculate the MSE from the variation within certain smooth regions of the visual signal [129]. Other NR quality metrics incorporate characteristics of the HVS to measure the quality of visual content [139].
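A crude sketch of the idea behind such PSNR-estimation metrics is given below: assuming the quantization error is uniformly distributed within each quantizer bin, the MSE is approximated as Q²/12 and converted to PSNR. This is a simplification of the coefficient-distribution models in [112, 117, 118, 120], not a reimplementation of any of them.

```python
import numpy as np

def estimate_psnr_from_qstep(qstep, peak=255.0):
    """NR PSNR estimate from the quantizer step size, assuming the quantization
    error is uniform within each bin (MSE ~ qstep**2 / 12)."""
    mse = qstep ** 2 / 12.0
    return 10.0 * np.log10(peak ** 2 / mse)

print(f"{estimate_psnr_from_qstep(10.0):.1f} dB")   # about 38.9 dB for a step of 10
```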

Besides direct estimation of the MSE, some feature-based NR video quality metrics have been proposed. The features in NR quality metrics for video signals can be extracted from different domains. In [27, 132], blockiness artifacts are measured based on features extracted in the spatial domain, while [134] evaluates video quality based on features in the DCT domain. Additionally, various types of feature are used to calculate the distortion of video signals. In [152], edge features are extracted from visual signals to build NR quality metrics. In the NR quality metrics of [143, 144], blurring features are extracted from DCT coefficients. Some studies propose NR video quality metrics that combine blockiness, blurring, and ringing features [150, 151]. Besides, some NR quality metrics for video are designed specifically to measure flicker [153] or frame freezes [154, 156]. In [281], an NR quality metric for HDTV is proposed to evaluate blockiness and blur distortions.

3.5.2 3D Video Quality Metrics

Recently, some studies have investigated quality metrics for the emerging applications of 3D video processing. The experimental results of the studies in [276, 277] show that 2D quality metrics can be used to evaluate the quality of 3D video content. The study in [190] discusses the importance of visual attention in 3DTV quality assessment. In [278], an FR stereo-video quality metric is proposed based on a monoscopic quality component and a stereoscopic quality component. A 3D video quality metric based on the spatiotemporal structural information extracted from adjacent frames is proposed in [279]. Some studies also use characteristics of the HVS, including the CSF, visual masking, and depth perception, to build perceptual 3D video quality metrics [28, 280]. Besides FR quality metrics, RR and NR quality metrics for 3D video quality evaluation have also been investigated in [78] and [133], respectively. However, 3D video quality measurement is still an open research area because of the complexities of depth perception [246, 282].

3.6 Quality Metrics for Audio/Speech

Just as for images and video, traditional objective signal measures used for audio/speech quality assessment are built on basic mathematical measurements such as SNR, MSE, etc. They do not take the psychoacoustic properties of the HAS into consideration and thus cannot match the performance of perceptual audio/speech quality assessment methods. Additionally, the shortcomings of traditional objective signal measures are especially evident for non-linear and non-stationary audio/speech codecs [3]. To overcome these drawbacks, various perceptual objective quality evaluation algorithms have been proposed based on characteristics of the HAS, such as the perception of loudness, frequency, and masking [3, 62, 177]. The amplitude of an audio signal corresponds to the magnitude of the air pressure variation in the sound wave; loudness is the listener's perception of this sound pressure level and is related to the signal amplitude. The frequency of audio signals is measured in cycles per second (Hz), and humans can perceive audio signals with frequencies in the range of about 20 Hz to 20 kHz. Generally, the sensitivity of the HAS is frequency dependent. Auditory masking happens when the perception of one audio signal is affected by another; in the frequency domain, masking is known as simultaneous masking, while in the time domain it is known as temporal masking.
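For reference, the sketch below implements segmental SNR, a typical example of the purely waveform-based measures mentioned above; the frame length and the clipping range are common but arbitrary choices, and the measure deliberately ignores loudness, frequency sensitivity, and masking.

```python
import numpy as np

def segmental_snr(ref, deg, frame_len=256, eps=1e-10):
    """Segmental SNR in dB between a reference and a degraded signal:
    a waveform-based measure that ignores psychoacoustic properties of the HAS."""
    n = min(len(ref), len(deg)) // frame_len * frame_len
    ref = np.asarray(ref[:n], dtype=float).reshape(-1, frame_len)
    deg = np.asarray(deg[:n], dtype=float).reshape(-1, frame_len)
    noise = ref - deg
    snr = 10.0 * np.log10((np.sum(ref ** 2, axis=1) + eps) /
                          (np.sum(noise ** 2, axis=1) + eps))
    return float(np.mean(np.clip(snr, -10.0, 35.0)))   # common clipping range
```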

Currently, most existing FR models (also called intrusive models) for audio/speech signals adopt perceptual models to transform both the reference and distorted signals for feature extraction [63–66, 71, 178]. The quality of the distorted signal is estimated from the distance between features of the reference and distorted signals in the transform domain. NR models (also called non-intrusive models) estimate the quality of distorted speech signals without reference signals. Currently, there is no NR model for general audio signals. Existing NR models for speech signals compute the distortion based on signal production models, signal likelihood, perceptual properties such as noise loudness, etc. [62, 155, 283].

We are not aware of any RR metrics for audio/speech quality assessment in the literature.

3.6.1 FR Metrics

Studies of FR audio/speech quality metrics began with the requirements of low-bit-rate speech and audio codecs [62]. Since the 1970s, many studies have adopted perception-based models in speech/audio codecs to shape coding distortion for minimum audibility, rather than minimum MSE, in order to improve perceived quality [30]. In [63], a noise-to-mask ratio measure was designed based on a perceptual masking model, comparing the level of the coding noise with that of the reference signal. Other similar waveform-difference measures include the work in [64, 65]. The problem with these methods is that the estimated quality can be unreliable for distorted signals whose waveform changes substantially, since such changes result in large waveform differences [62]. To overcome this problem, researchers have tried to extract signal features in a transform domain intended to be consistent with a hypothetical representation of the signal in the peripheral auditory system or brain. One successful approach is the auditory spectrum distance model [66], which is widely used in ITU standards [65, 71, 178]. In that model [66], features of peripheral hearing in the time and frequency domains are extracted for quality evaluation based on psychoacoustic theory. The study in [67] adopts a model of the HAS to calculate an internal representation of the audio signal in the psychophysical domain for quality evaluation. In these models, the time signals are first mapped into the time-frequency domain, and then smeared and compressed to obtain two time-frequency loudness density functions [62]. These density functions are passed to a cognitive model that interprets their differences, possibly with substantial additional processing [62]. Generally, the cognitive model is trained on a large training database and should be validated on separate test data. These perceptual quality metrics show promising prediction performance for many aspects of psychoacoustic data thanks to their use of psychoacoustic theory [66, 67]. Other studies try to improve the performance of existing metrics by using more detailed or advanced HAS models [17, 68–70, 73].

For audio quality assessment, the study in [284] calculates the probability of detecting coding noise as a function of time for coded audio signals. The study in [55] develops a model of the human ear for the perceptual coding of audio signals. A frequency-response equalization process is used in [179] for the quality assessment of audio signals. The study in [56] proposes an advanced quality metric based on a wide range of perceptual transformations. Some studies have tried to predict the perceived quality of audio signals based on the estimation of the frontal and surround spatial fidelity of multichannel audio [76], new distortion parameters and a cognitive model [57], and a multichannel expert system [58]. More recently, several perceptual objective metrics for audio signals have been proposed using an energy-equalization approach [59, 60] and a mean structural similarity measure [18].

Speech quality assessment has an even longer history, with many metrics [61, 64, 66, 68, 69, 73–75, 285]. One early perceptual speech quality metric was proposed by Karjalainen based on features of peripheral hearing in time and frequency known from psychoacoustic theory [66]. Later, a simple approach known as the Perceptual Speech Quality Measure (PSQM) was adopted for the standard ITU-T P.861 [65]. Unlike earlier models, PSQM improved on silent-interval processing, giving less emphasis to noise in silent periods than during speech, and introduced asymmetry weighting [62]. The drawback of PSQM and other early models is that they are trained on subjective tests of generic speech codecs, and thus their performance is poor for some types of telephone network [62]. To address this problem, some objective metrics for speech signals were proposed for specific telephone network conditions [74, 285, 286]. Several more recent FR speech quality metrics have been proposed based on Bayesian modeling [31], an adaptive feedback canceller [32], etc.

3.6.2 NR Metrics

NR speech quality evaluation is more challenging due to the lack of reference signals. However, NR models are much more useful in practical applications such as wireless communications, voice over IP, and other in-service networks requiring speech quality monitoring, where the reference signal is unavailable. Currently, there is no NR quality metric for general audio signals, but several studies have proposed NR quality metrics for speech signals based on specific features.

Several NR speech quality metrics are designed for specific distortions introduced by standard codecs or specific transmission networks. An early NR speech quality evaluation metric is built on the spectrogram of the perceived signal for wireless communication [113]. The speech quality metric in [115] adopts Gaussian Mixture Models (GMMs) to create an artificial reference against which the degraded speech is compared, whereas in [119] speech quality is predicted by Bayesian inference and Minimum Mean Square Error (MMSE) estimation based on a trained set of GMMs. In [131], a perceptually motivated speech quality metric is presented based on a temporal envelope representation of speech. The study in [123] proposes a low-complexity NR speech quality metric based on features extracted from commonly used speech coding parameters (e.g., spectral dynamics). Features are extracted both globally and locally to design an NR speech quality metric in [109]. Machine learning techniques are adopted to predict the quality of distorted speech signals in [128].

Other studies have developed NR quality metrics that assess the quality of noise-suppressed speech signals. An NR speech quality metric based on Kullback–Leibler distances is proposed in [127] for noise-suppressed speech. In [125], an NR speech quality metric for noise-suppressed speech is built based on mel-filtered energies and support vector regression.

3.7 Joint Audiovisual Quality Metrics

Generally, we watch video with an accompanying soundtrack. Therefore, comprehensive audiovisual quality metrics are required to analyze both modalities of multimedia content together. Audiovisual quality comprises two factors: synchronization between the two media signals (i.e., lip-sync) and the interaction between audio and video quality [5, 44]. A considerable amount of research has addressed audio/video synchronization. In lip-sync experiments, viewers perceive audio and video signals to be in sync up to about 80 ms of delay [287]. There is a consistently higher tolerance for video ahead of audio than vice versa, probably because this is also the more natural occurrence in the real world, where light travels faster than sound. Similar results were reported in experiments with non-speech clips showing a drummer [288]. The interaction between audio and video signals is another factor influencing the overall quality assessment of multimedia content, as shown by studies from neuroscience [289]. In [289], Lipscomb claims that at least two implicit judgments are made during the perceptual processing of the video experience: an association judgment and a mapping of accent structures. The experimental results also indicate that, owing to the interaction between audio and video signals, the importance of synchronization decreases as the audiovisual content becomes more complex [289].

Since most existing audiovisual quality metrics combine separate audio and video quality evaluations, the study in [44] analyzes the mutual influence between audio quality, video quality, and audiovisual quality. Based on the experimental analysis, the study reaches several general conclusions. Firstly, both audio quality and video quality contribute to the overall audiovisual quality, and their product shows the highest correlation with the overall quality. Secondly, the overall quality is generally dominated by the video quality, whereas audio quality becomes more important than video quality when the bit rates of both coded audio and video are low, or when the video quality exceeds a certain threshold; as audio quality decreases, its influence on the overall quality increases. Additionally, in applications where audio is clearly more important than the video content (such as teleconferencing, news, and music videos), audio quality dominates the overall quality. Finally, audiovisual quality is also influenced by other factors, including motion information and the complexity of the video content [44].

In [290], subjective experiments were carried out on audio, video, and audiovisual quality, with results demonstrating that both audio and video quality contribute significantly to the perceived audiovisual quality. The study also shows that audiovisual quality can be predicted with high accuracy by a linear or bilinear combination of the audio and video quality scores. Consequently, many studies have adopted a linear combination of audio and video quality to evaluate the overall quality of audiovisual signals [291, 292].
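As an illustration of such a fusion, the sketch below fits a bilinear model AVQ ≈ a0 + a1·AQ + a2·VQ + a3·AQ·VQ by least squares; the functional form follows the combination discussed in [290], but the data and the resulting coefficients are entirely hypothetical.

```python
import numpy as np

def fit_av_fusion(audio_q, video_q, av_mos):
    """Least-squares fit of a bilinear audiovisual fusion model
    AVQ ~ a0 + a1*AQ + a2*VQ + a3*AQ*VQ."""
    X = np.column_stack([np.ones_like(audio_q), audio_q, video_q, audio_q * video_q])
    coeffs, *_ = np.linalg.lstsq(X, av_mos, rcond=None)
    return coeffs

# Usage with hypothetical per-clip audio, video, and audiovisual MOS values:
aq = np.array([4.1, 2.3, 3.5, 1.8])
vq = np.array([3.9, 3.0, 2.1, 1.5])
mos = np.array([4.0, 2.6, 2.5, 1.4])
print(fit_av_fusion(aq, vq, mos))
```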

Studies on audiovisual quality metrics have focused mainly on low-bit-rate applications such as mobile communications, where the audio stream can use up a significant part of the total bit rate [293, 294]. The audiovisual model proposed in [295] incorporates audio/video synchronization in addition to the fusion of audio and video quality. Some studies focus on audiovisual quality evaluation for videoconferencing applications [291, 292, 296]. The study in [297] presents a basic audiovisual quality metric based on subjective experiments on multimedia signals with simulated artifacts. Although the test data used in these studies differ considerably in content range and distortion types, these models achieve good prediction performance. In [142], an NR audiovisual quality metric with good prediction performance is proposed. The study in [298] presents a graph-based perceptual audiovisual quality metric based on the contributions of the individual modalities (audio and video) as well as the contribution of their relation. Some studies propose audiovisual quality metrics based on semantic analysis [299, 300].

Although some studies have investigated audiovisual quality metrics, progress on joint audiovisual quality assessment has been slow. The interaction between audio and video perception is complicated, and the perception of audiovisual content still lacks deep investigation. Many existing metrics are based on a linear fusion of audio and video quality, but most studies choose the fusion parameters empirically, without theoretical support and with little if any integration of the two modalities within the metric computation itself. Nevertheless, audiovisual quality assessment is worthy of further investigation due to its wide application in signal coding, signal transmission, and beyond.

3.8 Concluding Remarks

Currently, traditional signal fidelity metrics are still widely used to evaluate the quality of multimedia content. However, perceptual quality metrics have shown promise in quality assessment, and a large number of perceptual quality assessment metrics have been proposed for various types of content, as introduced in this chapter. During the past ten years, some perceptual quality metrics, such as SSIM, have gained popularity and have been used in various signal-processing applications. In the past, much effort focused on designing FR metrics for audio or video, and it remains difficult to obtain good evaluation performance with RR or NR quality metrics. However, effective NR metrics are highly desirable, with more and more multimedia content (such as image, video, or music files) being distributed over the Internet today. Widespread Internet transmission and new compression standards bring many new challenges for multimedia quality evaluation, such as new types of transmission loss and compression distortion. Additionally, various emerging applications of 3D systems and displays require new quality metrics; depth perception in particular should be investigated further for 3D quality evaluation. Other important quality evaluation topics include the quality assessment of super-resolution images/video and High Dynamic Range (HDR) images/video. All these emerging content types and their corresponding processing methods bring with them many challenges for multimedia quality evaluation.

References

  1. Chikkerur, S., Sundaram, V., Reisslein, M., and Karam, L.J., ‘Objective video quality assessment methods: A classification, review, and performance comparison.’ IEEE Transactions on Broadcasting, 57(2), 2011, 165–182.
  2. Lin, W. and Kuo, C.C.J., ‘Perceptual visual quality metrics: A survey.’ Journal of Visual Communication and Image Representation, 22(4), 2011, 297–312.
  3. Campbell, D., Jones, E., and Glavin, M., ‘Audio quality assessment techniques – a review and recent developments.’ Signal Processing, 89(8), 2009, 1489–1500.
  4. Winkler, S., Digital Video Quality – Vision Models and Metrics. John Wiley & Sons, Chichester, 2005.
  5. Winkler, S. and Mohandas, P., ‘The evolution of video quality measurement: From PSNR to hybrid metrics.’ IEEE Transactions on Broadcasting, 54(3), 2008, 660–668.
  6. Eskicioglu, A.M. and Fisher, P.S., ‘Image quality measures and their performance.’ IEEE Transactions on Communications, 43(12), 1995, 2959–2965.
  7. Karunasekera, S.A. and Kingsbury, N.G., ‘A distortion measure for blocking artifacts in images based on human visual sensitivity.’ IEEE Transactions on Image Processing, 4(6), 1995, 713–724.
  8. Limb, J.O., ‘Distortion criteria of the human viewer.’ IEEE Transactions on Systems, Man, and Cybernetics, 9(12), 1979, 778–793.
  9. Girod, B., ‘What's wrong with mean squared error?’ In Watson, A.B. (ed.), Digital Images and Human Vision. MIT Press, Boston, MA, 1993, pp. 207–220.
  10. Mannos, J. and Sakrison, D., ‘The effects of a visual fidelity criterion of the encoding of images.’ IEEE Transactions on Information Theory, 20(4), 1974, 525–536.
  11. Wang, Z. and Bovik, A.C., ‘Mean squared error: Love it or leave it? A new look at fidelity measures.’ IEEE Signal Processing Magazine, 26(1), 2009, 98–117.
  12. Eckert, M.P. and Bradley, A.P., ‘Perceptual quality metrics applied to still image compression.’ Signal Processing, 70, 1998, 177–200.
  13. Pappas, T.N. and Safranek, R.J., ‘Perceptual criteria for image quality evaluation.’ In Bovik, A.C. (ed.), Handbook of Image and Video Processing. Academic Press, New York, 2000, pp. 669–684.
  14. Video Quality Expert Group (VQEG), Final report from the video quality expert group on the validation of objective models of video quality assessment, March 2000. Available at: www.vqeg.org.
  15. Video Quality Expert Group (VQEG), Final report from the video quality expert group on the validation of objective models of video quality assessment, Phase II, August 2003. Available at: www.vqeg.org.
  16. Wang, Z., Bovik, A.C., and Lu, L., ‘Why is image quality assessment so difficult?’ IEEE International Conference on Acoustics, Speech, and Signal Processing, May 2002.
  17. Huber, R. and Kollmeier, B., ‘PEMO-Q – A new method for objective audio quality assessment using a model of auditory perception.’ IEEE Transactions on Audio, Speech and Language Processing, 14(6), 2006, 1902–1911.
  18. Kandadai, S., Hardin, J., and Creusere, C.D., ‘Audio quality assessment using the mean structural similarity measure.’ IEEE International Conference on Acoustics, Speech, and Signal Processing, April 2008.
  19. Daly, S., ‘The visible differences predictor: An algorithm for the assessment of image fidelity.’ In Watson, A.B. (ed.), Digital Images and Human Vision. MIT Press, Cambridge, MA, 1993, pp. 179–206.
  20. Lubin, J., ‘A visual discrimination model for imaging system design and evaluation.’ In Peli, E. (ed.), Vision Models for Target Detection and Recognition. World Scientific, Singapore, 1995, pp. 245–283.
  21. Watson, A.B., Hu, J., and McGowan, J.F., ‘DVQ: A digital video quality metric based on human vision.’ Journal of Electronic Imaging, 10(1), 2001, 20–29.
  22. Winkler, S., ‘A perceptual distortion metric for digital color video.’ Proceedings of SPIE no. 3644, 1999, pp. 175–184.
  23. Wolf, S., ‘Measuring the end-to-end performance of digital video systems.’ IEEE Transactions on Broadcasting, 43(3), 1997, 320–328.
  24. Wang, Z., Bovik, A.C., and Evan, B.L., ‘Blind measurement of blocking artifacts in images.’ IEEE International Conference on Image Processing, September 2002.
  25. Miyahara, M., Kotani, K., and Algazi, V.R., ‘Objective picture quality scale (PQS) for image coding.’ IEEE Transactions on Communications, 46(9), 1998, 1215–1225.
  26. Marziliano, P., Dufaux, F., Winkler, S., and Ebrahimi, T., ‘A no-reference perceptual blur metric.’ IEEE International Conference on Image Processing, September 2002.
  27. Wu, H.R. and Yuen, M., ‘A generalized block-edge impairment metric (GBIM) for video coding.’ IEEE Signal Processing Letters, 4(11), 1997, 317–320.
  28. Jin, L., Boev, A., Gotchev, A., and Egiazarian, K., ‘3D-DCT based perceptual quality assessment of stereo video.’ IEEE International Conference on Image Processing, September 2011.
  29. Yu, Z., Wu, H.R., Winkler, S., and Chen, T., ‘Vision-model-based impairment metric to evaluate blocking artifacts in digital video.’ Proceedings of the IEEE, 90, 2002, 154–169.
  30. Schroeder, M.R., Atal, B.S., and Hall, J.L., ‘Optimizing digital speech coders by exploiting masking properties of the human ear.’ Journal of the Acoustical Society of America, 66(6), 1979, 1647–1652.
  31. Chen, G. and Parsa, V., ‘Loudness pattern-based speech quality evaluation using Bayesian modeling and Markov chain Monte Carlo methods.’ Journal of the Acoustical Society of America, 121(2), 2007, EL77–EL83.
  32. Manders, A.J., Simpson, D.M., and Bell, S.L., ‘Objective prediction of the sound quality of music processed by an adaptive feedback canceller.’ IEEE Transactions on Audio, Speech, and Language Processing, 20(6), 2012, 1734–1745.
  33. Chandler, D.M. and Hemami, S.S., ‘VSNR: A wavelet-based visual signal-to-noise ratio for natural images.’ IEEE Transactions on Image Processing, 16(9), 2007, 2284–2298.
  34. Sheikh, H.R., Bovik, A.C., and de Veciana, G., ‘An information fidelity criterion for image quality assessment using natural scene statistics.’ IEEE Transactions on Image Processing, 14(12), 2005, 2117–2128.
  35. Sheikh, H.R. and Bovik, A.C., ‘Image information and visual quality.’ IEEE Transactions on Image Processing, 15(2), 2006, 430–444.
  36. Wang, Z., Bovik, A.C., Sheikh, H.R., and Simoncelli, E.P., ‘Image quality assessment: From error visibility to structural similarity.’ IEEE Transactions on Image Processing, 13(4), 2004, 600–612.
  37. Horita, Y., Miyata, T., Gunawan, I.P., Murai, T., and Ghanbari, M., ‘Evaluation model considering static-temporal quality degradation and human memory for SSCQE video quality.’ Proceedings of SPIE: Visual Communications and Image Processing, 5150(11), 2003, 1601–1611.
  38. Terhardt, E., ‘Calculating virtual pitch.’ Hearing Research, 1, 1979, 155–182.
  39. Dijk, J., van Grinkel, M., van Asselt, R.J., van Vliet, L.J., and Verbeek, P.W., ‘A new sharpness measure based on Gaussian lines and edges.’ Proceedings of the International Conference on Computational Analysis of Images and Patterns (CAIP). Lecture Notes in Computer Science, Vol. 2756. Springer-Verlag, Berlin, 2003, pp. 149–156.
  40. Ong, E., Lin, W., Lu, Z., Yao, S., Yang, X., and Jiang, L., ‘No reference JPEG-2000 image quality metric.’ Proceedings of IEEE International Conference Multimedia and Expo (ICME), 2003, pp. 545–548.
  41. Muijs, R. and Kirenko, I., ‘A no-reference block artifact measure for adaptive video processing.’ EUSIPCO 2005.
  42. Faugeras, O.D., ‘Digital color image processing within the framework of a human visual model.’ IEEE Transactions on Acoustics, Speech, and Signal Processing, 27, 1979, 380–393.
  43. Lukas, F., and Budrikis, Z., ‘Picture quality prediction based on a visual model.’ IEEE Transactions on Communications, 30, 1982, 1679–1692.
  44. You, J., Reiter, U., Hannuksela, M.M., Gabbouj, M., and Perkis, A., ‘Perceptual-based quality assessment for audio-visual services: A survey.’ Signal Processing: Image Communication, 25(7), 2010, 482–501.
  45. Tong, X., Heeger, D., and Lambrecht, C.V.D.B., ‘Video quality evaluation using STCIELAB.’ Proceedings of SPIE: Human Vision, Visual Processing and Digital Display, 3644, 1999, 185–196.
  46. Winkler, S., ‘Vision models and quality metrics for image processing applications.’ Swiss Federal Institute of Technology, Thesis 2313, December 2000, Lausanne, Switzerland.
  47. Sarnoff Corporation. ‘Sarnoff JND vision model.’ In Lubin, J. (ed.), Contribution to IEEE G-2.1.6 Compression and Processing Subcommittee, 1997.
  48. Liu, A., Lin, W., Paul, M., Deng, C., and Zhang, F., ‘Just noticeable difference for images with decomposition model for separating edge and textured regions.’ IEEE Transactions on Circuits and Systems for Video Technology, 20(11), 2010, 1648–1652.
  49. Ong, E., Lin, W., Lu, Z., Yao, S., and Etoh, M., ‘Visual distortion assessment with emphasis on spatially transitional regions.’ IEEE Transactions on Circuits and Systems for Video Technology, 14(4), 2004, 559–566.
  50. Masry, M.A., Hemami, S.S., and Sermadevi, Y., ‘A scalable wavelet-based video distortion metric and applications.’ IEEE Transactions on Circuits and Systems for Video Technology, 16(2), 2006, 260–273.
  51. Pinson, M.H. and Wolf, S., ‘A new standardized method for objectively measuring video quality.’ IEEE Transactions on Broadcasting, 50(3), 2004, 312–322.
  52. Watson, A.B., ‘DCTune: A technique for visual optimization of DCT quantization matrices for individual images.’ Society for Information Display Digest of Technical Papers, Vol. XXIV, 1993, pp. 946–949.
  53. Lin, W., Dong, L., and Xue, P., ‘Visual distortion gauge based on discrimination of noticeable contrast changes.’ IEEE Transactions on Circuits and Systems for Video Technology, 15(7), 2005, 900–909.
  54. Wang, Z., and Bovik, A.C., ‘A universal image quality index.’ IEEE Signal Processing Letters, 9(3), 2002, 81–84.
  55. Colomes, C., Lever, M., Rault, J.B., and Dehery, Y.F., ‘A perceptual model applied to audio bit-rate reduction.’ Journal of the Audio Engineering Society, 43(4), 1995, 233–240.
  56. Thiede, T., Treurniet, W.C., Bitto, R., et al., ‘PEAQ – The ITU standard for objective measurement of perceived audio quality.’ Journal of the Audio Engineering Society, 48(1/2), 2000, 3–29.
  57. Barbedo, J. and Lopes, A., ‘A new cognitive model for objective assessment of audio quality.’ Journal of the Audio Engineering Society, 53(1/2), 2005, 22–31.
  58. Zielinski, S., Rumsey, F., Kassier, R., and Bech, S., ‘Development and initial validation of a multichannel audio quality expert system.’ Journal of the Audio Engineering Society, 53(1/2), 2005, 4–21.
  59. Vanam, R., and Creusere, C., ‘Evaluating low bitrate scalable audio quality using advanced version of PEAQ and energy equalization approach.’ Proceedings of IEEE ICASSP, Vol. 3, 2005, pp. 189–192.
  60. Creusere, C., Kallakuri, K., and Vanam, R., ‘An objective metric of human subjective audio quality optimized for a wide range of audio fidelities.’ IEEE Transactions on Audio, Speech, and Language Processing, 16(1), 2008, 129–136.
  61. Novorita, B., ‘Incorporation of temporal masking effects into Bark spectral distortion measure.’ Proceedings of IEEE ICASSP, Vol. 2, 1999, pp. 665–668.
  62. Rix, A.W., Beerends, J.G., Kim, D., Kroon, P., and Ghitza, O., ‘Objective assessment of speech and audio quality-technology and applications.’ IEEE Transactions on Audio, Speech, and Language Processing, 14(6), 2006, 1890–1901.
  63. Brandenburg, K., ‘Evaluation of quality for audio encoding at low bit rates.’ Proceedings of 82nd Audio Engineering Society Convention, 1987, preprint 2433.
  64. Quackenbush, S.R., Barnwell, T.P., and Clements, M.A., Objective Measures of Speech Quality. Prentice-Hall, Englewood Cliffs, NJ, 1988.
  65. ‘Objective quality measurement of telephone-band (300–3400 Hz) speech codecs.’ ITU-T P.861, 1998.
  66. Karjalainen, M., ‘A new auditory model for the evaluation of sound quality of audio system.’ Proceedings of IEEE ICASSP, 1985, pp. 608–611.
  67. Beerends, J.G. and Stemerdink, J.A., ‘A perceptual audio quality measure based on a psychoacoustic sound representation.’ Journal of the Audio Engineering Society, 40(12), 1992, 963–974.
  68. Hansen, M. and Kollmeier, B., ‘Using a quantitative psycho-acoustical signal representation for objective speech quality measurement.’ Proceedings of ICASSP, 1997, pp. 1387–1390.
  69. Hauenstein, M., ‘Application of Meddis’ inner hair-cell model to the prediction of subjective speech quality.' Proceedings of IEEE ICASSP, 1998, pp. 545–548.
  70. Moore, B.C.J., Tan, C.-T., Zacharov, N., and Mattila, V.-V., ‘Measuring and predicting the perceived quality of music and speech subjected to combined linear and nonlinear distortion.’ Journal of the Audio Engineering Society, 52(12), 2004, 1228–1244.
  71. ‘Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs.’ ITU-T P.862, 2001.
  72. Beerends, J.G. and Stemerdink, J.A., ‘The optimal time-frequency smearing and amplitude compression in measuring the quality of audio devices.’ Proceedings of 94th Audio Engineering Society Convention, 1993.
  73. Ghitza, O., ‘Auditory models and human performance in tasks related to speech coding and speech recognition.’ IEEE Transactions on Speech and Audio Processing, 2(1), 1994, 115–132.
  74. Rix, A.W. and Hollier, M.P., ‘The perceptual analysis measurement system for robust end-to-end speech quality assessment.’ Proceedings of IEEE ICASSP, Vol. 3, 2000, pp. 1515–1518.
  75. Beerends, J.G. and Stemerdink, J.A., ‘A perceptual speech quality measure based on a psychoacoustic sound representation.’ Journal of the Audio Engineering Society, 42(3), 1994, 115–123.
  76. George, S., Zielinski, S., and Rumsey, F., ‘Feature extraction for the prediction of multichannel spatial audio fidelity.’ IEEE Transactions on Audio, Speech, and Language Processing, 14(6), 2006, 1994–2005.
  77. Rehman, A., and Wang, Z., ‘Reduced-reference image quality assessment by structural similarity estimation.’ IEEE Transactions on Image Processing, 21(8), 2012, 3378–3389.
  78. Hewage, C.T.E.R. and Martini, M.G., ‘Reduced-reference quality assessment for 3D video compression and transmission.’ IEEE Transactions on Consumer Electronics, 57(3), 2011, 1185–1193.
  79. Kusuma, T.M. and Zepernick, H.-J., ‘A reduced-reference perceptual quality metric for in-service image quality assessment.’ Proceedings of 1st Workshop on Mobile Future and Symposium on Trends in Communications, October 2003, pp. 71–74.
  80. Wolf, S. and Pinson, M.H., ‘Spatio-temporal distortion metrics for in-service quality monitoring of any digital video system.’ Proceedings of SPIE, Vol. 3845, 1999, pp. 266–277.
  81. Gunawan, I. and Ghanbari, M., ‘Reduced-reference video quality assessment using discriminative local harmonic strength with motion consideration.’ IEEE Transactions on Circuits and Systems for Video Technology, 18(1), 2008, 71–83.
  82. Yang, S., ‘Reduced reference MPEG-2 picture quality measure based on ratio of DCT coefficients.’ Electronics Letters, 47(6), 2011, 382–383.
  83. Altous, S., Samee, M.K., and Gotze, J., ‘Reduced reference image quality assessment for JPEG distortion.’ ELMAR Proceedings, September 2011, pp. 97–100.
  84. Oelbaum, T. and Diepold, K., ‘Building a reduced reference video quality metric with very low overhead using multivariate data analysis.’ Journal of Systemics, Cybernetics, and Informatics, 6(5), 2008, 81–86.
  85. Wang, Z. and Bovik, A.C., ‘Reduced and no reference visual quality assessment – the natural scene statistic model approach.’ IEEE Signal Processing Magazine, Special Issue on Multimedia Quality Assessment, 29(6), 2011, 29–40.
  86. Carnec, M., Le Callet, P., and Barba, D., ‘Visual features for image quality assessment with reduced reference.’ Proceedings of IEEE International Conference on Image Processing, Vol. 1, September 2005, pp. 421–424.
  87. Le Callet, P., Viard-Gaudin, C., and Barba, D., ‘Continuous quality assessment of MPEG2 video with reduced reference.’ Proceedings of International Workshop on Video Processing Quality Metrics for Consumer Electronics, Scottsdale, AZ, January 2005.
  88. Carnec, M., Le Callet, P., and Barba, D., ‘Objective quality assessment of color images based on a generic perceptual reduced reference.’ Signal Processing: Image Communication, 23(4), 2008, 239–256.
  89. Amirshahi, S.A. and Larabi, M., ‘Spatial-temporal video quality metric based on an estimation of QoE.’ Third International Workshop on Quality of Multimedia Experience (QoMEX), September 2011, pp. 84–89.
  90. Tao, D., Li, X., Lu, W., and Gao, X., ‘Reduced-reference IQA in contourlet domain.’ IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 39(6), 2009, 1623–1627.
  91. Engelke, U., Kusuma, M., Zepernick, H.-J., and Caldera, M., ‘Reduced-reference metric design for objective perceptual quality assessment in wireless imaging.’ Signal Processing: Image Communication, 24(7), 2009, 525–547.
  92. Ma, L., Li, S., and Ngan, K.N., ‘Reduced-reference video quality assessment of compressed video sequences.’ IEEE Transactions on Circuits and Systems for Video Technology, 22(10), 2012, 1441–1456.
  93. Yuan, F. and Cheng, E., ‘Reduced-reference metric design for video quality measurement in wireless application.’ 11th IEEE International Conference on Communication Technology (ICCT), November 2008, pp. 641–644.
  94. Zhai, G., Zhang, W., Yang, X., and Xu, Y., ‘Image quality assessment metrics based on multi-scale edge presentation.’ IEEE Workshop on Signal Processing Systems Design and Implementation, November 2005, pp. 331–336.
  95. Narwaria, M., Lin, W., McLoughlin, I.V., Emmanuel, S., and Chia, L.-T., ‘Fourier transform-based scalable image quality measure.’ IEEE Transactions on Image Processing, 21(8), 2012, 3364–3377.
  96. Chono, K., Lin, Y.-C., Varodayan, D., Miyamoto, Y., and Girod, B., ‘Reduced-reference image quality assessment using distributed source coding.’ IEEE International Conference on Multimedia and Expo, April 2008, pp. 609–612.
  97. Soundararajan, R. and Bovik, A.C., ‘Video quality assessment by reduced reference spatio-temporal entropic differencing.’ IEEE Transactions on Circuits and Systems for Video Technology, 23(4), 2013, 684–694.
  98. Soundararajan, R. and Bovik, A.C., ‘RRED indices: Reduced reference entropic differencing for image quality assessment.’ IEEE Transactions on Image Processing, 21(2), 2012, 517–526.
  99. Redi, J.A., Gastaldo, P., Heynderickx, I., and Zunino, R., ‘Color distribution information for the reduced-reference assessment of perceived image quality.’ IEEE Transactions on Circuits and Systems for Video Technology, 20(12), 2010, 1757–1769.
  100. Zeng, K. and Wang, Z., ‘Temporal motion smoothness measurement for reduced-reference video quality assessment.’ IEEE International Conference on Acoustics, Speech, and Signal Processing, March 2010, pp. 1010–1013.
  101. Cheng, G. and Cheng, L., ‘Reduced reference image quality assessment based on dual derivative priors.’ Electronics Letters, 45(18), 2009, 937–939.
  102. Ma, L., Li, S., Zhang, F., and Ngan, K.N., ‘Reduced-reference image quality assessment using reorganized DCT-based image representation.’ IEEE Transactions on Multimedia, 13(4), 2011, 824–829.
  103. Albonico, A., Valenzise, G., Naccari, M., Tagliasacchi, M., and Tubaro, S., ‘A reduced-reference video structural similarity metric based on no-reference estimation of channel-induced distortion.’ IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), April 2009, pp. 1857–1860.
  104. Wang, X., Jiang, G., and Yu, M., ‘Reduced reference image quality assessment based on contourlet domain and natural image statistics.’ 5th International Conference on Image and Graphics (ICIG), September 2009, pp. 45–50.
  105. Wang, Z. and Simoncelli, E.P., ‘Reduced-reference image quality assessment using a wavelet-domain natural image statistic model.’ Proceedings of SPIE: Human Vision and Electronic Imaging X, Vol. 5666, January 2005.
  106. Li, Q. and Wang, Z., ‘Reduced-reference image quality assessment using divisive normalization-based image representation.’ IEEE Journal on Selected Topics in Signal Processing, 3(2), 2009, 202–211.
  107. Atzori, L., Ginesu, G., Giusto, D.D., and Floris, A., ‘Streaming video over wireless channels: Exploiting reduced-reference quality estimation at the user-side.’ Signal Processing: Image Communication, 27(10), 2012, 1049–1065.
  108. Atzori, L., Ginesu, G., Giusto, D.D., and Floris, A., ‘Rate control based on reduced-reference image quality estimation for streaming video over wireless channels.’ IEEE International Conference on Communications (ICC), June 2012, pp. 2021–2025.
  109. Audhkhasi, K. and Kumar, A., ‘Two scale auditory feature based nonintrusive speech quality evaluation.’ IETE Journal of Research, 56(2), 2010, 111–118.
  110. Hemami, S. and Reibman, A., ‘No-reference image and video quality estimation: Applications and human-motivated design.’ Signal Processing: Image Communication, 25(7), 2010, 469–481.
  111. Narwaria, M. and Lin, W., ‘SVD-based quality metric for image and video using machine learning.’ IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 42(2), 2012, 347–364.
  112. Turaga, D.S., Chen, Y., and Caviedes, J., ‘No reference PSNR estimation for compressed pictures.’ Signal Processing: Image Communication, 19, 2004, 173–184.
  113. Au, O.L. and Lam, K., ‘A novel output-based objective speech quality measure for wireless communication.’ Proceedings of 4th International Conference on Signal Processing, Vol. 1, 1998, pp. 666–669.
  114. Ichigaya, A., Nishida, Y., and Nakasu, E., ‘Non reference method for estimating PSNR of MPEG-2 coded video by using DCT coefficients and picture energy.’ IEEE Transactions on Circuits and Systems for Video Technology, 18(6), 2008, 817–826.
  115. Falk, T., Xu, Q., and Chan, W.Y., ‘Non-intrusive GMM-based speech quality measurement.’ Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005, pp. 125–128.
  116. Brandao, T. and Queluz, M.P., ‘No-reference image quality assessment based on DCT domain statistics.’ Signal Processing, 88, 2008, 822–833.
  117. Eden, A., ‘No-reference estimation of the coding PSNR for H.264-coded sequences.’ IEEE Transactions on Consumer Electronics, 53(2), 2007, 667–674.
  118. Choe, J. and Lee, C., ‘Estimation of the peak signal-to-noise ratio for compressed video based on generalized Gaussian modelling.’ Optical Engineering, 46(10), 2007, 107401.
  119. Chen, G. and Parsa, V., ‘Bayesian model based non-intrusive speech quality evaluation.’ Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005, pp. 385–388.
  120. Shim, S.-Y., Moon, J.-H., and Han, J.-K., ‘PSNR estimation scheme using coefficient distribution of frequency domain in H.264 decoder.’ Electronics Letters, 44(2), 2008, 108–109.
  121. Reibman, A.R., Vaishampayan, V.A., and Sermadevi, Y., ‘Quality monitoring of video over a packet network.’ IEEE Transactions on Multimedia, 6(2), 2004, 327–334.
  122. Tobias, J., Foundations of Modern Auditory Theory. Academic Press, New York, 1970.
  123. Grancharov, V., David, Y., Jonas, L., and Bastiaan, W., ‘Low complexity nonintrusive speech quality assessment.’ IEEE Transactions on Speech and Audio Processing, 14(6), 2006, 1948–1956.
  124. Nishikawa, K., Munadi, K., and Kiya, H., ‘No-reference PSNR estimation for quality monitoring of motion JPEG2000 video over lossy packet networks.’ IEEE Transactions on Multimedia, 10(4), 2008, 637–645.
  125. Narwaria, M., Lin, W., McLoughlin, I.V., Emmanuel, S., and Chia, L.-T., ‘Nonintrusive quality assessment of noise suppressed speech with mel-filtered energies and support vector regression.’ IEEE Transactions on Audio, Speech, and Language Processing, 20(4), 2012, 1217–1232.
  126. Li, X., ‘Blind image quality assessment.’ IEEE International Conference on Image Processing, 2002.
  127. Falk, T., Yuan, H., and Chan, W.Y., ‘Single-ended quality measurement of noise suppressed speech based on Kullback–Leibler distances.’ Journal of Multimedia, 2(5), 2007, 19–26.
  128. Falk, T. and Chan, W.Y., ‘Single-ended speech quality measurement using machine learning methods.’ IEEE Transactions on Audio, Speech, and Language Processing, 14(6), 2006, 1935–1947.
  129. Kayargadde, V. and Martens, J.-B., ‘An objective measure for perceived noise.’ Signal Processing, 49(3), 1996, 187–206.
  130. Jesteadt, W., Wier, C., and Green, D., ‘Intensity discrimination as a function of frequency and sensation level.’ Journal of the Acoustical Society of America, 61(1), 1977, 169–177.
  131. Painter, T. and Spanias, A., ‘Perceptual coding of digital audio.’ Proceedings of the IEEE, 88(4), 2000, 451–513.
  132. Suthaharan, S., ‘A perceptually significant block-edge impairment metric for digital video coding.’ Proceedings of International Conference on Acoustics, Speech, and Signal Processing, 2003, pp. III-681–III-684.
  133. Ha, K. and Kim, M., ‘A perceptual quality assessment metric using temporal complexity and disparity information for stereoscopic video.’ IEEE International Conference on Image Processing, September 2011, pp. 2525–2528.
  134. Liu, S. and Bovik, A.C., ‘Efficient DCT-domain blind measurement and reduction of blocking artifacts.’ IEEE Transactions on Circuits and Systems for Video Technology, 12(12), 2002, 1139–1149.
  135. Zhai, G., Zhang, W., Yang, X., Lin, W., and Xu, Y., ‘No-reference noticeable blockiness estimation in images.’ Signal Processing: Image Communication, 23, 2008, 417–432.
  136. Meesters, L. and Martens, J.-B., ‘A single-ended blockiness measure for JPEG-coded images.’ Signal Processing, 82, 2002, 369–387.
  137. Ferzli, R. and Karam, L.J., ‘A no-reference objective image sharpness metric based on the notion of just noticeable blur JNB.’ IEEE Transactions on Image Processing, 18(4), 2009, 717–728.
  138. Marziliano, P., Dufaux, F., Winkler, S., and Ebrahimi, T., ‘Perceptual blur and ringing metrics: Application to JPEG2000.’ Signal Processing: Image Communication, 19, 2004, 163–172.
  139. Kanumuri, S., Cosman, P.C., Reibman, A.R., and Vaishampayan, V.A., ‘Modeling packet-loss visibility in MPEG-2 video.’ IEEE Transactions on Multimedia, 8(2), 2006, 341–355.
  140. Ong, E., Lin, W., Lu, Z., Yang, X., Yao, S., Jiang, L., and Moschetti, F., ‘A no-reference quality metric for measuring image blur.’ IEEE International Symposium on Signal Processing and its Applications, 2003, pp. 469–472.
  141. Gu, K., Zhai, G., Yang, X., and Zhang, W., ‘No-reference stereoscopic IQA approach: From nonlinear effect to parallax compensation.’ Journal of Electrical and Computer Engineering, 2012, 1.
  142. Winkler, S. and Faller, C., ‘Audiovisual quality evaluation of low-bitrate video.’ Proceedings of SPIE Human Vision and Electronic Imaging, Vol. 5666, January 2005, pp. 139–148.
  143. Marichal, X., Ma, W.-Y., and Zhang, H.-J., ‘Blur determination in the compressed domain using DCT information.’ IEEE International Conference on Image Processing, 1999, pp. 386–390.
  144. Yang, K.-C., Guest, C.C., and Das, P.K., ‘Perceptual sharpness metric (PSM) for compressed video.’ IEEE International Conference on Multimedia and Expo, 2006.
  145. Blanchet, G., Moisan, L., and Rouge, B., ‘Measuring the global phase coherence of an image.’ IEEE International Conference on Image Processing, 2008, pp. 1176–1179.
  146. Feng, X. and Allebach, J.P., ‘Measurement of ringing artifacts in JPEG images.’ SPIE, Vol. 6076, 2006.
  147. Liu, H., Klomp, N., and Heynderickx, I., ‘A no-reference metric for perceived ringing.’ International Workshop on Video Processing and Quality Metrics, 2009.
  148. Sheikh, H.R., Bovik, A.C., and Cormak, L., ‘No-reference quality assessment using natural scene statistics: JPEG 2000.’ IEEE Transactions on Image Processing, 14(11), 2005, 1918–1927.
  149. Susstrunk, S.E. and Winkler, S., ‘Color image quality on the Internet.’ SPIE, Vol. 5304, 2004.
  150. Davis, A.G., Bayart, D., and Hands, D.S., ‘Hybrid no-reference video quality prediction.’ IEEE International Symposium on Broadband Multimedia Systems, 2009.
  151. Hands, D., Bayart, D., Davis, A., and Bourret, A., ‘No reference perceptual quality metrics: Approaches and limitations.’ Human Vision and Electronic Imaging XIV, 2009.
  152. Engelke, U. and Zepernick, H.-J., ‘Pareto optimal weighting of structural impairments for wireless imaging quality assessment.’ IEEE International Conference on Image Processing, 2008, pp. 373–376.
  153. Kuszpet, Y., Kletsel, D., Moshe, Y., and Levy, A., ‘Post-processing for flicker reduction in H.264/AVC.’ Picture Coding Symposium, 2007.
  154. Pastrana-Vidal, R.R. and Gicquel, J.-C., ‘Automatic quality assessment of video fluidity impairments using a no-reference metric.’ International Workshop on Video Processing and Quality Metrics, 2006.
  155. ‘Single-ended method for objective speech quality assessment in narrow-band telephony applications.’ ITU-T P.563, 2004.
  156. Yang, K.-C., Guest, C.C., El-Maleh, K., and Das, P.K., ‘Perceptual temporal quality metric for compressed video.’ IEEE Transactions on Multimedia, 9(7), 2007, 1528–1535.
  157. Bradley, A.P. and Stentiford, F.W.M., ‘Visual attention for region of interest coding in JPEG 2000.’ Journal of Visual Communication and Image Representation, 14(3), 2003, 232–250.
  158. Wolfgang, R.B., Podilchuk, C.I., and Delp, E.J., ‘Perceptual watermarks for digital images and video.’ Proceedings of IEEE, 87(7), 1999, 1108–1126.
  159. Frossard, P. and Verscheure, O., ‘Joint source/FEC rate selection for quality-optimal MPEG-2 video delivery.’ IEEE Transactions on Image Processing, 10(12), 2001, 1815–1825.
  160. Ramasubramanian, M., Pattanaik, S.N., and Greenberg, D.P., ‘A perceptual based physical error metric for realistic image synthesis.’ Computer Graphics (SIGGRAPH ‘99 Conference Proceedings), 33(4), 1999, 73–82.
  161. Wandell, B., Foundations of Vision. Sinauer Associates, Sunderland, MA, 1995.
  162. Kelly, D.H., ‘Motion and vision II: Stabilized spatiotemporal threshold surface.’ Journal of the Optical Society of America, 69(10), 1979, 1340–1349.
  163. Moorthy, A.K. and Bovik, A.C., ‘A survey on 3D quality of experience and 3D quality assessment.’ SPIE Proceedings: Human Vision and Electronic Imaging, 2013.
  164. Legge, G.E. and Foley, J.M., ‘Contrast masking in human vision.’ Journal of the Optical Society of America, 70, 1980, 1458–1471.
  165. Yang, X., Lin, W., Lu, Z., Ong, E., and Yao, S., ‘Motion-compensated residue preprocessing in video coding based on just-noticeable-distortion profile.’ IEEE Transactions on Circuits and Systems for Video Technology, 15(6), 2005, 742–750.
  166. Poirson, A.B. and Wandell, B.A., ‘Pattern-color separable pathways predict sensitivity to simple colored patterns.’ Vision Research, 36(4), 1996, 515–526.
  167. Zhang, X. and Wandell, B.A., ‘Color image fidelity metrics evaluated using image distortion maps.’ Signal Processing, 70(3), 1998, 201–214.
  168. Yang, X., Lin, W., Lu, Z., Ong, E., and Yao, S., ‘Just noticeable distortion model and its applications in video coding.’ Signal Processing: Image Communication, 20(7), 2005, 662–680.
  169. Wu, J., Lin, W., Shi, G., and Liu, A., ‘Perceptual quality metric with internal generative mechanism.’ IEEE Transactions on Image Processing, 22(1), 2013, 43–54.
  170. Winkler, S., ‘Quality metric design: A closer look.’ SPIE Proceedings: Human Vision and Electronic Imaging Conference, Vol. 3959, 2000, pp. 37–44.
  171. Yuen, M. and Wu, H.R., ‘A survey of MC/DPCM/DCT video coding distortions.’ Signal Processing, 70(3), 1998, 247–278.
  172. Fredericksen, R.E. and Hess, R.F., ‘Estimating multiple temporal mechanisms in human vision.’ Vision Research, 38(7), 1998, 1023–1040.
  173. Daugman, J.G., ‘Two-dimensional spectral analysis of cortical receptive field profiles.’ Vision Research, 20(10), 1980, 847–856.
  174. Watson, A.B., ‘The cortex transform: Rapid computation of simulated neural images.’ Computer Vision, Graphics, and Image Processing, 39(3), 1987, 311–327.
  175. Burt, P.J. and Adelson, E.H., ‘The Laplacian pyramid as a compact image code.’ IEEE Transactions on Communications, 31(4), 1983, 532–540.
  176. Simoncelli, E.P., Freeman, W.T., Adelson, E.H., and Heeger, D.J., ‘Shiftable multi-scale transforms.’ IEEE Transactions on Information Theory, 38(2), 1992, 587–607.
  177. Moore, B.C.J., An Introduction to the Psychology of Hearing, 4th edn. Academic Press, Norwell, MA, 1997.
  178. ‘Method for objective measurements of perceived audio quality.’ ITU-R BS.1387, 1999.
  179. Thiede, T. and Kabot, E., ‘A new perceptual quality measure for bit rate reduced audio.’ Proceedings of 100th Audio Engineering Society Convention, 1996.
  180. Thiede, T., ‘Perceptual audio quality assessment using a non-linear filter bank.’ PhD Thesis, Fachbereich Elektrotechnik, Technical University of Berlin, 1999.
  181. Hubel, D.H., Eye, Brain, and Vision. W.H. Freeman, New York, 1988.
  182. Elder, J.H. and Zucker, S.W., ‘Local scale control for edge detection and blur estimation.’ IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(7), 1998, 699–716.
  183. Huynh-Thu, Q. and Ghanbari, M., ‘Temporal aspect of perceived quality of mobile video broadcasting.’ IEEE Transactions on Broadcasting, 54(3), 2008, 641–651.
  184. Verscheure, O., Frossard, P., and Hamdi, M., ‘User-oriented QoS analysis in MPEG-2 delivery.’ Real-Time Imaging, 5(5), 1999, 305–314.
  185. Liang, J. and Kubichek, R., ‘Output-based objective speech quality.’ Proceedings of IEEE Vehicular Technology Conference, Stockholm, Sweden, 1994, pp. 1719–1723.
  186. Zhang, L., Zhang, L., Mou, X., and Zhang, D., ‘FSIM: A feature similarity index for image quality assessment.’ IEEE Transactions on Image Processing, 20(8), 2011, 2378–2386.
  187. Larson, E.C. and Chandler, D.M., ‘Most apparent distortion: Full reference image quality assessment and the role of strategy.’ Journal of Electronic Imaging, 19(1), 2010, 011006-1–011006-21.
  188. Narwaria, M., Lin, W., and Çetin, A.E., ‘Scalable image quality assessment with 2D mel-cepstrum and machine learning approach.’ Pattern Recognition, 45(1), 2012, 299–313.
  189. Liu, T.-J., Lin, W., and Kuo, C.-C.J., ‘Image quality assessment using multi-method fusion.’ IEEE Transactions on Image Processing, 22(5), 2013, 1793–1807.
  190. Huynh-Thu, Q., Barkowsky, M., and Le Callet, P., ‘The importance of visual attention in improving the 3D-TV viewing experience: Overview and new perspectives.’ IEEE Transactions on Broadcasting, 57(2), 2011, 421–431.
  191. Peli, E., ‘Contrast in complex images.’ Journal of the Optical Society of America A, 7(10), 1990, 2032–2040.
  192. Winkler, S. and Vandergheynst, P., ‘Computing isotropic local contrast from oriented pyramid decompositions.’ Proceedings of International Conference on Image Processing, 1999, pp. 420–424.
  193. Lai, Y.-K. and Kuo, C.-C.J., ‘A Haar wavelet approach to compressed image quality measurement.’ Journal of Visual Communication and Image Representation, 11(1), 2000, 17–40.
  194. Lu, Z., Lin, W., Yang, X., Ong, E., and Yao, S., ‘Modeling visual attention's modulatory aftereffects on visual sensitivity and quality evaluation.’ IEEE Transactions on Image Processing, 14(11), 2005, 1928–1942.
  195. Ong, E., Yang, X., Lin, W., et al., ‘Perceptual quality and objective quality measurements of compressed videos.’ Journal of Visual Communication and Image Representation, 17(4), 2006, 717–737.
  196. Tan, K.T. and Ghanbari, M., ‘Blockiness detection for MPEG2-coded video.’ IEEE Signal Processing Letters, 7(8), 2000, 213–215.
  197. Kundur, D. and Hatzinakos, D., ‘Blind image deconvolution.’ IEEE Signal Processing Magazine, 13, 1996, 43–63.
  198. Wu, S., Lin, W., Xie, S., Lu, Z., Ong, E., and Yao, S., ‘Blind blur assessment for vision based applications.’ Journal of Visual Communication and Image Representation, 20(4), 2009, 231–241.
  199. Winkler, S., ‘Visual fidelity and perceived quality: Towards comprehensive metrics.’ Proceedings of SPIE, 4299, 2001, 114–125.
  200. Pastrana-Vidal, R., Gicquel, J., Colomes, C., and Cherifi, H., ‘Sporadic frame dropping impact on quality perception.’ SPIE Proceedings: The International Society for Optical Engineering, Vol. 5292, 2004.
  201. Lin, W., ‘Computational models for just-noticeable difference.’ In Wu, H.R. and Rao, K.R. (eds), Digital Video Image Quality and Perceptual Coding. CRC Press, Boca Raton, FL, 2006.
  202. Lu, Z., Lin, W., Boon, C.S., Kato, S., Ong, E., and Yao, S., ‘Perceptual quality evaluation on periodic frame-dropping video.’ IEEE International Conference on Image Processing (ICIP), 2007.
  203. Montenovo, M., Perot, A., Carli, M., Cicchetti, P., and Neri, A., ‘Objective quality evaluation of video services.’ Third International Workshop on Video Processing and Quality Metrics for Consumer Electronics, January 2006.
  204. Suresh, N., Jayant, N., and Yang, O., ‘Mean time between failures: A subjectively meaningful quality metric for consumer video.’ Third International Workshop on Video Processing and Quality Metrics for Consumer Electronics, January 2006.
  205. Zhang, X., Lin, W., and Xue, P., ‘Improved estimation for just-noticeable visual distortion.’ Signal Processing, 85(4), 2005, 795–808.
  206. Chou, C.H. and Li, Y.C., ‘A perceptually tuned subband image coder based on the measure of just-noticeable-distortion profile.’ IEEE Transactions on Circuits and Systems for Video Technology, 5(6), 1995, 467–476.
  207. Tong, H.Y. and Venetsanopoulos, A.N., ‘A perceptual model for jpeg applications based on block classification, texture masking, and luminance masking.’ Proceedings of the IEEE International Conference on Image Processing (ICIP), Vol. 3, 1998.
  208. Hontsch, I. and Karam, L.J., ‘Adaptive image coding with perceptual distortion control.’ IEEE Transactions on Image Processing, 11(3), 2002, 213–222.
  209. Daly, S., ‘Engineering observations from spatiovelocity and spatiotemporal visual models.’ In van den Branden Lambrecht, C.J. (ed.), Vision Models and Applications to Image and Video Processing. Kluwer Academic, Norwell, MA, 2001.
  210. Jia, Y., Lin, W., and Kassim, A.A., ‘Estimating just-noticeable distortion for video.’ IEEE Transactions on Circuits and Systems for Video Technology, 16(7), 2006, 820–829.
  211. Ahumada, A.J. and Peterson, H.A., ‘Luminance-model-based DCT quantization for color image compression.’ SPIE Proceedings: Human Vision, Visual Processing, and Digital Display III, 1992, pp. 365–374.
  212. Jayant, N., Johnston, J., and Safranek, R., ‘Signal compression based on models of human perception.’ Proceedings of the IEEE, 81, 1993, 1385–1422.
  213. Wang, Z., Bovik, A.C., and Lu, L., ‘Wavelet-based foveated image quality measurement for region of interest image coding.’ Proceedings of International Conference on Image Processing, Vol. 2, 2001, pp. 89–92.
  214. Ahumada, A.J. and Krebs, W.K., ‘Masking in color images.’ SPIE Proceedings: Human Vision and Electronic Imaging VI, 2001, p. 4299.
  215. Chiu, Y.J. and Berger, T., ‘A software-only videocodec using pixelwise conditional differential replenishment and perceptual enhancements.’ IEEE Transactions on Circuits and Systems for Video Technology, 9(3), 1999, 438–450.
  216. Lin, W., Gai, Y., and Kassim, A.A., ‘A study on perceptual impact of edge sharpness in images.’ IEE Proceedings on Vision, Image, and Signal Processing, 153(2), 2006, 215–223.
  217. Chou, C.H. and Chen, C.W., ‘A perceptually optimized 3-D subband image codec for video communication over wireless channels.’ IEEE Transactions on Circuits and Systems for Video Technology, 6(2), 1996, 143–156.
  218. Zhang, X., Lin, W., and Xue, P., ‘Just-noticeable difference estimation with pixels in images.’ Journal of Visual Communication and Image Representation, 19(1), 2008, 30–41.
  219. Kollmeier, B., Brand, T., and Meyer, B., ‘Perception of speech and sound.’ In Benesty, J., Mohan Sondhi, M., and Huang, Y. (eds), Springer Handbook of Speech Processing. Springer-Verlag, Berlin, 2008, p. 65.
  220. Riesz, R., ‘Differential intensity sensitivity of the ear for pure tones.’ Physical Review, 31(5), 1928, 867–875.
  221. Zwicker, E. and Fastl, H., Psychoacoustics: Facts and Models. Springer-Verlag, Berlin, 1999.
  222. Plomp, R., ‘Rate of decay of auditory sensation.’ Journal of the Acoustical Society of America, 36(2), 1964, 277–282.
  223. Chun, M.M. and Wolfe, J.M., ‘Visual attention.’ In Goldstein, B. (ed.), Blackwell Handbook of Perception. Blackwell, Oxford, 2001, pp. 272–310.
  224. Posner, M.I., ‘Orienting of attention.’ Quarterly Journal of Experimental Psychology, 32, 1980, 2–25.
  225. Pashler, H.E., The Psychology of Attention. MIT Press, Boston, MA, 1998.
  226. Treisman, A.M. and Gelade, G., ‘A feature integration theory of attention.’ Cognitive Psychology, 12(1), 1980, 97–136.
  227. Desimone, R. and Duncan, J., ‘Neural mechanisms of selective visual attention.’ Annual Review of Neuroscience, 18, 1995, 193–222.
  228. Hopfinger, J.B., Buonocore, M.H., and Mangun, G.R., ‘The neural mechanisms of top-down attentional control.’ Nature Neuroscience, 3, 2000, 284–291.
  229. Navalpakkam, V. and Itti, L., ‘Top-down attention selection is fine-grained.’ Journal of Vision, 6(11), 2006, 1180–1193.
  230. Itti, L., Koch, C., and Niebur, E., ‘A model of saliency-based visual attention for rapid scene analysis.’ IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11), 1998, 1254–1259.
  231. Hou, X. and Zhang, L., ‘Saliency detection: A spectral residual approach.’ IEEE Conference on Computer Vision and Pattern Recognition, 2007.
  232. Borji, A. and Itti, L., ‘State-of-the-art in visual attention modeling.’ IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 2013, 185–207.
  233. Fang, Y., Chen, Z., Lin, W., and Lin, C.-W., ‘Saliency detection in the compressed domain for adaptive image retargeting.’ IEEE Transactions on Image Processing, 21(9), 2012, 3888–3901.
  234. Fang, Y., Lin, W., Lee, B.-S., Lau, C.T., Chen, Z., and Lin, C.-W., ‘Bottom-up saliency detection model based on human visual sensitivity and amplitude spectrum.’ IEEE Transactions on Multimedia, 14(1), 2012, 187–198.
  235. Ma, Y.-F., Hua, X.-S., Lu, L., and Zhang, H.-J., ‘A generic framework of user attention model and its application in video summarization.’ IEEE Transactions on Multimedia, 7(5), 2005, 907–919.
  236. Fang, Y., Lin, W., Chen, Z., Tsai, C.-M., and Lin, C.-W., ‘Video saliency detection in compressed domain.’ IEEE Transactions on Circuits and Systems for Video Technology, 24(1), 2014, 27–38.
  237. Wang, Z., Lu, L., and Bovik, A.C., ‘Foveation scalable video coding with automatic fixation selection.’ IEEE Transactions on Image Processing, 12, 2003, 1703–1705.
  238. Kalinli, O. and Narayanan, S., ‘A saliency-based auditory attention model with applications to unsupervised prominent syllable detection in speech.’ Proceedings of Interspeech, 2007.
  239. Wrigley, S.N. and Brown, G.J., ‘A computational model of auditory selective attention.’ IEEE Transactions on Neural Networks, 15(5), 2004, 1151–1163.
  240. Kayser, C., Petkov, C.I., Lippert, M., and Logothetis, N.K., ‘Mechanisms for allocating auditory attention: An auditory saliency map.’ Current Biology, 15(21), 2005, 1943–1947.
  241. Damera-Venkata, N., Kite, T.D., Geisler, W.S., Evans, B.L., and Bovik, A.C., ‘Image quality assessment based on a degradation model.’ IEEE Transactions on Image Processing, 9(4), 2000, 636–650.
  242. Wang, Z., Simoncelli, E.P., and Bovik, A.C., ‘Multi-scale structural similarity for image quality assessment.’ Proceedings of Asilomar Conference on Signals, Systems and Computers, Vol. 2, 2003.
  243. Horé, A. and Ziou, D., ‘Is there a relationship between peak-signal-to-noise ratio and structural similarity index measure?’ IET Image Processing, 7(1), 2013, 12–24.
  244. Liu, A., Lin, W., and Narwaria, M., ‘Image quality assessment based on gradient similarity.’ IEEE Transactions on Image Processing, 21(4), 2012, 1500–1512.
  245. Zhai, G., Wu, X., Yang, X., Lin, W., and Zhang, W., ‘A psychovisual quality metric in free-energy principle.’ IEEE Transactions on Image Processing, 21(1), 2012, 41–52.
  246. Winkler, S. and Min, D., ‘Stereo/multiview picture quality: Overview and recent advances.’ Signal Processing: Image Communication, 28(10), 2013, 1358–1373.
  247. Huynh-Thu, Q., Le Callet, P., and Barkowsky, M., ‘Video quality assessment: From 2D to 3D – challenges and future trends.’ IEEE International Conference on Image Processing, 2010.
  248. Benoit, A., Le Callet, P., Campisi, P., and Cousseau, R., ‘Quality assessment of stereoscopic images.’ EURASIP Journal on Image and Video Processing, 2008, 2009, 1–13.
  249. You, J., Xing, L., Perkis, A., and Wang, X., ‘Perceptual quality assessment for stereoscopic images based on 2D image quality metrics and disparity analysis.’ Proceedings of International Workshop on Video Processing and Quality Metrics, 2010.
  250. Yang, J., Hou, C., Zhou, Y., Zhang, Z., and Guo, J., ‘Objective quality assessment method of stereo images.’ 3DTV Conference: The True Vision-Capture, Transmission and Display of 3D Video, 2009, pp. 1–4.
  251. Shen, L., Yang, J., and Zhang, Z., ‘Stereo picture quality estimation based on a multiple channel HVS model.’ IEEE International Congress on Image and Signal Processing, 2009.
  252. Lambooij, M., IJsselsteijn, W., Bouwhuis, D., and Heynderickx, I., ‘Evaluation of stereoscopic images: Beyond 2D quality.’ IEEE Transactions on Broadcasting, 57(2), 2011, 432–444.
  253. Shao, F., Lin, W., Gu, S., Jiang, G., and Srikanthan, T., ‘Perceptual full-reference quality assessment of stereoscopic images by considering binocular visual characteristics.’ IEEE Transactions on Image Processing, 22(5), 2013, 1940–1953.
  254. Sazzad, Z.M.P., Yamanaka, S., Kawayoke, Y., and Horita, Y., ‘Stereoscopic image quality prediction.’ International Workshop on Quality of Multimedia Experience (QoMEX), 2009.
  255. van den Branden Lambrecht, C.J. and Verscheure, O., ‘Perceptual quality measure using a spatio-temporal model of the human visual system.’ Proceedings of SPIE Digital Video Compression: Algorithms and Technologies, Vol. 2668, 1996, pp. 450–461.
  256. Wang, Z., Lu, L., and Bovik, A., ‘Video quality assessment based on structural distortion measurement.’ Signal Processing: Image Communication, 19(2), 2004, 121–132.
  257. Lu, L., Wang, Z., Bovik, A., and Kouloheris, J., ‘Full-reference video quality assessment considering structural distortion and no-reference quality evaluation of MPEG video.’ Proceedings of IEEE International Conference on Multimedia and Expo, 2002.
  258. Wang, Z. and Li, Q., ‘Video quality assessment using a statistical model of human visual speed perception.’ Journal of the Optical Society of America A: Optics, Image Science, and Vision, 24(12), 2007, B61–B69.
  259. Stocker, A.A. and Simoncelli, E.P., ‘Noise characteristics and prior expectations in human visual speed perception.’ Nature Neuroscience, 9, 2006, 578–585.
  260. Tao, P. and Eskicioglu, A.M., ‘Video quality assessment using M-SVD.’ Proceedings of the International Society of Optical Engineers, 2007.
  261. Bhat, A., Richardson, I., and Kannangara, S., ‘A new perceptual quality metric for compressed video.’ IEEE Conference on Acoustics, Speech, and Signal Processing, 2009.
  262. Lee, C. and Kwon, O., ‘Objective measurements of video quality using the wavelet transform.’ Optical Engineering, 42(1), 2003, 265–272.
  263. Seshadrinathan, K. and Bovik, A.C., ‘Motion tuned spatio-temporal quality assessment of natural videos.’ IEEE Transactions on Image Processing, 19(2), 2010, 335–350.
  264. Ong, E., Yang, X., Lin, W., Lu, Z., and Yao, S., ‘Video quality metric for low bitrate compressed video.’ Proceedings of International Conference on Image Processing, 2004.
  265. Ong, E., Lin, W., Lu, Z., and Yao, S., ‘Colour perceptual video quality metric.’ Proceedings of International Conference on Image Processing, 2006.
  266. Ndjiki-Nya, P., Barrado, M., and Wiegand, T., ‘Efficient full-reference assessment of image and video quality.’ Proceedings of International Conference on Image Processing, 2007.
  267. Pinson, M. and Wolf, S., ‘Application of the NTIA general video quality metric VQM to HDTV quality monitoring.’ 3rd International Workshop on Video Processing and Quality Metrics for Consumer Electronics (VPQM2007), 2007.
  268. Sugimoto, O., Naito, S., Sakazawa, S., and Koike, A., ‘Objective perceptual picture quality measurement method for high-definition video based on full reference framework.’ Proceedings of the International Society of Optical Engineers, Vol. 7242, 2009.
  269. Okamoto, J., Watanabe, K., Honda, A., Uchida, M., and Hangai, S., ‘HDTV objective video quality assessment method applying fuzzy measure.’ Proceedings of International Workshop on Quality of Multimedia Experience (QoMEX), 2009.
  270. Narwaria, M., Lin, W., and Liu, A., ‘Low-complexity video quality assessment using temporal quality variations.’ IEEE Transactions on Multimedia, 14(3–1), 2012, 525–535.
  271. Tan, K.T. and Ghanbari, M., ‘A multimetric objective picture-quality measurement model for MPEG video.’ IEEE Transactions on Circuits and Systems for Video Technology, 10(7), 2000, 1208–1213.
  272. Song, L., Tang, X., Zhang, W., Yang, X., and Xia, P., ‘The SJTU 4K video sequence dataset.’ Proceedings of International Workshop on Quality of Multimedia Experience (QoMEX), 2013.
  273. Bae, S.-H., Kim, J., Kim, M., Cho, S., and Choi, J.S., ‘Assessments of subjective video quality on HEVC-encoded 4K-UHD video for beyond-HDTV broadcasting services.’ IEEE Transactions on Broadcasting, 59(2), 2013, 209–222.
  274. Hanhart, P., Korshunov, P., and Ebrahimi, T., ‘Benchmarking of quality metrics on ultra-high definition video sequences.’ International Conference on Digital Signal Processing, 2013.
  275. Babu, R.V., Bopardikar, A.S., Perkis, A., and Hillestad, O.I., ‘No-reference metrics for video streaming applications.’ International Workshop on Packet Video, 2004.
  276. Yasakethu, S.L.P., Hewage, C.T.E.R., Fernando, W.A.C., and Kondoz, A.M., ‘Quality analysis for 3D video using 2D video quality models.’ IEEE Transactions on Consumer Electronics, 54(4), 2008, 1969–1976.
  277. Bosc, E., Pepion, R., Le Callet, P., et al., ‘Towards a new quality metric for 3-D synthesized view assessment.’ IEEE Journal of Selected Topics in Signal Processing, 5(7), 2011, 1332–1343.
  278. Boev, A., Gotchev, A., Egiazarian, K., Aksay, A., and Akar, G.B., ‘Towards compound stereo-video quality metric: A specific encoder-based framework.’ IEEE Southwest Symposium on Image Analysis and Interpretation, 2006.
  279. Han, J., Jiang, T., and Ma, S., ‘Stereoscopic video quality assessment model based on spatial-temporal structural information.’ VCIP, 2012.
  280. Zhu, Z. and Wan, Y., ‘Perceptual distortion metric for stereo video quality evaluation.’ WSEAS Transactions on Signal Processing, 5(7), 2009, 241–250.
  281. Keimel, C., Oelbaum, T., and Diepold, K.J., ‘No-reference video quality evaluation for high-definition video.’ IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2009.
  282. Chen, W., Fournier, J., Barkowsky, M., and Le Callet, P., ‘New requirements of subjective video quality assessment methodologies for 3DTV.’ Proceedings of VPQM, 2010.
  283. Kim, D.-S., ‘ANIQUE: An auditory model for single-ended speech quality estimation.’ IEEE Transactions on Speech and Audio Processing, 13(5), 2005, 821–831.
  284. Paillard, B., Mabilleau, P., Morissette, S., and Soumagne, J., ‘PERCEVAL: Perceptual evaluation of the quality of audio signals.’ Journal of the Audio Engineering Society, 40(1/2), 1992, 21–31.
  285. Rix, A.W., Hollier, M.P., Hekstra, A.P., and Beerends, J.G., ‘Perceptual evaluation of speech quality (PESQ), the new ITU standard for end-to-end speech quality assessment, Part I – Time-delay compensation.’ Journal of the Audio Engineering Society, 50(10), 2002, 755–764.
  286. Salmela, J. and Mattila, V.-V., ‘New intrusive method for the objective quality evaluation of acoustic noise suppression in mobile communications.’ Proceedings of the 116th Audio Engineering Society Convention, 2004.
  287. Steinmetz, R., ‘Human perception of jitter and media synchronization.’ IEEE Journal on Selected Areas in Communications, 14(1), 1996, 61–72.
  288. Arrighi, R., Alais, D., and Burr, D., ‘Perceptual synchrony of audiovisual streams for natural and artificial motion sequences.’ Journal of Vision, 6(3), 2006, 260–268.
  289. Lipscomb, S.D., ‘Cross-modal integration: Synchronization of auditory and visual components in simple and complex media.’ Proceedings of the Forum Acusticum, Berlin, 1999.
  290. Winkler, S. and Faller, C., ‘Perceived audiovisual quality of low-bitrate multimedia content.’ IEEE Transactions on Multimedia, 8(5), 2006, 973–980.
  291. Beerends, J.G. and de Caluwe, F.E., ‘The influence of video quality on perceived audio quality and vice versa.’ Journal of the Audio Engineering Society, 47(5), 1999, 355–362.
  292. Jones, C. and Atkinson, D.J., ‘Development of opinion-based audiovisual quality models for desktop video-teleconferencing.’ Proceedings of International Workshop on Quality of Service, Napa, CA, May 18–20, 1998, pp. 196–203.
  293. Ries, M., Puglia, R., Tebaldi, T., Nemethova, O., and Rupp, M., ‘Audiovisual quality estimation for mobile streaming services.’ Proceedings of International Symposium on Wireless Communication Systems, Siena, Italy, September 5–7, 2005.
  294. Jumisko-Pyykko, S., ‘I would like to see the subtitles and the face or at least hear the voice: Effects of picture ratio and audiovideo bitrate ratio on perception of quality in mobile television.’ Multimedia Tools and Applications, 36(1&2), 2008, 167–184.
  295. Hayashi, T., Yamagishi, K., Tominaga, T., and Takahashi, A., ‘Multimedia quality integration function for videophone services.’ Proceedings of the IEEE International Conference on Global Telecommunications, 2007, pp. 2735–2739.
  296. Goudarzi, M., Sun, L., and Ifeachor, E., ‘Audiovisual quality estimation for video calls in wireless applications.’ IEEE GLOBECOM, 2010.
  297. Hands, D.S., ‘A basic multimedia quality model.’ IEEE Transactions on Multimedia, 6(6), 2004, 806–816.
  298. Thang, T.C., Kang, J.W., and Ro, Y.M., ‘Graph-based perceptual quality model for audiovisual contents.’ Proceedings of the IEEE International Conference on Multimedia and Expo (ICME'07), Beijing, China, July 2007, pp. 312–315.
  299. Thang, T.C., Kim, Y.S., Kim, C.S., and Ro, Y.M., ‘Quality models for audiovisual streaming.’ Proceedings of SPIE: Electronic Imaging, Vol. 6059, 2006, pp. 1–10.
  300. Thang, T.C. and Ro, Y.M., ‘Multimedia quality evaluation across different modalities.’ Proceedings of SPIE: Electronic Imaging, Vol. 5668, 2005, pp. 270–279.

Acronyms

CSF  Contrast Sensitivity Function
DCT  Discrete Cosine Transform
DWT  Discrete Wavelet Transform
FFT  Fast Fourier Transform
FR  Full Reference
GMM  Gaussian Mixture Model
HAS  Human Auditory System
HDR  High Dynamic Range
HDTV  High-Definition Television
HEVC  High-Efficiency Video Coding
HVS  Human Visual System
IGM  Internal Generative Mechanism
JND  Just-Noticeable Difference
MAE  Mean Absolute Error
MMSE  Minimum Mean Square Error
MOS  Mean Opinion Score
MOV  Model Output Variable
MOVIE  Motion-Based Video Integrity Evaluation
MSE  Mean Square Error
MT  Middle Temporal
NR  No Reference
ODG  Overall Difference Grade
PEAQ  Perceptual Evaluation of Audio Quality
PSF  Point Spread Function
PSNR  Peak SNR
PSQM  Perceptual Speech Quality Measure
RR  Reduced Reference
SDTV  Standard-Definition Television
SNR  Signal-to-Noise Ratio
SSIM  Structural Similarity
SVD  Singular Value Decomposition
UHD  Ultra-High Definition
VA  Visual Attention
VDP  Visible Differences Predictor
VIF  Visual Information Fidelity
VSNR  Visual Signal-to-Noise Ratio
