Hong Ren Wu
RMIT, Australia
The discussions in previous chapters have highlighted an increasing emphasis on Quality of Experience (QoE) [1] compared with Quality of Service (QoS) [2] in audio-visual communication, broadcasting and entertainment applications, which signals a transition from technology-driven services to user-centric (or perceived) quality-assured services [3]. QoE as defined by ITU SG 12 is application or service specific and influenced by user expectations and context [1], and therefore necessitates assessments of perceived service quality and/or utility (or usefulness) of the service [4]. Subjective assessment and evaluation of the service are imperative to establish the ground truth for objective assessment and measures which aid in the design/optimization of devices, products, systems, and services in either the online or offline mode [1]. Assessment or prediction of QoE in multimedia services will have to take account of at least three major factors, including audio signal quality perception, visual signal quality perception, and interaction or integration of the perceived audio and visual signal quality [5–7], considering coding, transmission, and application/service conditions [3, 6, 8, 9]. This chapter focuses on the issues underpinning the theoretical frameworks/models and methodologies for QoE subjective and objective evaluation of visual signal communication services.
Subjective picture quality assessment methods for television and multimedia applications have been standardized [10, 11]. Issues relevant to human visual perception and quality scoring or rating are discussed in this chapter, while readers are referred to the standards documents and/or other monographs regarding specific details of the aforementioned standards [10–12].
The Human Visual System (HVS) can be modeled in either the pixel domain or the transform/sub-band decomposition domain, and the same can be said about picture quality metric formulations. Given a color image sequence or video as shown in Figure 6.1(a), x[n, i, ζ], N1 pixels high and N2 pixels wide, where n = [n1, n2] for 0 ⩽ n1 ⩽ N1 − 1 and 0 ⩽ n2 ⩽ N2 − 1 with I frames for 0 ⩽ i ⩽ I − 1 in tricolor space Ξ = {Y, CB, CR} (or Ξ = {R, G, B}) for ζ ∈ {1, 2, 3} corresponding to, for example, Y, CB, CR (or R, G, B) channels (cf. Figure 6.1(d) or (c)) [13, 14], respectively, its transform or decomposition is represented by X[k, b, j, ζ], as shown, for example, in Figure 6.1(b) for a Discrete Wavelet Transform (DWT) decomposition, where k = [k1, k2] defines the position (row and column indices) of a coefficient in a block of a frequency band b of slice j in the decomposition domain. For an s-level DWT decomposition, b = [s, θ], where θ ∈ {θ0|LL band, θ1|LH band, θ2|HL band, θ3|HH band} and s = 3 per frame, as shown in Figure 6.1(b). It is noted that, as shown in Figure 6.1(c) and (d), the three component channels in the YCBCR color space are better decorrelated than those in the RGB color space, facilitating picture compression. Other color spaces, such as opponent color spaces, which are considered perceptually uniform [15], have often been used in color picture quality assessment [16–18].
Quantitative assessment of visual signal quality as perceived by the HVS inevitably involves human observers participating in subjective tests which elicit quality ratings using one scale or another [4, 10–12]. Quantification of subjective test results is usually based on the Mean Opinion Score (MOS), which indicates the average rating value qualified by a measure of deviation (e.g., standard deviation, variance, minimum and maximum values, or a confidence interval), acknowledging the subjectivity and statistical nature of the assessment [10–12]. This section discusses a number of issues associated with human visual perception, which affect subjective rating outcomes and, thereafter, their reliability and relevance when used as the ground truth for objective or computational metric designs. Models or approaches to computational QoE metric designs to date [19–27] are analyzed to appreciate their theoretical grounding and strength in terms of prediction accuracy, consistency, and computational complexity, as well as their limitations, with respect to QoE assessment and prediction for picture codec design, network planning and performance optimization, and service performance assessment.
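The MOS and its qualifying deviation measure are simple to compute; the following Python sketch (an illustrative aid, not part of any standard) returns the MOS of one test condition together with a normal-approximation 95% confidence interval:

```python
import statistics

def mean_opinion_score(ratings, z=1.96):
    """Return the MOS and a normal-approximation confidence interval.

    `ratings` holds the opinion scores (e.g., 1-5 ACR values) collected
    from all observers for one test condition; z = 1.96 gives a ~95% CI.
    """
    mos = statistics.mean(ratings)
    sd = statistics.stdev(ratings)            # sample standard deviation
    half_width = z * sd / (len(ratings) ** 0.5)
    return mos, (mos - half_width, mos + half_width)

# Hypothetical ratings from ten observers for one processed sequence.
scores = [4, 5, 4, 3, 4, 5, 4, 4, 3, 5]
mos, ci = mean_opinion_score(scores)
```

Reporting the confidence interval alongside the MOS, as recommended in the subjective testing standards, makes the statistical spread of the panel visible rather than hiding it behind a single number.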
Low-level human vision has been characterized by spatial, temporal and color vision, and visual attention (or foveation). There are well-formulated HVS models which have been successfully used in Image (or Video) Quality Assessment (IQA or VQA) and evaluation [19–21] and perception-based picture coder designs [3]. It is noted that the majority of these models have their parameters derived from a threshold vision test to estimate or predict the Just-Noticeable Difference (JND) [28], with a few exceptions [29, 30]. When these threshold vision models are extended to supra-threshold experiments where most of the visual communications, broadcast, and entertainment applications to date apply [3], the selection of the model parameters relies heavily on a regression process to achieve the best fit to subjective test data [18]. Relevancy, accuracy, and reliability of subjective test data therefore directly affect the performance of objective QoE metrics [4, 7, 18, 25]. Four issues have emerged over the years of acquiring ground-truth subjective test data in terms of perceived picture quality and QoE, and are worth noting.
First and foremost, the picture quality perceived, and thereafter the ratings given, by human observers are affected by what they have seen or experienced prior to a specific subjective test. Contextual effects (considered as short-term or limited test sample pool effects) have been reported using standardized subjective test methods [31]. Affordability notwithstanding, users' expectations are understandably influenced by their benchmark experience or point of reference regarding what constitutes the “best” picture quality they have seen or experienced. This benchmark experience (a long-term or global sample pool effect) will drive subsequent ratings on a given quality scale. For an observer who had never viewed or experienced, for example, an uncompressed YCBCR 4:4:4 component color video of Standard Definition (SD) [13] or full High Definition (HD) [14] on a broadcast video monitor designed for critical picture evaluation, it would be highly uncertain what response one could hope to elicit from the observer when asked if any chroma distortion was present in the 4:2:2 or 4:2:0 component video in the subjective test. While chroma sub-sampling has been widely used in video products to take advantage of the HVS's insensitivity to chroma component signals as an initial step in image data compression, the chroma distortions so caused are not always as negligible as commonly believed, especially on quality (e.g., broadcast or professional-grade) video monitors. Figure 6.2 uses contrast-enhanced difference images between the uncompressed Barbara test image in component YCBCR 4:4:4 format [13, 14] and chroma sub-sampled images in component 4:2:2, 4:2:0, and 4:1:1 formats, respectively, to illustrate chromatic distortions. The same may be said about responses from observers used in an image or video quality subjective evaluation test to whom various picture coding artifacts or distortions [32–34] are unknown.
Issues associated with the consistency and reliability of subjective test results as reported or reviewed (cf., e.g., [7, 35]) aside, subjective picture quality assessment data collected from observers with limited benchmark experience are deemed not to have a reliable anchor (or reference) point, making the analysis results difficult to interpret or inconclusive [7], if not questionable, and inspiring little confidence in their application to objective quality metric design and optimization. In other words, using observers with minimal knowledge, experience, or expectations in subjective picture quality evaluation tests generates data with a varying reference point or none at all, lowering expectations of what is considered “Excellent” picture quality, and running a real risk of a race to the bottom in the practice of quality assessment, control, and assurance for visual signal communications, broadcast, and entertainment applications.
Second, it has long been acknowledged that human perception and judgment in a psychophysical measurement task usually perform better in comparison tasks than in casting an absolute rating. Nevertheless, an Absolute Category Rating (ACR) or absolute rating scale has been used widely in subjective picture quality evaluations [7, 10–12]. To address the issue regarding fluctuations in subjective test data using absolute rating schemes due to the aforementioned contextual effects and the varying experience and expectations of observers, a Multiple Reference Impairment Scale (MRIS) subjective test method for digital video was
reported in [36], where a five-reference impairment scale (R5 to R1) was used, with the original, xo, serving as the uncorrupted picture reference, xR5 (R5), and reference distorted pictures defined, respectively, as xR4, xR3, xR2, and xR1 according to the perceptibility of their impairments: perceptible but not annoying (R4), slightly annoying (R3), annoying (R2), and very annoying (R1). The observers compared the processed picture xp with the original xR5 to determine if the impairment was perceptible, or, when there was a perceptible distortion, with xRi for i ∈ {1, 2, 3, 4} to rate xp as better than, similar to, or worse than xRi. This approach led to a comparative rating scale based on forced-choice methods, which significantly reduced the deviation in the subjective test data and alleviated the contextual effects. Ideally, following this conventional distortion-detection strategy, each of the reference distortion scales represented by xRi would correspond to JND levels [3, 36] or Visual Distortion Units (VDUs) [29].
Third, an issue not altogether dissociated from the second is the HVS's response under two distinct picture quality assessment conditions: where artifacts and distortions are at visibility sub-threshold or around the threshold (usually found in high-quality pictures) and at supra-threshold (commonly associated with medium- and low-quality pictures). HVS models based on threshold vision tests and detection of the JND have been available and widely adopted in objective picture quality/distortion measures/metrics [19–27]. While picture distortion measures based on these models have been successfully employed in perceptually lossless picture coding [3, 37, 38], applications of JND models to picture processing, coding, and transmission or storage at supra-threshold levels have revealed that the supposition of linearly scaling JND models is not fully supported by various experimental studies [15, 29, 39] and, therefore, further investigations are required to identify, delineate, and better model HVS characteristics/responses under supra-threshold conditions [29, 30, 35, 39, 40]. It was argued in [30] that for assessment of a high-quality picture with distortions near the visibility threshold, the HVS tends to look past the picture and perform a distortion detection task, whilst for evaluation of a low-quality picture with obviously visible distortions of a highly supra-threshold nature, the HVS tends to look past (or to be more forgiving toward) the distortions and look for the content of the picture. This hypothesis is consistent with the HVS behavior revealed by contextual effects.
To cover a wide range of picture quality as perceived or experienced by human observers, HVS modeling and quantitative QoE measurement may need to take account of two, instead of one, assessment strategies which the HVS seems to adopt: distortion detection, which is commonly used in threshold vision tests, and gradation of degradations of image appearance, which is practiced in supra-threshold vision tests [30].
Fourth, it is becoming increasingly clear that the assessment of QoE requires more than the evaluation of picture quality alone, with a need to differentiate the measurement of the perceived resemblance of a picture at hand to the original from that of the usefulness of the picture to an intended task. It was reported in [4] that QoP (perceived Quality of Pictures) on a five-scale ACR and UoP (perceived Utility of Pictures) anchored by the Recognition Threshold (RT, assigned “0”) and the Recognition Equivalence Class (REC, assigned “100”) could be approximated by a nonlinear function, and that QoP did not predict UoP well and vice versa. Detailed performance comparisons of natural scene statistical model [27] and image feature-based QoP and UoP metrics have been reported in [4]. Compared with subjective test methodology and procedures for QoP assessments, which have been standardized over the years [10–12], subjective tests for UoP assessments face further challenges which vary with the criticality of the specific application, from the most demanding to those requiring minimal effort, and require the participation of targeted human observers who have the necessary domain knowledge of the intended applications (e.g., radiologists and radiographers in medical diagnostic imaging) [37].
QoP and QoE assessments are not undertaken just for their own sakes; they are linked closely to visual signal compression and transmission, where Rate–Distortion (R-D) theory is applied for product, system, and service quality optimization [41–43]. From an R-D optimization perspective [44–46], it is widely understood that the use of raw mathematical distortion measures, such as the Mean Squared Error (MSE), does not guarantee visual superiority since the HVS does not compute the MSE [3, 47]. In RpD (Rate-perceptual-Distortion) optimization [48], where perceptual distortion or utility measures matter, the setting of the rate constraint, Rc, in Figure 6.3 is redundant from a perceptual-distortion-controlled coding viewpoint. The perceptual bit-rate constraint, Rpc, makes more sense and delivers a picture quality comparable with JND1. In comparison, Rc is neither sufficient to guarantee a distortion level at JNND (Just-Not-Noticeable Difference) nor necessary to achieve, for example, JND1 in Figure 6.3. By the same token, Dc is ineffective at holding a designated visual quality level appreciable to the HVS, since it can neither guarantee JND2 nor is it necessary to deliver JND3. As the entropy defines the lower bound of the bit rate required for information lossless picture coding [49, 50], the perceptual entropy [48] sets the minimum bit rate required for perceptually lossless picture coding [37, 38]. Similarly, in UoP-regulated picture coding in terms of a utility measure, utility entropy can be defined as the minimum bit rate required to reconstruct a picture and achieve complete feature recognition equivalent to perceptually lossless pictures, including the original, as illustrated in Figure 6.3.
Objective QoP measures or metric designs for the purpose of QoE assessment can be classified based on the model and approach which they use and follow. The perceptual distortion or perceived difference between the reference and the processed visual signal can be formulated by applying the HVS process either to the two signals individually before visually significant signal differences are computed, or to the differences of the two signals in varying forms to weigh up their perceptual contributions to the overall perceptual score [19, 21].
A feature extraction-based approach to picture quality metric design formulates a linear or nonlinear cost function of various distortion measures using features extracted from given reference and processed pictures, considers aspects of the HVS (e.g., the Contrast Sensitivity Function (CSF), luminance adaptation, and spatiotemporal masking effects), and optimizes coefficients to maximize the correlation of the picture quality/distortion estimate with the MOS from subjective test data.
An objective Picture Quality Scale (PQS) was introduced by Miyahara in [51] and further refined in [52]. The design philosophy of the PQS is summarized in [53], which leads to a metric construct consisting of the generation of visually adjusted and/or weighted distortion and distortion feature maps (i.e., images), the computation and normalization of distortion indicators (i.e., measures), decorrelation of the principal perceptual indicators by Principal Decomposition Analysis (PDA), and pooling of the principal distortion indicators with weights determined by multiple regression analysis to fit subjective test data (e.g., MOS) to form the quality estimator (i.e., the PQS in this case). Among the features considered by the PQS are a luminance coding error, considering the contrast sensitivity and brightness sensitivity described by the Weber–Fechner law [15], a perceptible difference normalized as per Weber's law, perceptible blocking artifacts, perceptible correlated errors, and localized errors of high contrast/intensity transitions subject to visual masking. The PDA is used to decorrelate any overlap between these distortion indicators, which are based on feature maps extracted more or less empirically; it is omitted in many later distortion metric implementations and compensated for by the regression (or optimization) process in terms of the least-mean-square error, linear correlation, or some other measure [18].
A similar approach was followed by an early representative video quality assessment metric by ITS [54], (ŝ), leading to the standardized Video Quality Metric (VQM) in the ANSI and ITU-T objective perceptual video quality measurement standards [16, 55]. Seven parameters (including six impairment indicators/measures and one picture quality improvement indicator/measure) are used in linear combination to form the general VQM model, with parameters optimized using the iterative nested least-squares algorithm to fit against a set of subjective training data. The general VQM model was reported in [55] to have performed statistically better than, or at least equivalently to, the other models recommended in [16] in either the 525-line or the 625-line video test.
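The pooling step common to such feature-based metrics can be illustrated with a small Python sketch. The feature values and the plain least-squares solver below are placeholders, not the actual seven VQM parameters or the iterative nested least-squares procedure of [55]:

```python
import numpy as np

# Hypothetical impairment features per test sequence (rows); the real
# VQM uses seven specific parameters -- these columns are placeholders.
features = np.array([
    [0.10, 0.30, 0.05],
    [0.40, 0.20, 0.10],
    [0.80, 0.60, 0.30],
    [0.20, 0.10, 0.02],
    [0.60, 0.50, 0.25],
])
dmos = np.array([0.15, 0.35, 0.90, 0.12, 0.70])  # subjective training data

# Append a constant column and solve an ordinary least-squares fit,
# standing in for the iterative nested least-squares optimization.
X = np.hstack([features, np.ones((features.shape[0], 1))])
weights, *_ = np.linalg.lstsq(X, dmos, rcond=None)
predicted = X @ weights
```

Once the weights are fitted against the subjective training data, the metric reduces to a dot product of extracted features at run time, which is what makes this class of metrics attractive for monitoring applications.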
Various picture distortion or quality metrics designed using this approach rely on extraction of spatial and/or temporal features, notably edge features [52, 55, 87], which are deemed to be visually significant in the perception of picture quality, and a pooling strategy for formulation of an overall distortion measure with parameters optimized by a regression process to fit a set of subjective test data.
The Natural Scene Statistics (NSS) model-based approach to QoP measurement is based on the hypothesis that modeling natural scenes and modeling HVS are dual problems, and QoP can be captured by NSS [27, 56]. Of particular interest are the Structural Similarity Index (SSIM) [57] and its variants, the Visual Information Fidelity (VIF) measure [58] and the texture similarity measure [59], the former two of which have been highly referenced and used in QoP performance benchmarking in recent years, as well as frequently applied to perceptual picture coding design using RpD optimization [3].
Formulation of the SSIM is based on the assumption that structural information perception plays an important role in perceived QoP by the HVS, and that structural distortions due to additive noise, low-pass-filtering-induced blurring, and other coding artifacts affect perceived picture quality more than non-structural distortions such as a change in brightness and contrast, spatial shift or rotation, or a Gamma correction or change [47]. The SSIM replaces pixel-by-pixel comparisons with comparisons of regional statistics [57]. The SSIM for monochrome pictures measures the similarity between a reference/original image, xref[n], and a processed image, xp[n], N1 pixels high and N2 pixels wide, in luminance as approximated by the picture mean intensities, μref and μp, contrast as estimated by the picture standard deviations, σref and σp, and structure as measured by the cross-correlation coefficient between xref[n] and xp[n], σref,p. It is defined as follows [57]:

SSIM(xref, xp) = [l(xref, xp)]^α · [c(xref, xp)]^β · [s(xref, xp)]^γ,   (6.1)

where the luminance, contrast, and structure similarity measures are, respectively,

l(xref, xp) = (2μref μp + al) / (μref² + μp² + al),   (6.2)

c(xref, xp) = (2σref σp + ac) / (σref² + σp² + ac),   (6.3)

and

s(xref, xp) = (σref,p + as) / (σref σp + as);   (6.4)

al, ac, and as are constants to avoid instability, with values selected proportional to the dynamic range of pixel values; α > 0, β > 0, and γ > 0 are parameters to define the relative importance of the three components; and, for the vector index set, N, encompassing all pixel locations of xref[n] and xp[n], with card( · ) denoting the cardinality of a set,

μ = (1 / card(N)) Σ_{n∈N} x[n],   (6.5)

σ = ( (1 / (card(N) − 1)) Σ_{n∈N} (x[n] − μ)² )^{1/2},   (6.6)

and

σref,p = (1 / (card(N) − 1)) Σ_{n∈N} (xref[n] − μref)(xp[n] − μp).   (6.7)

To address the issues with the non-stationary nature of spatial (and temporal) picture and distortion signals, as well as the visual attention of the HVS, the SSIM is applied locally (e.g., to a defined window), leading to the windowed SSIM, SSIMW(xref, xp). Sliding this window across the entire picture pixel by pixel results in a total of M SSIMW values, one for each window position in the picture. The overall SSIM is then computed as the average of the windowed SSIMs, as follows:

SSIM(xref, xp) = (1 / M) Σ_{m=1}^{M} SSIMW(xref, xp; m).   (6.8)
An 11 × 11 circular-symmetric Gaussian weighting function with a standard deviation of 1.5 samples, normalized to a unit sum, was used in [57] for the computation of the mean, standard deviation, and cross-correlation in (6.5)–(6.7), respectively, to avoid blocking artifacts in the SSIM map. The SSIM has been extended to color images [60] and video [61, 90].
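The windowed computation and averaging can be sketched in Python as follows. For brevity, the sketch uses a uniform square window rather than the 11 × 11 Gaussian of [57], and fixes α = β = γ = 1 with as = ac/2, which collapses the three terms into the familiar two-factor form of the index; the constants correspond to (0.01 × 255)² and (0.03 × 255)² for 8-bit data:

```python
import numpy as np

def ssim_window(x_ref, x_p, a_l=6.5025, a_c=58.5225):
    """SSIM over one window (alpha = beta = gamma = 1, a_s = a_c / 2).

    With these exponents the contrast and structure terms combine into
    a single factor, giving the widely used two-factor form.
    """
    mu_r, mu_p = x_ref.mean(), x_p.mean()
    var_r, var_p = x_ref.var(), x_p.var()
    cov = ((x_ref - mu_r) * (x_p - mu_p)).mean()
    lum = (2 * mu_r * mu_p + a_l) / (mu_r**2 + mu_p**2 + a_l)
    cs = (2 * cov + a_c) / (var_r + var_p + a_c)   # contrast * structure
    return lum * cs

def ssim(x_ref, x_p, win=8):
    """Mean of windowed SSIM values over a pixel-by-pixel sliding grid."""
    vals = [ssim_window(x_ref[i:i + win, j:j + win],
                        x_p[i:i + win, j:j + win])
            for i in range(x_ref.shape[0] - win + 1)
            for j in range(x_ref.shape[1] - win + 1)]
    return float(np.mean(vals))
```

An identical pair of images yields an SSIM of 1, and a uniform luminance shift lowers only the luminance term, illustrating the separation of structural from non-structural distortions.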
VIF formulation takes an information-theoretic approach to QoP assessment, where mutual information is used as the measure in formulating the source (natural scene picture statistics) model, distortion model, and HVS “visual distortion” model. As shown in Figure 6.4, a Gaussian Scale Mixture (GSM) model, C = S · U, in the wavelet decomposition domain is used to represent the reference picture. A Random Field (RF), D = G · C + V, models the attenuation, such as blur and contrast changes, through the deterministic gain field G, and the additive noise of the channel and/or coding through V, representing equal perceptual annoyance from the distortion instead of modeling specific image artifacts. All HVS effects are considered as uncertainty and treated as visual distortion, which is modeled as a stationary, zero-mean, additive white Gaussian noise model, N (or N′), corresponding to the reference (or the processed) picture in the wavelet domain, giving E = C + N and F = D + N′. For a selected sub-band b, where b = [s, θ] with level s and orientation θ, in the wavelet transform domain, the VIF measure is defined as

VIF = Σ_b I(C^N[b]; F^N[b] | ξ^N[b]) / Σ_b I(C^N[b]; E^N[b] | ξ^N[b]),   (6.9)

where the mutual information between the reference image and the image perceived by the HVS in the same sub-band b is I(C^N[b]; E^N[b] | ξ^N[b]), with ξ^N[b] being a realization of the N elements of S for a given reference image, that between the processed image and the image perceived by the HVS is I(C^N[b]; F^N[b] | ξ^N[b]), and the sums run over the sub-bands selected for VIF computation.
When there is no distortion, the VIF equals unity. When the VIF is greater than unity, the processed picture is perceptually superior to the reference picture, as may be the case in a visually enhanced picture.
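For a scalar GSM channel, the two mutual-information terms have closed forms, which the following Python sketch uses; the per-band variances, gains, and noise parameters are illustrative inputs rather than values estimated from real wavelet coefficients:

```python
import math

def vif_scalar(s2_ref, gains, sigma_v2, sigma_n2=0.1):
    """Scalar-GSM VIF sketch over a list of sub-bands.

    s2_ref  : per-band reference signal variances (GSM field energy)
    gains   : per-band deterministic gain of the distortion channel
    sigma_v2: per-band additive-noise variance of the distortion channel
    sigma_n2: HVS 'visual distortion' noise variance, shared by the
              reference and processed paths (stationary AWGN assumption)
    """
    info_ref = info_dist = 0.0
    for s2, g, v2 in zip(s2_ref, gains, sigma_v2):
        info_ref += 0.5 * math.log2(1.0 + s2 / sigma_n2)
        info_dist += 0.5 * math.log2(1.0 + g * g * s2 / (v2 + sigma_n2))
    return info_dist / info_ref
```

With a unit gain and no channel noise the ratio is exactly one, matching the no-distortion case; a gain below one (blur) drives the ratio below one, while a noise-free gain above one can push it above one, consistent with the enhancement case noted above.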
The Structural Texture Similarity Metric (STSIM) measures the perceived texture similarity between a reference picture and its processed counterpart to address an issue with the SSIM, which tends to give low similarity values to textures which are perceptually similar. The framework used by the STSIM consists of sub-band decomposition (e.g., using steerable filter banks); computation of a set of statistics including the mean, variance, horizontal and vertical autocorrelations, and cross-band correlation; statistical comparisons; and pooling of scores across statistics, sub-bands, and window positions. More detailed discussions and reviews of various texture similarity metrics can be found in [59].
The HVS model-based approach devises picture quality metrics to simulate human visual perception using a model to characterize low-level vision for picture quality estimation, in terms of spatial vision, temporal vision, color vision, and foveation. Three types of HVS model have emerged, including JND models, multichannel Contrast Gain Control (CGC) models, and supra-threshold models, which have been applied successfully to picture quality assessment and perceptual picture coding design using RpD optimization [3]. The multichannel structure of the HVS decomposes the visual signal into several spatial, temporal, and orientation bands, where masking parameters are determined based on human visual experiments [16, 18].
The HVS cannot perceive all changes in an image/video, nor does it respond to varying changes in a uniform manner [15, 63]. In picture coding, JND threshold detection-based HVS models are reported extensively [19–26, 28] and used in QoP assessment, perceptual quantization for picture coding, and perceptual distortion measures in RpD performance optimization for visual signal processing and transmission services [3].
The JND models reported currently in the literature consider (1) spatial/temporal CSF, which describes the sensitivity of the HVS to each frequency component, as determined by psychophysical experiments; (2) background Luminance Adaptation (LA), which refers to how the contrast sensitivity of the HVS changes as a function of the background luminance; and (3) Contrast Masking (CM), which refers to the masking effect of the HVS in the presence of two or more simultaneous frequency components. The JND model can be represented in either the spatiotemporal domain or the transform/decomposition domain, or both. Examples of JND models are found with CSF, CM, and LA modeling in the DCT domain [64–67], and CSF and CM modeling using sub-band decomposition [68–71]; or in the pixel domain [72], where the key issue is to differentiate edge from textured regions [73].
A general luminance JND model in the sub-band decomposition domain is given by [3, 26]

JNDSD[k, b, j] = Tbase[k, b, j] · Π_℘ ε℘[k, b, j],   (6.10)

where Tbase[k, b, j] is the base visibility threshold at the location k in sub-band b of frame j determined by the spatiotemporal CSF, and ε℘[k, b, j], ℘ ∈ {intra, inter, temp, lum, …}, represents the different elevation factors due to intra-band (intra) masking, inter-band (inter) masking, temporal (temp) masking, luminance (lum) adaptation, and so on. The frame index j is redundant for single-frame images.
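The multiplicative structure of such a model, a base CSF threshold elevated by a product of masking factors, is simple to illustrate; the factor names and numerical values below are hypothetical:

```python
def jnd_threshold(base_csf, elevations):
    """JND as a base CSF visibility threshold times elevation factors.

    `base_csf` plays the role of the base threshold for one coefficient;
    `elevations` maps factor names (intra-/inter-band masking, temporal
    masking, luminance adaptation, ...) to multiplicative elevations.
    """
    t = base_csf
    for factor in elevations.values():
        t *= factor
    return t

# Illustrative values: a base threshold of 0.8 elevated by intra-band
# masking (1.5) and luminance adaptation (1.2).
jnd = jnd_threshold(0.8, {"intra": 1.5, "lum": 1.2})
```

Each elevation factor is typically one or larger, so masking and adaptation can only raise the visibility threshold above its CSF-determined base.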
It is well known that HVS sensitivity reaches its maximum at the fovea over two degrees of the visual angle and decreases toward the peripheral retina, which spans 10–15° of visual angle [74]. While JND accounts for the local response, Visual Attention (VA) models the global response. In the sub-band decomposition domain, Foveated JND (FJND) can be modeled as follows [3]:
FJNDSD[k, b] = JNDSD[k, b] · fm(V[k]),   (6.11)

where JNDSD[k, b] is defined in (6.10), fm(V[k]) denotes the modulatory function determined by V[k], usually taking a smaller value with larger V[k], which denotes the VA estimation corresponding to spatial frequency location k in band b. The JND is a special case of the FJND when VA is not considered and the modulatory function reduces to unity (i.e., fm(V[k]) ≡ 1).
There are two approaches to JND modeling for color pictures (i.e., modeling of the Just-Noticeable Color Difference (JNCD)). Each color component channel can be modeled independently in a similar way to that in which the luminance JND model is formulated. Alternatively, the JNCD can be modeled by a base visibility threshold of distortion for all colors, JNCD00(n), modulated by the masking effect of the non-uniform neighborhood (measured by the variance), represented by εv[n], and a scale function, sg[n], modeling the masking effect induced primarily by local changes of luminance (measured by the gradient of the luminance component), assuming that the CIELAB color space Ξ = {L, a, b} is used and ζ = 1 corresponds to the L component, as follows:

JNCD[n] = JNCD00(n) · εv[n] · sg[n],   (6.12)

where n is the pixel coordinate vector in a pixel domain formulation.
Based on the JND model, a Peak Signal-to-Perceptual-Noise Ratio (PSPNR) was devised in [76] as follows:

PSPNR = 10 log10 ( 255² / { (1 / (N1 N2)) Σ_n ( |xref[n, i, ζ] − xrec[n, i, ζ]| − JNDST[n, i, ζ] )² · δ[n, i, ζ] } ),   (6.13)

where

δ[n, i, ζ] = 1, if |xref[n, i, ζ] − xrec[n, i, ζ]| > JNDST[n, i, ζ], and δ[n, i, ζ] = 0, otherwise,   (6.14)

xref and xrec are the reference and the reconstructed pictures, respectively, and the spatial JND profile combines luminance adaptation and texture masking via a nonlinear additivity model for masking:

JNDp[n] = JNDpL[n] + JNDpT[n] − κ · min{ JNDpL[n], JNDpT[n] }.   (6.16)

In (6.16), the luminance adaptation factor JNDpL[n] at pixel location n can be decided according to the luminance in the pixel neighborhood; the texture masking factor JNDpT[n] can be determined via the weighted average of gradients around n [72] and refined with more detailed signal classification [73]; κ accounts for the overlapping effect between JNDpL and JNDpT, and 0 < κ ≤ 1. For video, the factor JNDp[n] can be multiplied further by an elevation factor, εtemp[n, i], as in

JNDST[n, i] = JNDp[n] · εtemp[n, i],   (6.15)

to account for the temporal masking effect, which is depicted by a convex function of the inter-frame change formulated in (6.17) [76]:

εtemp[n, i] = fτ ( (x[n, i] − x[n, i − 1] + xBG[n, i] − xBG[n, i − 1]) / 2 ),   (6.17)
where x[n, i] denotes the pixel value of the ith frame and xBG[n, i] the average background luminance of the ith frame.
When JNDST[n, i, ζ]|∀ζ ≡ 0 in (6.13), the PSPNR reduces to the PSNR.
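A pixel-domain sketch of the PSPNR computation in Python follows, assuming an 8-bit luminance signal and a precomputed per-pixel JND map; the nonlinear-additivity combination of the two spatial factors is included with an illustrative κ:

```python
import numpy as np

def pspnr(x_ref, x_rec, jnd, peak=255.0):
    """PSPNR sketch: only the error beyond the local JND counts.

    `jnd` is a per-pixel JND map; errors at or below it are
    perceptually invisible and are zeroed before averaging.
    """
    err = np.abs(x_ref.astype(float) - x_rec.astype(float))
    perceptible = np.maximum(err - jnd, 0.0)   # clip sub-threshold error
    mpse = np.mean(perceptible ** 2)           # mean perceptual squared error
    if mpse == 0.0:
        return float("inf")                    # perceptually lossless
    return 10.0 * np.log10(peak ** 2 / mpse)

def jnd_namm(jnd_lum, jnd_tex, kappa=0.3):
    """Nonlinear additivity of luminance and texture factors; the
    overlap term with 0 < kappa <= 1 avoids double counting."""
    return jnd_lum + jnd_tex - kappa * np.minimum(jnd_lum, jnd_tex)
```

When every pixel error stays within the JND map the measure is unbounded, reflecting a perceptually lossless reconstruction, whereas the PSNR would still report a finite value for the same picture.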
A Visual Signal-to-Noise Ratio (VSNR) was devised in [77] using a wavelet-based visual model of masking and summation, and is reported to have low computational complexity and low memory requirements.
Contrast Gain Control (CGC) [78] has been used successfully in various implementations for JND detection, QoP assessment, and perceptual picture coding, in either standalone or embedded forms [18–21, 37, 43, 79–82]. The Picture Quality Rating (PQR), extended from Sarnoff's original Visual Discrimination Model (the JNDmetrix™) [83, 84], is extensively documented in the ITU-T J.144 recommendation and frequently used as a benchmark [16].
An example of the CGC model in the visual decomposition domain used in [82] is described briefly here, for embedding a perceptual distortion measure in the RpD optimization of a standard-compliant coder. As shown in Figure 6.5, it consists of a frequency transform (with the 9/7 filter) [85], CSF weighting, intra-band and inter-orientation masking, detection, and pooling.
Given the Mallat DWT decomposition [86] of an image, x[n, ζ], for ζ ∈ {1, 2, 3}, denoted XDWT[k, b, ζ], where b = [s, θ] defines the decomposition level or scale s ∈ {1, 2, ..., 5}, representing five levels, and the orientation θ ∈ Θ = {θ0|LL band, θ1|LH band, θ2|HL band, θ3|HH band}, representing the isotropic LL band and three orientations, and k = [k1, k2] with k1 and k2 as the row and column spatial frequency indices within the band specified by b, the CGC model for a designated color channel has a masking response function of the form [82]
Rz[k, b] = ρz · Ez[k, b] / (σz + Iz[k, b]),   (6.18)

where ζ is assumed to be 1 (representing the luminance Y component) and is omitted to simplify the mathematical expressions, Ez[k, b] and Iz[k, b] are the excitation and inhibition functions, ρz and σz are the scaling and saturation coefficients, and z ∈ {Θ, ϒ}, with Θ and ϒ specifying the inter-orientation and intra-frequency masking domains, respectively.
The excitation and inhibition functions of the two domains (i.e., z ∈ {Θ, ϒ}) are given as follows:

EΘ[k, b] = |XCSF[k, b]|^{pΘ},   (6.19)

Eϒ[k, b] = |XCSF[k, b]|^{pϒ},   (6.20)

IΘ[k, b] = Σ_{θ′∈Θ\{θ0}} |XCSF[k, [s, θ′]]|^{q},   (6.21)

and

Iϒ[k, b] = (1 / card(Ms(k))) Σ_{k′∈Ms(k)} |XCSF[k′, b]|^{q} + λ²,   (6.22)

where the exponents pz and q represent, respectively, the excitatory and inhibitory nonlinearities and are governed by the condition pz > q > 0 according to [78], Ms(k) is a neighborhood area surrounding XCSF[k, b], whose population depends on the frequency level, s ∈ {1, 2, 3, 4, 5} (from lowest to highest; cf. Figure 6.1(b)), such that card(Ms(k)) = (2s + 1)², and XCSF[k, b] contains the weighted transform coefficients, accounting for the CSF and defined as

XCSF[k, b] = Wδ · XDWT[k, b].   (6.23)

In (6.21), IΘ[k, b] represents the sum of the transformed coefficients spanning all oriented bands. The variation in neighborhood windowing associated with Ms(k) in (6.22) addresses the uneven spatial coverage between different resolution levels in a multi-resolution transform. The spatial variance, λ², in (6.22) has been added to the inhibition process to account for texture masking [74], where

λ² = (1 / card(Lλ)) Σ_{k′∈Lλ} (XCSF[k′, b] − μ)²,

with Lλ denoting the code block and μ the mean of XCSF over Lλ. In (6.23), Wδ, for δ ∈ {LL, 1, 2, ..., 5}, represents six CSF weights, one for each resolution level plus an additional weight for the isotropic (LL) band, with values adopted from [76].
In [82], a simple squared-error (or l2-norm-squared) function is used to detect the visual difference between the visual masking responses of the reference, XRefz[b, k], and processed CSF-weighted DWT coefficients, XProz[b, k], respectively, to form a perceptual distortion measure PDM as
Here, gz is the gain factor associated with inter-orientation (z = Θ) and intra-frequency (z = ϒ) masking. In (6.24), the LL-band distortion, DLL, is computed separately, since the LL band contains a substantial portion of the image energy in the transform domain, exhibiting a higher level of sensitivity to changes than that of all oriented bands at all resolution levels:

DLL = gLL Σk (XProLL[k] − XRefLL[k])².   (6.25)
Here, gLL is a scaling constant, XProLL[k] and XRefLL[k] are, respectively, the processed and reference visually weighted DWT coefficients for the LL band of the lowest resolution level, and the base visibility threshold t[k, b] is as defined in (6.18).
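The squared-error pooling of the masking responses can be sketched as follows. This is an illustrative sketch only: the dictionary layout, domain keys, and gain values are assumptions, not the data structures of [82].

```python
# Sketch of squared-error perceptual distortion pooling: masked responses
# of reference and processed coefficients are compared band by band and
# scaled by per-domain gain factors g_z.  Layout and gains are illustrative.

def pdm(ref_resp, pro_resp, gains):
    """ref_resp / pro_resp: dict mapping masking domain z to a list of
    2-D per-band response arrays; gains: dict mapping z to gain g_z."""
    total = 0.0
    for z, ref_bands in ref_resp.items():
        d = 0.0
        for b, ref_band in enumerate(ref_bands):
            pro_band = pro_resp[z][b]
            for row_r, row_p in zip(ref_band, pro_band):
                for xr, xp in zip(row_r, row_p):
                    d += (xr - xp) ** 2       # squared visual difference
        total += gains[z] * d                 # per-domain gain factor
    return total
```

An identical reference and processed pair yields zero distortion, and the separate LL-band term of the text would simply be one more additive contribution with its own scaling constant.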
A wide range of picture processing and compression applications require cost-effective solutions and, more often than not, belong to the so-called supra-threshold domain, where processing distortions or compression artifacts are visible. A supra-threshold wavelet coefficient quantization experiment reported that the first three visible differences (relative to the original image) are well predicted by an exponential function of sub-band standard deviation, with the regression lines for JND2 and JND3 parallel to that for JND1 [29].
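The parallel-regression-line observation of [29] can be illustrated with a small sketch. The coefficients below are made-up placeholders, not the fitted values reported in [29]; the point is only the functional form: an exponential in the sub-band standard deviation σ with a shared slope and level-dependent intercepts.

```python
import math

# Illustrative sketch of the supra-threshold observation in [29]: the
# quantizer step producing the n-th just-noticeable difference is modelled
# as exp(a_n + b*sigma), with JND2/JND3 regression lines parallel to JND1
# (same slope b, shifted intercepts a_n).  Coefficients are placeholders.

SLOPE = 0.9
INTERCEPTS = {1: 0.5, 2: 1.1, 3: 1.6}   # hypothetical per-JND-level offsets

def jnd_step(sigma, level=1):
    """Predicted quantizer step at which the given JND level is reached."""
    return math.exp(INTERCEPTS[level] + SLOPE * sigma)
```

Parallelism in the log domain means the ratio of steps between two JND levels is constant across sub-band standard deviations.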
The Most Apparent Distortion (MAD) measures supra-threshold distortion using a detection model and an appearance model in the form of [30]

MAD = (Ddetection)^α · (Dappearance)^(1−α),
where Ddetection is the perceived distortion due to visual detection, which is formulated in a similar way to JND models, and Dappearance is a visual appearance-based distortion measure based on changes in log-Gabor statistics such as the standard deviation, skewness, and kurtosis of sub-band coefficients. The weight α is adapted to the severity of the distortion as measured by Ddetection:

α = 1 / (1 + β1 (Ddetection)^β2),
with β1 = 0.467 and β2 = 0.130.
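The adaptive combination above is straightforward to implement. The sketch below follows the weighted geometric mean commonly cited for MAD [30], with the β values from the text; it assumes the two distortion terms have already been computed by the detection and appearance stages.

```python
import math

# MAD-style adaptive combination [30]: a weighted geometric mean of the
# detection-based and appearance-based distortions.  The weight alpha
# shrinks as the detection-stage distortion d_detect grows, shifting
# emphasis toward the appearance term for severely distorted pictures.

BETA1, BETA2 = 0.467, 0.130

def mad_score(d_detect, d_appear):
    alpha = 1.0 / (1.0 + BETA1 * d_detect ** BETA2)
    return d_detect ** alpha * d_appear ** (1.0 - alpha)
```

For near-threshold (high-quality) pictures, d_detect is small, α is close to 1, and the detection model dominates; as distortion grows, the appearance statistics take over, matching the design rationale of MAD.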
In real-time visual communications, broadcasting, and entertainment services, QoE assessment and monitoring tasks face various constraints, such as the availability of full or partial information on reference pictures, computation power, and time. While no-reference picture quality metrics provide feasible solutions [25, 88], these constraints have also prompted investigations into lightweight QoE methods and associated standardization activities. There are at least three identifiable models: the parametric model, the packet-layer model, and the bit-stream-layer model. With very limited information acquired or extracted from the transmission payload, stringent transmission delay constraints, and limited computational resources, these models share a common technique – that is, optimization of perceptual quality or distortion predictors via, for example, regression or algorithms of a similar nature, using ground truth subjective test data (e.g., MOS or DMOS) and optimization criteria such as Pearson linear correlation, Spearman rank-order correlation, outlier ratio, and Root Mean Square Error (RMSE) [18, 25].
Relying on Key Performance Indicators (KPIs) collected by network equipment via statistical analysis, a crude prediction of perceived picture quality or distortion is formulated using (bit) Rate (R) and Packet Loss Rate (PLR), along with side information (e.g., codec type and video resolution), which may be used to assist the adaptation of model parameters to differently coded pictures. Since the bit rate does not correlate well with the MOS data for pictures of varying content, and packet loss occurring at different locations in a bit stream may have significantly different impacts on perceived picture quality [3], the quality estimation accuracy of this model is limited, while the computation required is usually trivial.
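A minimal sketch of such a parametric predictor follows, assuming a linear model in (log R, PLR) fitted by least squares against MOS data; the model form, coefficient names, and training data are illustrative assumptions, not a standardized parametric model.

```python
import math

# Hypothetical parametric-model sketch: predict MOS from bit rate R and
# packet loss rate PLR via MOS_hat = a + b*log(R) + c*PLR, fitted by
# least squares (normal equations).  Form and data are illustrative only.

def fit_parametric(samples):
    """samples: list of (R, PLR, MOS) tuples.  Returns (a, b, c)."""
    X = [[1.0, math.log(r), plr] for r, plr, _ in samples]
    y = [mos for _, _, mos in samples]
    # 3x3 normal equations XtX w = Xty, solved by Gaussian elimination
    xtx = [[sum(X[i][a] * X[i][b] for i in range(len(X))) for b in range(3)]
           for a in range(3)]
    xty = [sum(X[i][a] * y[i] for i in range(len(X))) for a in range(3)]
    for col in range(3):                          # forward elimination
        piv = max(range(col, 3), key=lambda rr: abs(xtx[rr][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for rr in range(col + 1, 3):
            f = xtx[rr][col] / xtx[col][col]
            for cc in range(col, 3):
                xtx[rr][cc] -= f * xtx[col][cc]
            xty[rr] -= f * xty[col]
    w = [0.0, 0.0, 0.0]
    for rr in (2, 1, 0):                          # back substitution
        w[rr] = (xty[rr] - sum(xtx[rr][cc] * w[cc]
                               for cc in range(rr + 1, 3))) / xtx[rr][rr]
    return tuple(w)

def rmse(pred, truth):
    """One of the optimization/validation criteria named in the text."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred))
```

In practice the fitted predictor would be validated against held-out subjective scores using the criteria listed above (Pearson linear correlation, Spearman rank-order correlation, outlier ratio, and RMSE).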
With more information available via packet header analysis, distortions at the picture frame level can be better estimated with information on coding parameters such as frame type and bit rate per frame, frame rate, and position of lost packets, as well as PLR. The temporal complexity of video content can be estimated using ratios between the bit rates of different frame types, which enables temporal pooling for better quality or distortion prediction.
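The frame-type reasoning above can be sketched as follows. Both functions are illustrative assumptions: the P-to-I bit-ratio mapping for temporal complexity and the per-frame-type loss weights are placeholders, not a standardized packet-layer model.

```python
# Illustrative packet-layer sketch: estimate temporal complexity from the
# ratio of average P-frame bits to average I-frame bits, then weight the
# impact of a lost packet by the type of frame it hits.  The weights and
# the ratio-based mapping are assumptions for illustration only.

def temporal_complexity(frame_sizes):
    """frame_sizes: list of (frame_type, bits), frame_type in {'I','P','B'}."""
    i_bits = [b for t, b in frame_sizes if t == 'I']
    p_bits = [b for t, b in frame_sizes if t == 'P']
    if not i_bits or not p_bits:
        return 0.0
    return (sum(p_bits) / len(p_bits)) / (sum(i_bits) / len(i_bits))

def loss_impact(frame_type, complexity):
    """Crude per-loss impact: losses in reference frames (I, P) propagate
    further, and high-motion content (large complexity) amplifies them."""
    base = {'I': 1.0, 'P': 0.6, 'B': 0.2}[frame_type]
    return base * (1.0 + complexity)
```

Summing such per-loss impacts over a measurement window is one simple form of the temporal pooling referred to above.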
By accessing the media payload as well as packet-layer information, the bit-stream-layer model allows picture quality estimation either with or without pixel information [21, 88].
Evaluation of QoP for visual communication, broadcasting, and entertainment services may be conducted at different points between the source and the receiver [6, 9, 25, 88]: for product, system, or service-provider quality control, monitoring, regulation/optimization, and performance benchmarking; for QoP monitoring and regulation/optimization along transmission path(s) within the network (e.g., at nodes) [89]; and for QoP advisory and feedback at the receiver. The suitability of QoP measures based on various models and approaches to software or hardware online or offline performance evaluation depends on the measurement point/location in the encoding, transmission, and decoding chain, the availability of the reference video sequence(s), the obtainable hardware and/or software computing resources, and the computational complexity of the QoP metrics. Table 6.1 shows the feasibility of QoP measures based on various models and approaches for online or offline assessments.
To conclude this chapter, a number of observations can be made with respect to the current state of play in QoE for visual signal compression and transmission.
First, HVS model-based quality metrics have higher computational complexity than feature-driven, NSS-based, or lightweight quality measures, which makes software online solutions to QoE assessment all but impractical with current computing technologies, if not entirely impossible, for most quality monitoring applications. Hardware online solutions have been demonstrated for full-reference quality assessment, albeit with a higher degree of system complexity and considerably more cost compared with alternative approaches.
Second, existing IQA and VQA metrics [16] have demonstrated their ability and success in grading the quality of pictures, corresponding to traditional ACR subjective test data [11, 16]. However, it remains an open challenge whether these metrics can be equally effective in producing accurate and robust values corresponding to JNND, JND1, JND2, etc., respectively, for quality-driven perceptually lossless and/or perceptual quality-regulated coding and transmission applications.
Table 6.1 Feasibility of QoP measures for online or offline assessment. Y: suitable; N: unsuitable; HW: hardware; SW: software; blank: not reported.

| Model | Type of metric | Encoding: coder R-DO, HW | Encoding: coder R-DO, SW | Encoding: coder evaluation, HW | Encoding: coder evaluation, SW | Network nodes, HW | Network nodes, SW | Decoding, HW | Decoding, SW | Computational complexity |
|---|---|---|---|---|---|---|---|---|---|---|
| HVS model | JND model based | Y | Y | N | Y | Y | Y | Y | Y | Moderate to high |
| HVS model | Multichannel model based | Y | Y | N | Y | Y | Y | N | Y | High |
| HVS model | Supra-threshold vision model based | | | | | | | | | Moderate to high |
| Feature | PQS | Y | Y | N | Y | Y | Y | N | Y | Moderate to high |
| Feature | s-hat | Y | Y | N | Y | Y | Y | N | Y | Moderate |
| Feature | VQM | Y | Y | N | Y | Y | Y | N | Y | Moderate |
| NSS model | SSIM | Y | Y | Y | Y | Y | Y | Y | | Moderate |
| NSS model | VIF | Y | Y | Y | Y | Y | Y | | | High |
| NSS model | STSIM | Y | Y | Y | Y | Y | Y | | | Moderate |
| Lightweight | Parametric model | Y | Y | Y | Y | | | | | Low |
| Lightweight | Packet-layer model | Y | Y | Y | Y | | | | | Low |
| Lightweight | Bit-stream-layer model | Y | Y | Y | Y | Y | Y | Y | Y | Low to moderate |
Third, there has been an obvious lack of reports on HVS modeling and perceptual distortion measures which capture 3-D video coding artifacts and distortions for 3-D visual signal coding and transmission applications.
Fourth, there have been very limited investigations into QoE assessment which integrates audio and visual components beyond preliminary work based on human perception and integrated human audiovisual system modeling [6, 7].
Significant theoretical and practical contributions to QoE research and development are required to complete the ongoing transition in audiovisual communications, broadcasting, and entertainment systems and applications from best-effort, rate-driven, technology-centric services to quality-driven, user-centric, quality-assured experiences [3].
H.R. Wu is indebted to all his past and present collaborators and co-authors of joint publications relevant to the subject matter for their invaluable contributions to the material this chapter is sourced from and based on. Special thanks go to Professor W. Lin of Nanyang Technological University, Singapore, Dr. A.R. Reibman of AT&T Research, USA, Professor F. Pereira of Instituto Superior Tecnico-Instituto de Telecomunicacoes, Portugal, Professor S.W. Hemami of Northeastern University, USA, Professor F. Yang of Xidian University, China, Professor S. Wan of Northwestern Polytechnical University, China, Professor L.J. Karam of Arizona State University, USA, Professor K.R. Rao of University of Texas at Arlington, USA, Dr. D.M. Tan of HD2 Technologies Pty Ltd, Australia, Dr. D. Wu of HD2 Technologies Pty Ltd, Australia, Dr. T. Ferguson of Flexera Software, Australia, and Dr. C.J. van den Branden Lambrecht.
DWT: Discrete Wavelet Transform
HVS: Human Visual System
IQA: Image Quality Assessment
MRIS: Multiple Reference Impairment Scale
NSS: Natural Scene Statistics
PDA: Principal Decomposition Analysis
PQS: objective Picture Quality Scale
QoE: Quality of Experience
QoP: perceived Quality of Picture
QoS: Quality of Service
REC: Recognition Equivalence Class
RT: Recognition Threshold
UoP: perceived Utility of Picture
VDU: Visual Distortion Unit
VQA: Video Quality Assessment