Ozgur Oyman, Vishwanath Ramamurthi, Utsaw Kumar, Mohamed Rehan and Rana Morsi
Intel Corporation, USA
With the introduction of smartphones like the iPhone™ and Android™-based platforms, the emergence of new tablets like the iPad™, and the continued growth of netbooks, ultrabooks, and laptops, there is an explosion of powerful mobile devices in the market which are capable of displaying high-quality video content. In addition, these devices are capable of supporting various video-streaming applications, interactive video applications like video conferencing, and can capture video for video-sharing, video-blogging, video-Twitter™, and video-broadcasting applications. Cisco predicts that mobile traffic will grow 11-fold by 2018, and that this traffic will be dominated by video (so, by 2018, over 66% of the world's mobile traffic will be video).1 As a result, future wireless networks will need to be optimized for the delivery of a range of video content and video-based applications.
Yet, video communication over mobile broadband networks today is challenging due to limitations in bandwidth and difficulties in maintaining the high reliability, quality, and latency demands imposed by rich multimedia applications. Even with the migration from 3G to 4G networks – or Radio Access Networks (RANs) and backhaul upgrades to 3G networks – the demand on capacity for multimedia traffic will continue to increase. As subscribers take advantage of new multimedia content, applications, and devices, they will consume all available bandwidth and still expect the same quality of service that came with their original service plans – if not better. Such consumer demand requires exploration of new ways to optimize future wireless networks for video services toward delivering higher user capacity to serve more users and also deliver enhanced Quality of Experience (QoE) for a rich set of video applications.
One of the key video-enhancing solutions is adaptive streaming, which is an increasingly promising method to deliver video to end-users, allowing enhancements in QoE and network bandwidth efficiency. Adaptive streaming aims to optimize and adapt the video configurations over time in order to deliver the best possible quality video to the user at any given time, considering changing link or network conditions, device capabilities, and content characteristics. Adaptive streaming is especially effective in better tackling the bandwidth limitations of wireless networks, but also allows for more intelligent video streaming that is device-aware and content-aware.
Most of the expected broad adoption of adaptive streaming will be driven by new deployments over the existing web infrastructure based on the HyperText Transfer Protocol (HTTP) [1], and this kind of streaming is referred to here as HTTP Adaptive Streaming (HAS). HAS follows the pull-based streaming paradigm, rather than the traditional push-based streaming based on stateful protocols such as the Real-Time Streaming Protocol (RTSP) [2], where the server keeps track of the client state and drives the streaming. In contrast, in pull-based streaming such as HAS, the client plays the central role by carrying the intelligence that drives the video adaptation (necessarily so, since HTTP is a stateless protocol). Several important factors have influenced this paradigm shift from traditional push-based streaming to HTTP streaming, including: (i) broad market adoption of HTTP and TCP/IP protocols to support the majority of Internet services offered today; (ii) HTTP-based delivery avoids Network Address Translation (NAT) and firewall traversal issues; (iii) a broad deployment of HTTP-based (non-adaptive) progressive download solutions already exists today, which can conveniently be upgraded to support HAS; and (iv) the ability to use standard/existing HTTP servers and caches instead of specialized streaming servers, allowing for reuse of the existing infrastructure and thereby providing better scalability and cost-effectiveness. Accordingly, the broad deployment of HAS technologies will serve as a major enhancement to (non-adaptive) progressive download methods, allowing for enhanced QoE enabled by intelligent adaptation to different link conditions, device capabilities, and content characteristics.
As a relatively new technology in comparison with traditional push-based adaptive streaming techniques, deployment of HAS techniques presents new challenges and opportunities for content developers, service providers, network operators, and device manufacturers. One such important challenge is developing evaluation methodologies and performance metrics to accurately assess user QoE for HAS services, and effectively utilizing these metrics for service provisioning and optimizing network adaptation. In that vein, this chapter provides an overview of HAS concepts and recent Dynamic Adaptive Streaming over HTTP (DASH) standardization, and reviews the recently adopted QoE metrics and reporting framework in Third-Generation Partnership Project (3GPP) standards. Furthermore, we present an end-to-end QoE evaluation study on HAS conducted over 3GPP LTE networks and conclude with a discussion of future directions and challenges in QoE optimization for HAS services.
HAS has already been spreading as a form of Internet video delivery, with the recent deployment of proprietary solutions such as Apple HTTP Live Streaming, Microsoft Smooth Streaming, and Adobe HTTP Dynamic Streaming.2 In the meantime, the standardization of HAS has also made great progress, with the recent completion of technical specifications by various standards bodies. In particular, DASH has recently been standardized by Moving Picture Experts Group (MPEG) and 3GPP as a converged format for video streaming [1, 2], and the standard has been adopted by other organizations including Digital Living Network Alliance (DLNA), Open IPTV Forum (OIPF), Digital Entertainment Content Ecosystem (DECE), World-Wide Web Consortium (W3C), and Hybrid Broadcast Broadband TV (HbbTV). DASH today is endorsed by an ecosystem of over 50 member companies at the DASH Industry Forum. Going forward, future deployments of HAS are expected to converge through broad adoption of these standardized solutions.
The scope of both MPEG and 3GPP DASH specifications [1, 2] includes a normative definition of a media presentation or manifest format (for DASH access client), a normative definition of the segment formats (for media engine), a normative definition of the delivery protocol used for the delivery of segments, namely HTTP/1.1, and an informative description of how a DASH client may use the provided information to establish a streaming service. This section will provide a technical overview of the key parts of the DASH-based server–client interfaces, which are part of MPEG and 3GPP DASH standards. More comprehensive tutorials on various MPEG and 3GPP DASH features can be found in [3–5].
The DASH framework between a client and web/media server is depicted in Figure 4.1. The media preparation process generates segments that contain different encoded versions of one or several media components of the media content. The segments are then hosted on one or several media origin servers, along with the Media Presentation Description (MPD) that characterizes the structure and features of the media presentation, and provides sufficient information to a client for adaptive streaming of the content by downloading the media segments from the server over HTTP. The MPD describes the various representations of the media components (e.g., bit rates, resolutions, codecs, etc.) and HTTP URLs of the corresponding media segments, timing relationships across the segments, and how they are mapped into media presentations.
The MPD is an XML-based document containing information on the content, based on a hierarchical data model as depicted in Figure 4.2, in which the media presentation is divided into one or more periods. Each period consists of one or more adaptation sets. An adaptation set contains interchangeable/alternate encodings of one or more media content components encapsulated in representations (e.g., an adaptation set for video, one for primary audio, one for secondary audio, one for captions, etc.). In other words, representations encapsulate media streams that are considered to be perceptually equivalent. Typically, dynamic switching happens across representations within one adaptation set. Segment alignment permits non-overlapping decoding and presentation of segments from different representations. Stream Access Points (SAPs) indicate presentation times and positions in segments at which random access and switching can occur. DASH also uses a simplified version of XLink in order to allow loading parts of the MPD (e.g., periods) in real time from a remote location. The MPD can be static or dynamic: a dynamic MPD (e.g., for live presentations) also provides segment availability start and end times, approximate media start time, and the fixed or variable duration of segments; it can change and is periodically reloaded by the client, while a static MPD is valid for the whole presentation. Static MPDs are a good fit for video-on-demand applications, whereas dynamic MPDs are used for live and Personal Video Recorder (PVR) applications.
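As a concrete illustration of this hierarchy, the following Python sketch assembles a minimal static MPD with the standard library. The Period/AdaptationSet/Representation nesting follows the data model described above, while all attribute values (identifiers, bit rates, resolutions, durations) are illustrative assumptions, not those of any real service.

```python
import xml.etree.ElementTree as ET

# Sketch of the MPD hierarchy: MPD -> Period -> AdaptationSet -> Representation.
# Attribute values are illustrative placeholders.
mpd = ET.Element("MPD", type="static", mediaPresentationDuration="PT120S")
period = ET.SubElement(mpd, "Period", id="1")

# One adaptation set for video: interchangeable encodings of the same
# content component at different bit rates and resolutions.
video_set = ET.SubElement(period, "AdaptationSet", mimeType="video/mp4")
for rep_id, bw, w, h in [("v0", "500000", "640", "360"),
                         ("v1", "1500000", "1280", "720"),
                         ("v2", "4000000", "1920", "1080")]:
    ET.SubElement(video_set, "Representation", id=rep_id,
                  bandwidth=bw, width=w, height=h)

# A second adaptation set for the primary audio component.
audio_set = ET.SubElement(period, "AdaptationSet", mimeType="audio/mp4")
ET.SubElement(audio_set, "Representation", id="a0", bandwidth="128000")

xml_text = ET.tostring(mpd, encoding="unicode")
```

A DASH client would parse such a document to discover the available representations and then select among them during the session.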
A DASH segment constitutes the entity body of the response when issuing a HTTP GET or a partial HTTP GET request, and is the minimal individually addressable unit of data. DASH segment formats are defined for the ISO Base Media File Format (BMFF) and the MPEG2 Transport Stream format. A media segment contains media components and is assigned an MPD URL element and a start time in the media presentation. Segment URLs can be provided in the MPD in the form of exact URLs (segment list) or in the form of templates constructed via temporal or numerical indexing of segments. Dynamic construction of URLs is also possible, by combining parts of the URL (base URLs) that appear at different levels of the hierarchy. Each media segment also contains at least one SAP, which is a random access or switch-to point in the media stream where decoding can start using only data from that point forward. An initialization segment contains initialization information for accessing media segments contained in a representation and does not itself contain media data. Index segments, which may appear either as side files or within the media segments, contain timing and random access information, including media time vs. byte range relationships of sub-segments.
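The template mechanism for segment URLs can be illustrated with a short Python sketch. The $RepresentationID$ and $Number$ identifiers are standard DASH template variables (the standard also defines others, such as $Time$ and $Bandwidth$, and printf-style number formatting, which are omitted here); the base URL and naming scheme are hypothetical.

```python
# Expanding a DASH segment URL template via numerical indexing of segments.
# $RepresentationID$ and $Number$ are standard template identifiers; the
# base URL and file-naming scheme below are illustrative assumptions.
def expand_template(template: str, rep_id: str, number: int) -> str:
    return (template.replace("$RepresentationID$", rep_id)
                    .replace("$Number$", str(number)))

base_url = "http://example.com/video/"        # hypothetical origin server
template = "$RepresentationID$/seg-$Number$.m4s"

# URLs for the first three media segments of representation "v1".
urls = [base_url + expand_template(template, "v1", n) for n in range(1, 4)]
```

Combining the expanded template with a base URL from a higher level of the MPD hierarchy mirrors the dynamic URL construction described above.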
DASH gives the client full control of the streaming session (i.e., it can intelligently manage the on-time request and smooth playout of the sequence of segments), potentially adjusting bit rates or other attributes in a seamless manner. The client can automatically choose the initial content rate to match the initial available bandwidth and dynamically switch between different bit-rate representations of the media content as the available bandwidth changes. Hence, DASH allows fast adaptation to changing network and link conditions, user preferences, and device states (e.g., display resolution, CPU, memory resources, etc.). Such dynamic adaptation provides better user QoE, with higher video quality, shorter startup delays, fewer rebuffering events, etc.
At MPEG, DASH was standardized by the Systems Sub-Group, with the activity beginning in 2010, becoming a Draft International Standard in January 2011, and an International Standard in November 2011. The MPEG DASH standard [1] was published as ISO/IEC 23009-1:2012 in April 2012. In addition to the definition of media presentation and segment formats standardized in [1], MPEG has also developed additional specifications [6–8] on aspects of implementation guidelines, conformance and reference software, and segment encryption and authentication. Toward enabling interoperability and conformance, DASH also includes profiles as a set of restrictions on the offered MPD and segments based on the ISO BMFF [9] and MPEG2 Transport Streams [10], as depicted in Figure 4.3. In the meantime, MPEG DASH is codec agnostic and supports both multiplexed and non-multiplexed encoded content. Currently, MPEG is also pursuing several core experiments toward identifying further DASH enhancements, such as signaling of quality information, DASH authentication, server and network-assisted DASH operation, controlling DASH client behavior, and spatial relationship descriptions.
At 3GPP, DASH was standardized by the 3GPP SA4 Working Group, with the activity beginning in April 2009 and Release 9 work with updates to Technical Specification (TS) 26.234 on the Packet Switched Streaming Service (PSS) [11] and TS 26.244 on the 3GPP file format [12] completed in March 2010. During Release 10 development, a new specification TS 26.247 on 3GPP DASH [2] was finalized in June 2011, in which ISO BMFF-based DASH profiles were adopted. In conjunction with a core DASH specification, 3GPP DASH also includes additional system-level aspects, such as codec and Digital Rights Management (DRM) profiles, device capability exchange signaling, and QoE reporting. Since Release 11, 3GPP has been studying further enhancements to DASH and toward this purpose collecting new use cases and requirements, as well as operational and deployment guidelines. Some of the documented use cases in the related Technical Report (TR) 26.938 [13] include: operator control for DASH (e.g., for QoE/QoS handling), advanced support for live services, DASH as a download format for push-based delivery services, enhanced ad insertion support, enhancements for fast startup and advanced trick play modes, improved operation with proxy caches, Multimedia Broadcast and Multicast Service (MBMS)-assisted DASH services with content caching at the User Equipment (UE) [8], handling special content over DASH and enforcing specific client behaviors, and use cases on DASH authentication.
The development of QoE evaluation methodologies, performance metrics, and reporting protocols plays a key role in optimizing the delivery of HAS services. In particular, QoE monitoring and feedback are beneficial for detecting and debugging failures, managing streaming performance, enabling intelligent client adaptation (useful for device manufacturers), and allowing for QoE-aware network adaptation and service provisioning (useful for the network operator and content/service provider). Having recognized these benefits, both the 3GPP and MPEG bodies have adopted QoE metrics for HAS services as part of their DASH specifications. Moreover, the 3GPP DASH specification also provides mechanisms for triggering QoE measurements at the client device as well as protocols and formats for delivery of QoE reports to the network servers. Here, we shall describe in detail the QoE metrics and reporting framework for 3GPP DASH, while it should be understood that MPEG has also standardized similar QoE metrics in MPEG DASH.
In the 3GPP DASH specification TS 26.247, QoE measurement and reporting capability is defined as an optional feature for client devices. However, if a client supports the QoE reporting feature, the DASH standard also mandates the reporting of all the requested metrics at any given time (i.e., the client should be capable of measuring and reporting all of the QoE metrics specified in the standard). It should also be noted here that 3GPP TS 26.247 also specifies QoE measurement and reporting for HTTP-based progressive download services, where the set of QoE metrics in this case is a subset of those provided for DASH.
Figure 4.4 depicts the QoE monitoring and reporting framework specified in 3GPP TS 26.247, summarizes the list of QoE metrics standardized by 3GPP in the specification TS 26.247, and indicates the list of metrics applicable for DASH/HAS (adaptive streaming) and HTTP-based progressive download (non-adaptive). At a high level, the QoE monitoring and reporting framework is composed of the following phases: (1) server activates/triggers QoE reporting, requests a set of QoE metrics to be reported, and configures the QoE reporting framework; (2) client monitors or measures the requested QoE metrics according to the QoE configuration; (3) client reports the measured parameters to a network server. We now discuss each of these phases in the following sub-sections.
3GPP TS 26.247 specifies two options for the activation or triggering of QoE reporting. The first option is via the QualityMetrics element in the MPD and the second option is via the OMA Device Management (DM) QoE management object. In both cases, the trigger message from the server would include reporting configuration information such as the set of QoE metrics to be reported, the URIs for the server(s) to which the QoE reports should be sent, the format of the QoE reports (e.g., uncompressed or gzip), information on QoE reporting frequency and measurement interval, percentage of sessions for which QoE metrics will be reported, and Access Point Name (APN) to be used for establishing the Packet Data Protocol (PDP) context for sending the QoE reports.
The following QoE metrics have been defined in 3GPP DASH specification TS 26.247, to be measured and reported by the client upon activation by the server. It should be noted that these metrics are specific to HAS and content streaming over the HTTP/TCP/IP stack, and therefore differ considerably from QoE metrics for traditional push-based streaming protocols.
In 3GPP DASH, QoE reports are formatted as an eXtensible Markup Language (XML)3 document complying with the XML schema provided in specification TS 26.247. The client uses HTTP POST request signaling (RFC 2616) carrying XML-formatted metadata in its body to send the QoE report to the server.
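As a rough illustration (not the normative schema), the following Python sketch assembles an XML report body of the kind a client would carry in such a POST request. The element and attribute names here are simplified stand-ins; the normative XML schema is given in TS 26.247.

```python
import xml.etree.ElementTree as ET

# Building the XML body of a QoE report. Element and attribute names are
# simplified, hypothetical stand-ins for the schema defined in TS 26.247.
report = ET.Element("QoeReport", clientId="ue-1234", periodId="1")
ET.SubElement(report, "Metric", name="AvgThroughput", value="1250000")
ET.SubElement(report, "Metric", name="InitialPlayoutDelay", value="1.8")
ET.SubElement(report, "Metric", name="BufferLevel", value="12.5")
body = ET.tostring(report, encoding="unicode")

# The client would then send `body` to the configured reporting server in an
# HTTP POST request (e.g., via http.client); the request itself is omitted.
```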
The central intelligence in HAS resides in the client rather than the server. The requested representation levels of video chunks (forming the HAS segments) are determined by the client and communicated to the server. Based on the client buffer levels, the operation of the client in a link-aware adaptive streaming framework can be characterized into four modes or states: (i) startup mode, (ii) transient state, (iii) steady state, and (iv) rebuffering state (see Figure 4.5).
Startup mode is the initial buffering mode, during which the client buffers video frames up to a certain limit before beginning to play back the video (i.e., the client is in the startup mode as long as A_i ⩽ A_thresh^StartUp, where A_i represents the total number of video frames received until frame slot i). Steady state is the state in which the UE buffer level is above a certain threshold (i.e., B_i ⩾ B_thresh^Steady), where B_i tracks the number of frames in the client buffer that are available for playback in frame slot i. The transient state is the state in which the UE buffer level falls below this threshold after playback has begun (i.e., B_i < B_thresh^Steady). The rebuffering state is entered when the buffer level reaches zero after playback has begun. Once it enters the rebuffering state, the client remains in that state until it rebuilds its buffer to a satisfactory level to resume playback (i.e., until B_i ⩾ B_thresh^Rebuff).
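The four-state behavior can be sketched as a small classification function in Python. The threshold values, the function shape, and the flags tracking playback and rebuffering status are illustrative assumptions, not taken from the chapter.

```python
# Classifying the client state from the quantities defined above: a_i is the
# total number of frames received (relevant before playback starts) and b_i
# the number of frames buffered for playback. Thresholds are illustrative.
A_STARTUP_THRESH = 60   # frames to accumulate before playback begins
B_STEADY_THRESH = 120   # buffer level separating transient and steady state
B_REBUFF_THRESH = 30    # buffer level needed to leave the rebuffering state

def client_state(a_i, b_i, playback_started, was_rebuffering):
    if not playback_started:
        return "startup"             # a_i <= A_STARTUP_THRESH holds here
    if was_rebuffering and b_i < B_REBUFF_THRESH:
        return "rebuffering"         # stay until the buffer is rebuilt
    if b_i == 0:
        return "rebuffering"         # buffer ran dry during playback
    return "steady" if b_i >= B_STEADY_THRESH else "transient"
```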
One of the key aspects of adaptive streaming is the estimate of available link bandwidth. A typical throughput estimate is the average segment or HAS throughput, which is defined as the average ratio of segment size to download time of HAS segments:
where S_j(s), T_j^fetch(s), and T_j^dwld(s) are the size, fetch time, and download time of the s-th video segment of client j, s_j^i the number of segments downloaded by client j until frame slot i, and F the number of video segments over which the average is computed. Based on this estimate, the best video representation level possible for the next video segment request is determined as follows:
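A minimal Python sketch of this throughput estimate and the resulting representation choice follows; the function names, the illustrative ladder of bit rates, and the numeric values are assumptions made for the example.

```python
# Average segment (HAS) throughput over the last F downloaded segments, as in
# Eq. (4.1): the mean of the per-segment size/download-time ratios. The
# representation choice then picks the highest level whose bit rate does not
# exceed the estimate. All values below are illustrative.
def avg_segment_throughput(sizes_bits, dwld_times_s, F):
    recent = list(zip(sizes_bits, dwld_times_s))[-F:]
    return sum(s / t for s, t in recent) / len(recent)

def select_representation(levels_bps, est_bps):
    feasible = [r for r in levels_bps if r <= est_bps]
    return max(feasible) if feasible else min(levels_bps)

# Three segments of 4, 5 and 3 Mb downloaded in 2.0, 2.5 and 1.5 s
# each yield 2 Mb/s, so the estimate is 2 Mb/s.
est = avg_segment_throughput([4e6, 5e6, 3e6], [2.0, 2.5, 1.5], F=3)
level = select_representation([500e3, 1.5e6, 4e6], est)
```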
The key QoE metrics of interest are: (i) startup delay, (ii) startup video quality, (iii) overall average video quality, and (iv) rebuffering percentage. Startup delay refers to the amount of time it takes to download the initial frames necessary to begin playback. Average video quality is the average video quality experienced by a user. Startup video quality refers to the average video quality in the startup phase. Rebuffering percentage is the percentage of time the client spends in the rebuffering state. It has been observed that rebuffering is the most annoying impairment for video-streaming users, and hence it is important to keep the rebuffering percentage low by judicious rate adaptation.
Typical HAS algorithms use either application- or transport-layer throughputs (as in Eq. (4.1)) for video rate adaptation [14]. We refer to these approaches as PHY Link Unaware (PLU). However, using higher-layer throughputs alone can have adverse effects on user QoE when the estimated value differs from what the wireless link conditions actually provide – a lower estimate results in lower quality, and a higher estimate can result in rebuffering. These situations typically occur in wireless links due to changes in environmental and/or loading conditions. In [15], a Physical Link-Aware (PLA) approach to adaptive streaming was proposed to improve video QoE in changing wireless conditions. Physical-layer (PHY) goodput, used as a complement to higher-layer throughput estimates, allows us to track radio-link variations at a finer time scale. This opens up the possibility of opportunistic link-aware video rate adaptation so as to improve the QoE of the user. The average PHY-layer goodput at time t is defined as the ratio of the number of bits received during the time period (t − T, t) to the averaging duration T, as follows:
Using PHY goodput for HAS requires collaboration between the application and the physical layers, but it can provide ways to improve various QoE metrics for streaming over wireless using even simple enhancements. Here we describe two simple enhancements for the startup and steady states.
Typical HAS startup algorithms request one video segment every frame slot at the lowest representation level in order to build the playback buffer quickly. This compromises the playback video quality during the startup phase. Link-aware startup can be used to optimize video quality based on wireless link conditions right from the beginning of the streaming session. An incremental quality approach can be used so that the startup delay does not increase beyond satisfactory limits due to quality optimization: the next available video adaptation rate is chosen only if enough bandwidth is available to support it. For this purpose, the ratio δ_i is defined as follows:
This is the ratio of the average PHY goodput to the next possible video representation level. Q_0 is initialized based on historical PHY goodput information before the start of the streaming session:
The representation level for the segment request in frame slot i is then selected as follows:
The next representation level is chosen only when δ_i ⩾ (1 + α), where α > 0 is a parameter that can be chosen depending on how aggressively or conservatively we would like to optimize quality during the startup phase. This condition ensures that the rate adaptation does not fluctuate with small-scale variations in wireless link conditions.
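The incremental startup rule can be sketched in a few lines of Python, assuming an illustrative representation ladder and margin α; the names and numbers are assumptions for the example.

```python
# Incremental link-aware startup: step up to the next representation level
# only when the average PHY goodput covers it with margin alpha, i.e. when
# delta_i = goodput / next_level >= 1 + alpha. Ladder values are illustrative.
LEVELS_BPS = [500e3, 1.5e6, 4e6]

def startup_level(curr_idx, phy_goodput_bps, alpha=0.3):
    if curr_idx + 1 >= len(LEVELS_BPS):
        return curr_idx                       # already at the top level
    delta = phy_goodput_bps / LEVELS_BPS[curr_idx + 1]
    return curr_idx + 1 if delta >= 1 + alpha else curr_idx
```

With 2.5 Mb/s of goodput, δ for the 1.5 Mb/s level is about 1.67 ⩾ 1.3, so the client steps up from the lowest level; the 4 Mb/s level would require at least 5.2 Mb/s of goodput.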
For the following evaluation results, we use Peak Signal-to-Noise Ratio (PSNR) for video quality, although our approach is not restricted to this and other metrics such as Structural Similarity (SSIM) could also be used.
Figure 4.6 shows a comparison of the Cumulative Distribution Functions (CDF) of startup delay and average video quality during the startup phase for PLA and PLU approaches. For the 75-user scenario, PHY link awareness can improve the average startup quality by 2 to 3 dB for more than 90% of users, at the cost of only a slight (tolerable) increase in startup delay. In the 150-user scenario, we see a slightly lower 1 to 2 dB improvement in average video quality for more than 50% of users, with less than 0.25 s degradation in startup delay. These results demonstrate that PLA can enhance the QoE during the startup phase by improving video quality with an almost unnoticeable increase in startup delay.
In the steady-state mode, the buffer level at the client is above a certain level. In traditional steady-state algorithms, the objective is to maintain the buffer level without compromising video quality, typically by requesting one segment's worth of frames per segment duration. However, this can result in rebuffering over fluctuating wireless links. PHY goodput responds quickly to wireless link variations, while segment throughput responds more slowly, so PHY goodput can be used as a complement to segment throughput to aid rate adaptation. When link conditions are good, R_t^phy > R_i^seg, and when link conditions are bad, R_t^phy < R_i^seg. A conservative estimate of the maximum throughput, determined from both the PHY goodput and the segment throughput, helps avoid rebuffering in the steady state. Such a conservative estimate may be obtained as follows:
This approach ensures that (i) when link conditions are bad and the segment throughput cannot follow the variation in link conditions, the PHY goodput lowers the estimate of the link bandwidth used for video rate adaptation, and (ii) when link conditions are good in the steady state, the video quality is as good as with the PLU approach. The constant β in Eq. (4.7) prevents short-term variations in link conditions from changing the rate adaptation. The best possible video representation level in frame slot i, Q_i^sup, is then determined conservatively from R_i^con:
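Since Eq. (4.7) itself is not reproduced here, the following Python sketch shows one plausible realization of such a conservative estimate, consistent with properties (i) and (ii) above; the functional form and the value of β are assumptions, not the exact expression from [15].

```python
# Conservative steady-state bandwidth estimate combining segment throughput
# (slow, smooth) with PHY goodput (fast). This is one plausible realization
# of Eq. (4.7), assumed for illustration: trust the PHY goodput only when it
# drops well below the segment throughput.
def conservative_estimate(r_seg_bps, r_phy_bps, beta=0.8):
    if r_phy_bps < beta * r_seg_bps:
        # Link degraded faster than the segment estimate can track:
        # lower the bandwidth estimate to the PHY goodput.
        return r_phy_bps
    # Otherwise keep the segment-throughput estimate, matching PLU behaviour
    # in good link conditions; beta keeps short-term dips from triggering
    # unnecessary rate switches.
    return r_seg_bps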
Figure 4.7 compares the CDFs of rebuffering percentage and average playback video quality performance using PLA and PLU approaches for 100 and 150 users. In the 100-user scenario, the number of users not experiencing rebuffering improves from around 75% to 92% (a 17% improvement) and the peak rebuffering percentage experienced by any user reduces from around 30% to 13% using the PLA approach. This improvement in rebuffering performance is at the cost of only a slight degradation in video quality (0.6 dB average) compared with the PLU approach for some users. In the highly loaded 150-user scenario, we observe that using the PLA approach we can obtain around a 20% improvement in number of users not experiencing rebuffering (from around 56% to 76%) at the cost of minimal degradation in average video quality by less than 0.5 dB on average for 50% of users. Thus, PLA can enhance the user QoE during video playback by reducing the rebuffering percentage significantly at the cost of a very minor reduction in video quality.
Wireless links are fluctuating by nature. In most cellular wireless networks, the UEs send the Base Station (BS) periodic feedback regarding the quality of the wireless link they are experiencing, in the form of Channel Quality Information (CQI). The CQI sent by the UEs is discretized, thus making the overall channel state m discrete. The BS translates the CQI into a peak rate vector μ^m = (μ_1^m, μ_2^m, ..., μ_J^m), with μ_j^m representing the peak achievable rate of user j in channel state m. For every scheduling resource, the BS has to decide which user to schedule in that resource. Always scheduling the best user would maximize cell throughput but may result in poor fairness, while scheduling resources in a round-robin fashion fails to take advantage of the available wireless link quality information. So, typical resource allocation algorithms in wireless networks seek to optimize the average service rates R = (R_1, R_2, R_3, …, R_J) to users such that a concave utility function H(R) is maximized subject to the capacity (resource) limits of the wireless scenario under consideration, i.e.
where V represents the capacity region of the system. Utility functions of the sum form have attracted the most interest:
where each H_j(R_j) is a strictly concave, continuously differentiable function defined for R_j > 0. The Proportional Fair (PF) and Maximum Throughput (MT) scheduling algorithms are special cases of objective functions of this form, with H_j(R_j) = log(R_j) and H_j(R_j) = R_j, respectively.
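A single scheduling decision of the gradient algorithm for such sum-form utilities allocates the resource to the user maximizing H_j'(R_j) · μ_j^m; for the PF utility this reduces to the familiar μ_j/R_j metric. A minimal sketch, with illustrative rate values:

```python
# One gradient-scheduling decision: pick the user maximizing H_j'(R_j) * mu.
# For PF, H_j(R) = log(R), so H_j'(R) = 1/R and the metric is mu_j / R_j;
# for MT, H_j(R) = R, and the metric is simply mu_j.
def schedule_pf(peak_rates_bps, avg_rates_bps):
    metrics = [mu / max(r, 1e-9)
               for mu, r in zip(peak_rates_bps, avg_rates_bps)]
    return max(range(len(metrics)), key=metrics.__getitem__)

# User 1 has a modest peak rate but a low average served rate, so PF favours
# it over user 0, which has been served well recently.
chosen = schedule_pf([10e6, 4e6, 6e6], [5e6, 1e6, 6e6])
```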
The key objective of a video-aware optimization framework for multi-user resource allocation is to reduce the possibility of rebuffering without interfering with the rate-adaptation decisions taken by the HAS client. To this end, a buffer-level feedback-based scheduling algorithm in the context of HAS was proposed in [10] by modifying the utility function of the PF algorithm to give priority to users with buffer levels lower than a threshold. However, this emergency-type response penalizes other users into rebuffering, especially at high loading conditions, thus decreasing the effectiveness of the algorithm. To overcome this limitation, a video-aware optimization framework that constrains rebuffering was proposed in [16]. In order to avoid rebuffering at a video client, video segments need to be downloaded at a rate that is faster than the playback rate of the video segments. Let Tj(s) be the duration of time taken by user j to download a video segment s and τj(s) be the media duration of the segment. Then, to avoid rebuffering, the following constraint is introduced:
where δ > 0 is a small design parameter to account for variability in wireless network conditions. Segment download time Tj(s) depends on the size of the video segment Sj(s) and the data rates experienced by user j. Sj(s) in turn depends on the video content and representation (adaptation) level that is chosen by the HAS client. The HAS client chooses the representation level for each video segment based on its state and its estimate of the available link bandwidth. Based on all this, we propose a Rebuffering Constrained Resource Allocation (RCRA) framework as follows:
The additional constraints related to rebuffering closely relate the buffer evolution at HAS clients to resource allocation at the base station. Intelligent resource allocation at the BS can help reduce rebuffering in video clients.
Enforcing the rebuffering constraints in Eq. (4.12) in a practical manner requires feedback from HAS clients. Each adaptive streaming user can feed back its media playback buffer level periodically to the BS scheduler, in addition to the normal CQI feedback. The buffer-level feedback can be sent directly over the RAN or, more practically, indirectly through the video server.
Scheduling algorithms for multi-user wireless networks need to make decisions during every scheduling time slot (resource) t in such a way as to lead to a long-term optimal solution. The scheduling time slot for modern wireless networks is typically at much finer granularity than a (video) frame slot. A variant of the gradient scheduling algorithm called the Rebuffering-Aware Gradient Algorithm (RAGA) in [16] can be used to solve the optimization problem in Eq. (4.12) by using a token-based mechanism to enforce the rebuffering constraints. The RAGA scheduling decision in scheduling time slot t when the channel state is m(t) can be summarized as follows:
where R_j(t) is the current moving-average service rate estimate for user j. It is updated every scheduling time slot as in the PF scheduling algorithm, i.e.
where β > 0 is a small parameter that determines the time scale of averaging and μ_j(t) is the service rate of user j in time slot t: μ_j(t) = μ_j^{m(t)} if user j was scheduled in time slot t and μ_j(t) = 0 otherwise. W_j(t) in Eq. (4.13) is a video-aware user token parameter and a_j(t) is a video-aware user time-scale parameter, both of which are updated based on periodic media-buffer-level feedback. These parameters hold the key to enforcing rebuffering constraints at the BS. Such a feedback mechanism has been defined in the DASH standard [1, 2] and is independent of the specific client player implementation. For simplicity, we assume that such client media-buffer-level feedback is available only at the granularity of a frame slot. Therefore, the user-token parameter and user-time-scale parameter are constant within a frame slot, i.e.
Let Bij represent the buffer status feedback in frame slot i in units of media time duration. The difference between buffer levels from frame slot (i − 1) to frame slot i is given by
A positive value of Bi,diffj indicates an effective increase in the media buffer level over the previous reporting duration, and a negative value indicates a decrease. Note that this difference depends on the frame playback and download processes at the HAS client. To avoid rebuffering, we would like the rate of change of the client media buffer level to be greater than a certain positive threshold, i.e.
The media-buffer-aware user-token parameter is updated every frame slot as follows:
The intuitive interpretation of Eq. (4.17) is that if the rate of media buffer change for a certain user is below the threshold, the token parameter is incremented by an amount (δτ − Bdiffj) that reflects the relative penalty for having a buffer change rate below the threshold. This increases its relative scheduling priority compared with other users whose media buffer change rate is higher. Similarly, when the rate of buffer change is above the threshold, the user-token parameter is decreased to offset any previous increase in scheduling priority. Wij is not reduced below zero, reflecting the fact that all users with a consistent buffer rate change greater than the threshold have scheduling priorities as per the standard proportional fair scheduler.
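The token update described in words above can be captured by a single clamped expression; this sketch assumes the increment and the offsetting decrement share the same form (δτ − Bdiffj), with the result floored at zero, which matches the behavior described for Eq. (4.17):

```python
def update_token(W_prev, B_diff, delta_tau):
    """Media-buffer-aware token update (one plausible reading of Eq. (4.17)):
    if the buffer-level change B_diff over the last frame slot falls short of
    the threshold delta_tau, the token grows by the shortfall, raising the
    user's scheduling priority; if it exceeds the threshold, the token shrinks,
    offsetting earlier increases. The token is never reduced below zero."""
    return max(0.0, W_prev + (delta_tau - B_diff))
```

Users whose buffers consistently grow faster than the threshold thus settle at W = 0 and are scheduled with plain PF priority, as the text notes.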
The video-aware parameter aij determines the time scale over which rebuffering constraints are enforced for adaptive streaming users. A larger value of aij implies greater urgency in enforcing the rebuffering constraints for user j. In a HAS scenario, the values of aij can be set to reflect this relative urgency for different users. Therefore, we set aij based on the media buffer level of user j in frame slot i as follows:
where φ is a scaling constant, Bij is the current buffer level in seconds for user j, and BSteadythresh is the threshold for steady-state operation of the HAS video client. If the buffer level Bij for user j is above the threshold, then aij = 1; if it is below the threshold, aij scales up to give relatively higher priority to users with lower buffer levels. This scaling of priorities based on absolute user buffer levels improves the convergence of the algorithm. The user-time-scale parameter aij is set to 0 for non-adaptive streaming users, turning the metric in Eq. (4.13) into a standard PF metric. Note that the parameter Wj(t) is updated based on the rate of media-buffer-level change, while the parameter aj(t) is updated based on the buffer levels themselves. Such an approach adapts user scheduling priorities continuously based on media-buffer-level feedback (rather than reacting only once buffers reach an emergency level) and reduces the rebuffering percentage of users without significantly impacting video quality.
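A minimal sketch of the time-scale parameter follows; the exact scaling used below the threshold is not spelled out above, so the φ·Bthresh/Bij form here is only one plausible choice that grows as the buffer drains and meets aij = 1 continuously at the threshold:

```python
def time_scale(buffer_level, B_thresh, phi=1.0, adaptive=True):
    """Video-aware time-scale parameter a_j: 0 for non-adaptive users
    (reducing the metric to plain PF), 1 once the client buffer is at or
    above the steady-state threshold, and -- one plausible scaling --
    phi * B_thresh / buffer_level below it, so emptier buffers get
    proportionally higher scheduling urgency."""
    if not adaptive:
        return 0.0
    if buffer_level >= B_thresh:
        return 1.0
    return phi * B_thresh / max(buffer_level, 1e-6)  # guard against empty buffer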
Figure 4.8 compares the rebuffering percentage and the Perceived Video Quality (PVQ) of the RAGA resource-allocation algorithm with the standard Proportional Fair (PF), Proportional Fair with Barrier for Frames (PFBF), and Gradient with Minimum Rate (GMR) algorithms in a 100-user scenario. For GMR, we set the minimum rate for each video user to the rate of the lowest representation level of the user's video. PVQ is computed as the difference between the mean and the standard deviation of PSNR; only played-out video frames are considered in its computation. Observe that RAGA has the lowest rebuffering percentage among all the schemes across all users: it reduces both the number of users experiencing rebuffering and the amount of rebuffering they experience. The PVQ under RAGA is better than under PF scheduling for all users. GMR outperforms PF in terms of rebuffering, but it still lags behind RAGA in rebuffering performance owing to its lack of dynamic cooperation with the video clients. Although GMR appears to have marginally better PVQ than RAGA, this comes at a huge cost in increased rebuffering percentages. PFBF performs better than GMR in terms of peak rebuffering percentage but lags behind both PF and GMR in the number of users experiencing rebuffering. PFBF also has better PVQ than all other schemes for some users and worse for others. The disadvantage of PFBF is that it reacts to low buffer levels in an emergency fashion and inadvertently penalizes good users to satisfy users with low buffer levels. RAGA, by contrast, continually adjusts the scheduling priorities of the users based on the rate of change of their media buffer levels, thus improving the QoE of streaming users in terms of reduced rebuffering and balanced PVQ.
As the multicast standard for Long-Term Evolution (LTE), enhanced Multimedia Broadcast Multicast Service (eMBMS) was introduced by 3GPP to facilitate scalable delivery of popular content to multiple users over a cellular network. Delivery of popular YouTube clips, live sports events, news updates, and advertisements, as well as file sharing, are relevant use cases for eMBMS. eMBMS exploits the inherent broadcast nature of the wireless channel to use network bandwidth more efficiently than unicast delivery. For unicast transmissions, retransmissions based on Automatic Repeat Request (ARQ) and/or Hybrid ARQ (HARQ) ensure reliability. For a broadcast transmission, however, implementing ARQ can lead to network congestion, with multiple users requesting different packets. Moreover, different users might lose different packets, and retransmission could mean resending a large chunk of the original content, leading to inefficient use of bandwidth as well as increased latency for some users. Application Layer Forward Error Correction (AL-FEC) is an error-correction mechanism in which redundant data is sent to facilitate recovery of lost packets. For this purpose, Raptor codes [17, 18] were adopted in 3GPP TS 26.346 [19] as the AL-FEC scheme for MBMS delivery. More recently, improvements to the Raptor codes have led to an enhanced code called RaptorQ, which has been specified in RFC 6330 [20] and proposed to 3GPP. Streaming delivery (based on the H.264/AVC video codec and the Real-time Transport Protocol (RTP)) over MBMS was studied in [21].
The following discussion presents the existing standardized framework in TS 26.346 [19] for live streaming of DASH-formatted content over eMBMS. eMBMS-based live video streaming runs over the FLUTE protocol [22] – File Delivery over Unidirectional Transport – which allows files to be transmitted via unidirectional eMBMS bearers. Each video session is delivered as a FLUTE transport object, as depicted in Figure 4.9; transport objects are created as packets arrive. The IPv4/UDP/FLUTE header totals 44 bytes per IP packet. Protection against packet errors can be enabled through the use of AL-FEC. The AL-FEC framework decomposes each file into a number of source blocks of approximately equal size, and each source block is then broken into K source symbols of fixed size T bytes. The Raptor/RaptorQ codes are used to form N encoding symbols from the original K source symbols, where N > K. Both Raptor and RaptorQ are systematic codes, meaning that the original source symbols are transmitted unchanged as the first K encoding symbols. The encoding symbols are then packed into IP packets and sent. At the decoder, the whole source block can be recovered, with very high probability, from any set of slightly more than K received encoding symbols. Detailed comparisons between Raptor and RaptorQ are presented in [23]. The choice of the AL-FEC parameters is made at the Broadcast Multicast Service Center (BMSC): for example, the BMSC has to select the number of source symbols K, the code rate K/N, and the source symbol size T. For a detailed discussion of the pros and cons of these parameter choices, the reader is referred to [24].
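The AL-FEC bookkeeping described above can be sketched as follows; the exact source-block partitioning algorithm of TS 26.346 differs in detail, so this is only an illustration of how K, N, and the block count relate for given T, Kmin, and code rate:

```python
import math

def fec_parameters(file_size_bytes, T=16, K_min=64, code_rate=0.8):
    """Illustrative AL-FEC framing: split a transport object into source
    blocks of roughly equal size, each holding at least K_min source
    symbols of T bytes, then compute how many encoding symbols N a
    systematic Raptor/RaptorQ encoder would emit for code rate K/N.
    Returns (number_of_blocks, K, N)."""
    total_symbols = math.ceil(file_size_bytes / T)
    num_blocks = max(1, total_symbols // K_min)   # blocks of >= K_min symbols
    K = math.ceil(total_symbols / num_blocks)     # source symbols per block
    N = math.ceil(K / code_rate)                  # encoding symbols per block
    return num_blocks, K, N
```

Because the codes are systematic, the first K of the N encoding symbols in each block are the source symbols themselves; only the remaining N − K are repair symbols.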
For a live service, a long encoding wait time is undesirable; however, to ensure good Raptor/RaptorQ performance, a large value of K needs to be chosen. The minimum value of K, Kmin, is therefore an important design parameter: a larger Kmin causes a longer startup delay, whereas a smaller Kmin leads to poorer FEC performance. N encoding symbols are generated from the K source symbols using the AL-FEC (Raptor/RaptorQ) scheme, and IP packets are then formed using these encoding symbols as payloads. Each FLUTE packet is generated from the FLUTE header and a payload containing encoding symbols.
IP packets (RLC-SDUs (Service Data Units)) are mapped into fixed-length RLC-PDUs (Protocol Data Units). A 3GPP RAN1-endorsed two-state Markov model can be used to simulate LTE RLC-PDU losses, as shown in Figure 4.10. A state is classified as good if its packet loss probability is below 10% for the 1% and 5% BLER simulations, or below 40% for the 10% and 20% BLER simulations.
The parameters in the figure are as follows: p is the transition probability from a good state to a bad state; q is the transition probability from a bad state to a good state; pg is the BLER in a good state; pb is the BLER in a bad state. It can be seen that the RAN model described above does not capture the coverage aspect of a cell, since it is the same for all users. For a more comprehensive end-to-end analysis, the following model can be used.
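The two-state model can be simulated directly; this sketch (a standard Gilbert-Elliott loss generator, with our own function name) produces a per-PDU loss pattern from the four parameters just defined:

```python
import random

def simulate_losses(n_pdus, p, q, p_g, p_b, seed=0):
    """Two-state (Gilbert-Elliott) Markov loss model for RLC-PDUs:
    p = P(good -> bad), q = P(bad -> good), p_g / p_b = BLER in the
    good / bad state. Returns a list of booleans (True = PDU lost)."""
    rng = random.Random(seed)
    good = True
    losses = []
    for _ in range(n_pdus):
        bler = p_g if good else p_b
        losses.append(rng.random() < bler)
        # state transition for the next PDU
        if good:
            good = rng.random() >= p
        else:
            good = rng.random() < q
    return losses
```

The long-run fraction of time spent in the bad state is p/(p + q), so the average loss rate converges to (q·p_g + p·p_b)/(p + q).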
Instead of using a Markov model for all the users as above, a separate Markov model for each user in a cell can be used [24]. The received SINR data for each user is then used to generate a Multicast Broadcast Single-Frequency Network (MBSFN) sub-frame loss pattern. Such data can be collected for different MCS (Modulation and Coding Scheme) values. Using the sub-frame loss pattern for a given MCS, separate Markov models can be generated for each user in a cell. Note that this model is not fundamentally different from the RAN-endorsed model, but it accounts for the varying BLER distribution across users in a cellular environment. The BLER distribution depends on the specific deployment models and assumptions and could be different subject to different coverage statistics.
The performance bounds for eMBMS can be evaluated under different conditions. The bearer bit rate is assumed to be 1.0656 Mbits/s. Publicly available video traces can be used for video traffic modeling (http://trace.eas.asu.edu). Video traces are files mainly containing video-frame time stamps, frame types (e.g., I, P, or B), encoded frame sizes (in bits), and frame qualities (e.g., PSNR) in a Group of Pictures (GoP) structure. The length of an RLC-SDU is taken as 10 ms. The content length is set at 17,000 frames for each video trace, and the video frame rate is taken to be 30 frames/s. The video frames are used to generate source blocks, and encoding symbols are generated using the AL-FEC framework (Raptor or RaptorQ). The system-level simulations offer useful insights into the effect of system-level and AL-FEC parameters on the overall QoE.
Different QoE metrics can be considered for multimedia delivery to mobile devices. In the case of file download or on-demand streaming of stored content, there is an initial startup delay after which video playback begins; QoE can then be measured by the initial startup delay and the fraction of time for which rebuffering occurs. For eMBMS live streaming, the main contribution to startup delay is the AL-FEC encoding delay, i.e., the time the service provider has to wait for a sufficient number of frames to be generated to ensure a source block large enough for efficient AL-FEC operation. The source symbol size is chosen as T = 16 bytes; it is kept small in order to decrease the initial startup delay, so that a larger value of K can be chosen for the same source block.
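A rough back-of-the-envelope estimate of that encoding delay, under the simplifying assumption that media arrives at roughly the bearer rate, might look like this (the real delay depends on the instantaneous video bitrate):

```python
def encoding_startup_delay(K_min, T=16, bearer_rate_bps=1.0656e6):
    """Rough estimate of the AL-FEC encoding contribution to startup
    delay for live eMBMS: the sender must buffer K_min source symbols
    of T bytes before a source block can be encoded, which at a media
    rate close to the bearer rate takes about K_min * T * 8 / rate seconds."""
    return K_min * T * 8 / bearer_rate_bps
```

This makes the trade-off explicit: the delay grows linearly in both Kmin and T, which is why T is kept small so that K can be large for the same delay budget.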
The average startup delay (averaged over code rates K/N = 0.6, 0.7, 0.8, 0.9) is plotted in Figure 4.11 as a function of Kmin. As expected, the startup delay increases with increasing Kmin. The average PSNR of the received video stream is calculated using the offset trace file used for the simulations. When a frame is lost, the client tries to conceal the loss by repeating the last successfully received frame. The rebuffering percentage is defined as the fraction of time for which video playback is stalled on the mobile device. For live streaming, rebuffering occurs whenever two or more consecutive frames are lost: the client repeats the last successfully received frame, the video appears stalled to the user, and playback resumes as soon as a subsequent frame is received successfully. The empirical Cumulative Distribution Functions (CDFs) of the PSNR and the rebuffering percentage for code rates 0.9 and 0.8 are shown in Figures 4.12 and 4.13, respectively; Kmin is fixed at 64. For detailed simulation parameters and algorithms, refer to [24]. It can be observed that strengthening the code (i.e., lowering the code rate K/N) improves coverage from a QoE perspective, as it guarantees better PSNR and rebuffering behavior for more users.
One of the most common problems associated with video streaming is the clients' unawareness of server and network conditions. Clients usually issue requests based on their perceived bandwidth, unaware of the server's status, which comprises factors such as
Thus, clients tend to request segments belonging to the representations with the highest possible bit rates based on their own perception, regardless of the server's condition. This behavior often causes clients to compete for the available bandwidth and overload the server. As a result, clients can encounter playback stalls and pauses, which deteriorate QoE. Figure 4.14 shows a typical example of multiple clients streaming simultaneously. Initially, with only one streaming client, the available bandwidth for content streaming is high and the client gets the best possible quality for its bandwidth. As more clients join the streaming process, clients start to compete for the bandwidth and consequently QoE drops. Greedy clients tend to consume network bandwidth and stream at higher quality, leaving the remaining clients to suffer much lower QoE.
Existing load-balancing algorithms blindly distribute bandwidth equally among streaming clients. However, equal bandwidth sharing is not always the best strategy, since equal bandwidth does not imply equal QoE. For example, fast or complex-motion content, as in soccer or action movies, typically requires more bandwidth than low-motion content, such as a newscast, in order to achieve the same quality.
In our proposed solution, both the clients and the server share additional information through a feedback mechanism. Such information includes
The clients notify the server of their perceived QoE so far, in the form of statistics regarding the client's average requested bit rate, average Mean Opinion Score (MOS), number of buffering events, etc. Other quality metrics can also be used. The server uses this client information to perform quality-based load balancing.
The server in return advises each client of the bandwidth limit it can request; in other words, the server can notify each client which DASH representations may be requested at any given time. This is achieved by sending the clients a special binary code, the Available Representation Code (ARC). The ARC includes one bit per representation, the Representation Access Bit (RAB), which can be either 0 or 1. The rightmost bit in the ARC corresponds to the representation with the highest bit rate, while the leftmost bit corresponds to the representation with the lowest bit rate.
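The ARC mechanics can be illustrated with a small sketch (the function names are ours; the bit ordering follows the convention above, with the rightmost bit mapping to the highest-bit-rate representation):

```python
def make_arc(enabled_flags):
    """Build the ARC string from per-representation flags ordered from
    lowest to highest bit rate; the rightmost character then corresponds
    to the highest-bit-rate representation."""
    return ''.join('1' if f else '0' for f in enabled_flags)

def allowed_bitrates(arc, bitrates_desc):
    """Given an ARC string and the representation bit rates sorted from
    highest to lowest, return the bit rates the client may request."""
    # reversed(arc): rightmost RAB first, i.e. highest bit rate first
    return [r for bit, r in zip(reversed(arc), bitrates_desc) if bit == '1']
```

For the five representations of Table 4.1, the ARC "11110" disables only the 1000 kbits/s representation, and "11000" leaves only the 400 and 300 kbits/s representations available.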
As the server's upload rate fluctuates, the server starts limiting the representations available to the clients. It deactivates representations in such a manner that, at any point in time, the maximum total bit rate requested by all clients does not exceed the server's upload rate. By defining such limits, the server is at less risk of being overloaded, and hence there are fewer delays in content transfer, leading to higher QoE on the streaming clients' side. The selection of which representations to enable or disable is determined by server-side algorithms.
Different outcomes regarding the collective QoE of the streaming clients can be achieved depending on the algorithm selected for representation (de-)activation. In scenarios where the server gets overloaded with requests, limiting the representations available to clients can be useful in different ways. In the following sub-sections, two load-balancing approaches based on our server-assisted feedback will be described. These approaches are:
This algorithm's main focus is to minimize significant per-user drops in quality. In other words, higher priority is given to users who would experience a bigger quality gap if a representation were deactivated. The approach uses iterative steps to select the representation to be disabled or enabled when the server's upload rate changes or the clients' requests exceed the maximum server upload rate. The procedure can be summarized as follows:
As an example, Table 4.1 lists the typical quality changes experienced by two clients streaming different contents at different bit rates. Table 4.2 lists the corresponding outcome per iteration when the maximum upload rate changes from 2000 kbits/s to 1200 kbits/s.
Table 4.1 Average PSNR per representation (bit rate) for two different contents
i | Ri (kbits/s) | Client 1 PSNR | Client 1 ΔPSNRi (Rmax − Ri) | Client 2 PSNR | Client 2 ΔPSNRi (Rmax − Ri)
0 | 1000 | 46.13 | — | 37.97 | — |
1 | 800 | 43.70 | 2.43 | 37.49 | 0.48 |
2 | 600 | 40.57 | 5.56 | 36.80 | 1.17 |
3 | 400 | 36.58 | 9.55 | 34.49 | 3.48 |
4 | 300 | 32.14 | 13.99 | 32.55 | 5.42 |
Table 4.2 Tracing table for Algorithm 1 in case available bandwidth drops from 2000 kbits/s to 1200 kbits/s
Iteration number | Client 1 Ri/bit rate | Client 1 ΔPSNRi | Client 1 ARC | Client 2 Ri/bit rate | Client 2 ΔPSNRi | Client 2 ARC | Total bandwidth (kbits/s)
1 | R0/1000 | 0 | 11111 | R0/1000 | 0 | 11111 | 2000
2 | R1/800 | 2.43 | 11111 | R1/800 | 0.48 | 11110 | 1800 |
3 | R1/800 | 2.43 | 11111 | R2/600 | 1.17 | 11100 | 1600 |
4 | R1/800 | 2.43 | 11110 | R3/400 | 3.48 | 11100 | 1400 |
5 | R2/600 | 5.56 | 11110 | R3/400 | 3.48 | 11000 | 1200 |
In Table 4.1:
The action performed at each iteration can be explained as follows:
Since the server is no longer overloaded, clients are at less risk of buffering stalls. In our approach, we use PSNR as the balancing criterion; PSNR values are pre-calculated for each DASH segment and stored in the DASH MPD. Other criteria, such as MOS, can also be used with the same approach.
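The greedy selection described above (and traced in Table 4.2) can be sketched as follows, using as the penalty the cumulative ΔPSNR relative to each client's best representation, which is what the ΔPSNRi (Rmax − Ri) column of Table 4.1 tabulates:

```python
def min_quality_reduction(clients, budget_kbps):
    """Sketch of the minimum-quality-reduction balancing. Each client is
    a list of (bitrate_kbps, psnr) pairs sorted from highest to lowest
    bit rate. While the sum of the highest enabled bit rates exceeds the
    server's upload budget, deactivate the top representation of the
    client that would suffer the smallest cumulative PSNR drop relative
    to its best representation. Returns the final top bit rates."""
    top = [0] * len(clients)                  # index of highest enabled rep
    while sum(c[t][0] for c, t in zip(clients, top)) > budget_kbps:
        def penalty(j):
            c, t = clients[j], top[j]
            if t + 1 >= len(c):
                return float('inf')           # nothing left to deactivate
            return c[0][1] - c[t + 1][1]      # cumulative drop from best rep
        j = min(range(len(clients)), key=penalty)
        top[j] += 1
    return [c[t][0] for c, t in zip(clients, top)]
```

With the values in Table 4.1 and a 1200 kbits/s budget, this returns [800, 400], matching the final iteration in Table 4.2.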
The minimum quality reduction approach mainly exploits feedback sent from the server to clients but not the other way around. The approach discussed in this section exploits two-way feedback.
In this algorithm, the main focus is to bring all clients to approximately the same average quality. The iterative procedure is as follows:
Go to step 4(c).
Else, remove the client from the list of candidates and go to step 4(b).
Deactivate a representation from MaxClient. If the bandwidth saved as a result of the deactivation suffices to enable a representation for MinClient, a representation is activated. Go to step 4(c).
Using the same values as in Table 4.1, an illustrative example is shown in Table 4.3 where the server's upload rate is also set to 1200 kbits/s.
Table 4.3 Tracing results for Algorithm 2 in case the available bandwidth drops from 2000 kbits/s to 1200 kbits/s
Iteration number | Client 1 max Ri/bit rate | Client 1 PSNRi | Client 1 ARC | Client 2 max Ri/bit rate | Client 2 PSNRi | Client 2 ARC | Total bandwidth (kbits/s)
1 | R0/1000 | 46.13 | 11111 | R0/1000 | 37.97 | 11111 | 2000
2 | R1/800 | 43.70 | 11110 | R0/1000 | 37.97 | 11111 | 1800 |
3 | R2/600 | 40.57 | 11100 | R0/1000 | 37.97 | 11111 | 1600 |
4 | R3/400 | 36.58 | 11000 | R0/1000 | 37.97 | 11111 | 1400 |
5 | R3/400 | 36.58 | 11000 | R1/800 | 37.49 | 11110 | 1200 |
The details of each iteration step can be explained as follows:
The algorithm continues until the sum of the highest bit rates permissible for each client no longer exceeds the server's upload rate. Such an algorithm ensures that all clients stream at almost the same average quality.
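A simplified sketch of this equal-quality balancing (omitting the MinClient re-activation step of the full algorithm) reproduces the final state of Table 4.3: at each iteration the client currently enjoying the highest quality (MaxClient) loses its top representation.

```python
def equalize_quality(clients, budget_kbps):
    """Sketch of the quality-equalizing approach. clients: per client, a
    list of (bitrate_kbps, psnr) pairs sorted from highest to lowest bit
    rate. While the sum of the highest permitted bit rates exceeds the
    budget, deactivate the top representation of the client with the
    highest current PSNR. Assumes the budget is feasible. Returns the
    final (bitrate, psnr) pair for each client."""
    top = [0] * len(clients)
    while sum(c[t][0] for c, t in zip(clients, top)) > budget_kbps:
        # clients that still have a lower representation to fall back to
        candidates = [j for j in range(len(clients)) if top[j] + 1 < len(clients[j])]
        j = max(candidates, key=lambda j: clients[j][top[j]][1])  # MaxClient
        top[j] += 1
    return [(c[t][0], c[t][1]) for c, t in zip(clients, top)]
```

With the Table 4.1 values and a 1200 kbits/s budget, this leaves client 1 at 400 kbits/s (PSNR 36.58) and client 2 at 800 kbits/s (PSNR 37.49), i.e., nearly equal quality, as in Table 4.3.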
Experimental results have verified that the use of server-assisted feedback approaches results in:
At the same time, there was little or no perceivable quality loss, since clients – when aware of the server load condition – tended to request lower-quality segments to avoid buffering events or long stalls.
We have given an overview of the latest DASH standardization activities at MPEG and 3GPP and reviewed a number of research vectors that we are pursuing with regard to optimizing DASH delivery over wireless networks. We believe that this is an area with a rich set of research opportunities and that further work could be conducted in the following domains.
3GPP  Third-Generation Partnership Project
AL-FEC  Application Layer Forward Error Correction
ARQ  Automatic Repeat Request
AVC  Advanced Video Coding
BLER  Block Error Rate
BMSC  Broadcast Multicast Service Center
BS  Base Station
CDF  Cumulative Distribution Function
CQI  Channel Quality Information
DASH  Dynamic Adaptive Streaming over HTTP
DECE  Digital Entertainment Content Ecosystem
DLNA  Digital Living Network Alliance
DM  Device Management
DRM  Digital Rights Management
eMBMS  Enhanced MBMS
FLUTE  File Delivery over Unidirectional Transport
GMR  Gradient with Minimum Rate
HARQ  Hybrid ARQ
HAS  HTTP Adaptive Streaming
HbbTV  Hybrid Broadcast Broadband TV
HTTP  Hypertext Transfer Protocol
IETF  Internet Engineering Task Force
IP  Internet Protocol
IPTV  IP Television
ISOBMFF  ISO Base Media File Format
LTE  Long-Term Evolution
MBMS  Multimedia Broadcast and Multicast Service
MBSFN  Multicast Broadcast Single-Frequency Network
MCS  Modulation and Coding Scheme
MOS  Mean Opinion Score
MPD  Media Presentation Description
MPEG  Moving Picture Experts Group
NAT  Network Address Translation
OIPF  Open IPTV Forum
OMA  Open Mobile Alliance
PCC  Policy Charging and Control
PDP  Packet Data Protocol
PF  Proportional Fair
PFBF  Proportional Fair with Barrier for Frames
PSNR  Peak Signal-to-Noise Ratio
PSS  Packet-Switched Streaming Service
PVQ  Perceived Video Quality
PVR  Personal Video Recorder
QoE  Quality of Experience
QoS  Quality of Service
RAN  Radio Access Network
RLC  Radio Link Control
RTP  Real-time Transport Protocol
RTSP  Real-Time Streaming Protocol
SDU  Service Data Unit
SINR  Signal-to-Interference-and-Noise Ratio
SSIM  Structural Similarity
TCP  Transmission Control Protocol
TR  Technical Report
TS  Technical Specification
UE  User Equipment
URL  Uniform Resource Locator
Wi-Fi  Wireless Fidelity
WLAN  Wireless Local Area Network
WWAN  Wireless Wide Area Network
W3C  World Wide Web Consortium
XLink  XML Linking Language
XML  Extensible Markup Language