Time-Based Media

The term media carries a number of meanings and connotations for most people, ranging from the modern print and broadcast press to compound terms such as multimedia.

Time-based media, from the perspective of the JMF API and Java, is broadly defined as any data that varies in a meaningful manner with respect to time.

Implicitly, time-based media is understood to possess two further properties:

  • The data is intended for presentation (perhaps not immediately but at some, possibly future, stage) to a human being. With current technologies, that is understood to be through vision or hearing.

  • The data is in a digital format. Typically this involves capturing (digitizing) analogue (real-world) data such as from a microphone. Alternatively, the media might inherently be digital such as speech synthesized by a computer.

Thus, time-based media is generally understood to be video, audio, or a combination of both.

Both categories, audio and video, can be subdivided into naturally captured media (for example, from a microphone or video camera) and synthetically produced media (for example, 3D animation sequences). However, the boundaries between natural and synthesized media aren't clear-cut, and they are becoming less so daily. Even naturally captured media is subject to post-capture processing, such as processing to enhance it or to add features that weren't in the original. (The movie industry practice of blue screening to merge matte-painted backgrounds with film of actors recorded in a studio is a classic example of this.) Figure 7.1 shows this breakdown of the types of time-based media. Indeed, the blurring of the distinction between natural and synthetic media is almost a direct result of the fact that after the media is digitized (if that was even necessary), it can be processed in any manner imaginable. In the most general sense of processing, which includes capturing, presenting, and transmitting as well as converting, compressing, and so on, controlling and processing time-based media is exactly what the JMF is intended for.

Figure 7.1. Origins and types of time-based media.


Typical examples of time-based media include TV broadcasts, the data captured from a microphone or video camera attached to a PC, an MP3 file on a hard disk, a video conference across the Internet, and webcasts.

Throughout this and the following chapters, the qualifier time-based might be dropped and the term media used alone. Such usage still indicates time-based media as previously defined, not any other meaning ascribable to the word media.

Time-Based Media on a Computer

In the past decade, there has been a revolution in terms of access to, and in particular generation of, time-based media, most notably digital media. Previously, only large production companies in the form of movie studios, TV and radio stations, and other such specialists were capable of producing high-quality media. Correspondingly, dedicated devices or venues were required to present such media to an audience: a TV set for TV broadcasts, a radio for radio broadcasts, and a cinema for movies.

Changes in computing, both in terms of technology and penetration of daily life, have fundamentally altered the paradigm from production only by specialists and presentation only on dedicated devices to production by anyone (with a computer) and a very versatile presentation option (the computer).

The technological advances that have driven this change include the ongoing and significant increases in both processor power and storage capacity of the PC, together with improvements in networking and telecommunication. These hardware advances have gone hand-in-hand with software developments that have made it possible to harness the greater power afforded by the average PC. Advances in processor power have meant that the various compressed formats used to store video and audio could be processed in real time. Thus, a PC could be used to present and even save the media. Advances in storage capacity, as first witnessed by the advent of the CD-ROM as a standard peripheral, increasingly large (in terms of storage capacity) hard disks, and more recently the DVD (digital versatile disc) have meant that inherently large media files can be stored on a PC. Correspondingly, network advances have meant that it is now possible, and commonplace, to access media stored or generated remotely.

Socially, the computer has transitioned from being seen as a specialist device for computation and calculation to a general-purpose household item with wide applicability in many areas—not the least of which is communication. This is particularly illustrated by the World Wide Web (WWW), on which users not only see it as commonplace to surf the Net (pulling in content from all over the world), but also increasingly expect or demand that the content be dynamic and entertaining—often with time-based media! The JMF was designed, at least in part, with this purpose in mind, and is centrally placed: Java has been a key enabler of the Web, and in particular of Web interactivity, since its earliest days, and JMF further enhances Java's support for the Web.

Bandwidth, Compression, and Codecs

Time-based media, in its raw form suitable for presentation through speakers or on a display, is particularly large—high in bandwidth. That poses a particular challenge in the area of storing (for example, on a hard disk) and transmitting (for example, over a modem) media and introduces the idea of compression.

The following sections on audio and video go into further detail, but for a moment consider the size of a typical three-minute audio track on a music CD, bearing in mind that raw video is even more demanding (often about 100 times more).

The raw audio format is known as PCM (Pulse Code Modulation). CD audio is particularly high quality: It covers the entire range of human hearing, which extends up to about 20kHz (twenty thousand hertz). With such accuracy of representation, most people cannot discern the difference between the original and the stored signal. To achieve that detailed representation, 44,100 samples are taken each second for each of the left and right audio channels. Each sample is 16 bits (two bytes: a range of some 65,536 possible values). That equates to 176,400 bytes or 1,411,200 bits of information per second. For the three-minute piece of music, that amounts to 31,752,000 bytes (over 30 megabytes) of information.

A modern PC's hard disk will soon fill with a few hundred such audio files. More significant and sobering is the transfer rate required to stream that audio data across a network so that it can be played in real time: over one million bits per second. Contrasting that with the 56K modem (peak performance not reaching 56,000 bits per second) that most home users have as their means of connecting to the Internet, it can be seen that compromises are necessary: The required transfer rate exceeds what is possible by a factor of more than 20.
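
The arithmetic is simple enough to check directly. The following fragment is a back-of-the-envelope verification rather than part of any API; all constants are the CD-audio parameters just quoted, and the 56K figure is the modem's nominal peak rather than a measured rate.

    // Recomputes the raw CD-audio figures quoted above; all constants come from the text.
    public class CdAudioBandwidth {
        public static void main(String[] args) {
            int sampleRate = 44100;        // samples per second, per channel
            int channels = 2;              // stereo: left and right
            int bytesPerSample = 2;        // 16 bits per sample

            int bytesPerSecond = sampleRate * channels * bytesPerSample;  // 176,400
            int bitsPerSecond = bytesPerSecond * 8;                       // 1,411,200
            long threeMinuteBytes = (long) bytesPerSecond * 180;          // 31,752,000

            double modemBitsPerSecond = 56000.0;                          // nominal 56K peak
            double shortfall = bitsPerSecond / modemBitsPerSecond;        // roughly 25

            System.out.println("Bytes per second: " + bytesPerSecond);
            System.out.println("Bits per second:  " + bitsPerSecond);
            System.out.println("Three minutes:    " + threeMinuteBytes + " bytes");
            System.out.println("Factor beyond a 56K modem: " + shortfall);
        }
    }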

The need for compression is obvious. Although the particulars of modern compression algorithms are complicated, the fundamentals of all approaches are the same. The media is kept in a compressed format while being stored or transmitted. The media is decompressed only immediately prior to presentation or if required for processing (for example, to add an effect).

The components that perform this task of compression and decompression are known as codecs (COmpression/DECompression) and can work in hardware or software. For both audio and video, there is a range of codecs that vary in their compression capability, the quality of the resulting media, the amount of processing required, and the support they receive from the major companies working in the multimedia arena.

Most codecs are lossy, meaning that they don't perfectly preserve the original media: Some quality of the original media is lost when it is compressed and is thereafter unrecoverable. Although this is unfortunate, appropriate design of the codec can result in some or most of the losses not being perceptible to a human audience. Examples of such losses might be the blurring of straight edges (for example, text) in a video image or the addition of a slight buzz to the audio. Regardless of the undesirability of these losses in quality, no known lossless codec is capable of achieving anywhere near the compression necessary for streaming high-quality audio and video over the network connection of today's typical (home) user.

All codecs employ one or more of the following three general strategies in order to achieve significant compression:

Spatial redundancy— These schemes exploit repetition within the current frame (sample) of data. Although not applicable for audio encoding in which each frame is a single value, significant savings can be made for typical video images. Most images have regions of a solid color—backgrounds such as a blue sky, the beige walls of a house, or individual subject elements such as a white refrigerator or a solid-color shirt. Basically, such schemes can be thought of as recording the recurring color and the region of the image that it ranges over, rather than keeping multiple copies (one for each pixel that composes the solid color block) of the same thing.

Temporal redundancy— These schemes exploit the fact that the difference between successive video frames or successive audio samples is generally small (relative to the size of the frame or sample itself). Rather than transmit or store a completely new frame or sample, only the difference from the previous one needs to be stored or transmitted. For both audio and video, this approach is generally very effective, although there are instances, such as a new scene in a video, in which that isn't true. A strong example of the benefits of such approaches is video of a news anchorperson: Most of the image is static, and only relatively minor changes occur from frame to frame—the anchorperson's head and facial movements. Even far more dynamic video (for example, a football match) still has considerable static (from frame to frame) regions, and significant savings are still achieved. Similarly, most sound—whether speech, music, or noise—is tightly constrained in a temporal sense. Schemes based on temporal encoding pose challenges for non-linear editing. (A frame is defined in terms of its predecessor, but what if that predecessor is removed or, even more challenging, altered?) Such schemes also tend to degrade in compression performance and quality over a long period of time. For both reasons, these schemes periodically (for example, once per second) transmit a completely new frame (known as a key-frame). A minimal code sketch of this frame-differencing idea follows these three strategies.

Features of human perception— The human visual and auditory systems have particular idiosyncrasies that can be exploited. These include non-linearity across the spectrum being perceived as well as more complex phenomena such as masking. Visually, humans distinguish some regions of the color spectrum less keenly, whereas in the auditory domain, human perception strongly emphasizes the lower frequency (deeper) components of a sound at the expense of the higher frequency components. Clever coding schemes can exploit these coarser regions of perception and dedicate fewer resources to their representation. Strategies based on human perception differ fundamentally from the two previous schemes because they are based on subjective rather than objective measures and results.
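
As promised in the discussion of temporal redundancy, here is a minimal sketch of that idea in its crudest form: a frame is stored either as a complete key-frame or as the per-pixel difference from its predecessor. The flat byte-array frame representation and the key-frame interval are assumptions made purely for illustration; a real codec would also entropy-code the differences and combine this with spatial and perceptual techniques.

    // Illustrative only: frames are flat byte arrays, and a full key-frame is emitted
    // periodically so that editing and error recovery remain possible.
    public class FrameDiffSketch {
        static final int KEY_FRAME_INTERVAL = 25;   // for example, one key-frame per second at 25fps

        /** Encodes a frame as either a key-frame or a difference from the previous frame. */
        static byte[] encode(byte[] previous, byte[] current, int frameIndex) {
            if (previous == null || frameIndex % KEY_FRAME_INTERVAL == 0) {
                return current.clone();              // key-frame: the complete image
            }
            byte[] delta = new byte[current.length]; // mostly zeros wherever the image is static
            for (int i = 0; i < current.length; i++) {
                delta[i] = (byte) (current[i] - previous[i]);
            }
            return delta;
        }

        /** Rebuilds a frame from the previously decoded frame and the encoded data. */
        static byte[] decode(byte[] previousDecoded, byte[] encoded, int frameIndex) {
            if (previousDecoded == null || frameIndex % KEY_FRAME_INTERVAL == 0) {
                return encoded.clone();
            }
            byte[] frame = new byte[encoded.length];
            for (int i = 0; i < encoded.length; i++) {
                frame[i] = (byte) (previousDecoded[i] + encoded[i]);
            }
            return frame;
        }
    }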

Figure 7.2 shows the concept of spatial compression, whereas Figure 7.3 shows temporal compression. Figure 7.4 shows the non-linearity of human perception in the auditory domain: the range of human hearing (in Hertz on the horizontal axis) is shown against the perceptually critical bands (bark) found through psychoacoustic experiments. Such known relationships can be exploited by audio compression schemes.

Figure 7.2. Spatial compression opportunities.


Figure 7.3. Temporal compression could be used to record only the differences from the previous frame.


Figure 7.4. Perceptually critical bands (bark) of human hearing matched against the frequency range of human hearing.


Format, Content Type, and Standards

The codec used to encode and decode a media stream defines its format. Thus the format of a media stream describes the actual low-level structure of the stream of data. Examples of formats include Cinepak, H.263, and MPEG-1 in the video domain and Mu-Law, ADPCM, and MPEG-1 in the audio domain.

Sitting atop a media's format, and often confused with it, is what is known as the content type or sometimes the architecture of the media. The content type serves as a kind of superstructure, allowing the specification of the codecs employed as well as other details, such as the file structure of the media as a whole. Examples of content types include such well-known names as AVI, QuickTime, MPEG, and WAV.

As an illustration of the distinction between media format and content type, it is worth noting that most content types support multiple possible formats. Thus the QuickTime content type can employ the Cinepak, H.261, and RGB video formats (among others), whereas the WAV (Wave) content type might hold A-law, DVI ADPCM, or U-Law audio (among others). Hence an alternative model is to see the various content types as media containers; each can hold media in a number of different formats.
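
This distinction is mirrored directly in JMF: formats are represented by javax.media.Format and its subclasses, whereas a content type is represented by a ContentDescriptor (or its FileTypeDescriptor subclass). The fragment below is only a small preview intended to make the split concrete; the particular encodings chosen are arbitrary examples.

    import javax.media.format.AudioFormat;
    import javax.media.format.VideoFormat;
    import javax.media.protocol.FileTypeDescriptor;

    public class FormatVersusContentType {
        public static void main(String[] args) {
            // Formats describe the low-level encoding of an individual stream of data.
            AudioFormat linearPcm = new AudioFormat(AudioFormat.LINEAR, 44100, 16, 2);
            VideoFormat cinepak = new VideoFormat(VideoFormat.CINEPAK);

            // A content type (container) is described separately from the formats it holds.
            FileTypeDescriptor quickTime = new FileTypeDescriptor(FileTypeDescriptor.QUICKTIME);

            System.out.println(linearPcm);  // human-readable description of the audio format
            System.out.println(cinepak);    // human-readable description of the video format
            System.out.println(quickTime);  // the QuickTime content-type name
        }
    }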

An obvious question, given the apparent profusion of formats and content types, is where are the standards? Why are there so many formats and content types, and are they all really necessary?

International standards do exist in the form of the various MPEG versions. (There are currently three, although the latest is known as MPEG-4 because no MPEG-3 standard exists.) MPEG stands for the Moving Picture Experts Group, a joint committee of the ISO (International Organization for Standardization) and the IEC (International Electrotechnical Commission). These standards are of very high quality: well designed and with high compression. However, because of a number of interrelated factors that include commercial interests, differences in technology, historic developments, as well as differing requirements of formats, these standards are yet to dominate the entirety of the audio and video fields.

Perhaps the most important reason that various formats exist is that each is designed with a different purpose in mind. Although some are clearly better than others (particularly older formats) in a number of dimensions, none dominates in all aspects. The most important aspects of differentiation are the degree of compression, the quality of the resulting media, and the processing requirements. These three aspects aren't mutually exclusive, but are competing factors: For instance, higher compression is likely to require greater processing and to result in more loss of quality. Various formats (codecs) weight these factors differently, resulting in formats with diverse strengths and weaknesses. It becomes clear then that there is no single best format; the best can only be defined in terms of the constraints and requirements of a particular application.

On the other hand, the different content types are chiefly attributable to commercial and historical developments. Some content types such as QuickTime and AVI, although now almost cross-platform standards, were traditionally associated with a particular PC platform: the Macintosh in the case of QuickTime and the Windows PC in the case of AVI. The advent of the WWW and of more powerful PCs has seen a second generation of content types, such as RealMedia (RealAudio/RealVideo), which is specifically targeted at streaming media across the Internet.

Tracks and Multiplexing/Demultiplexing

Time-based media often consists of more than one channel of data. Each of these channels is known as a track. Examples include the left and right channels for traditional stereo audio or the audio and video track on an AVI movie. Recent standards, such as the MPEG-4 content type, support the concept of a multitude of tracks composing a single media object.

Each track within a media object has its own format. For instance, the AVI movie could possess a video track in MJPG (Motion JPEG) format and an audio track in ADPCM (Adaptive Differential Pulse Code Modulation) format. The media object, however, has a single content type (in our example, AVI). Such multitrack media are known as multiplexed.
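
This per-track structure is visible directly through JMF. In the sketch below, a Processor is created over a hypothetical AVI file and, once it has reached the Configured state, exposes one TrackControl per track, each reporting its own format. The file name is a placeholder, and the crude polling loop stands in for proper ControllerListener-based state handling.

    import javax.media.Manager;
    import javax.media.MediaLocator;
    import javax.media.Processor;
    import javax.media.control.TrackControl;

    public class InspectTracks {
        public static void main(String[] args) throws Exception {
            // "clip.avi" is a placeholder; any multitrack media the installed plug-ins understand will do.
            Processor p = Manager.createProcessor(new MediaLocator("file:clip.avi"));

            // The Processor must reach the Configured state before its tracks can be examined.
            // Polling is used here only for brevity (and error handling is omitted); a real
            // program would listen for ConfigureCompleteEvent through a ControllerListener.
            p.configure();
            while (p.getState() < Processor.Configured) {
                Thread.sleep(50);
            }

            // One TrackControl per track, each with its own format (for example,
            // an MJPG video track and an ADPCM audio track inside a single AVI).
            for (TrackControl track : p.getTrackControls()) {
                System.out.println(track.getFormat());
            }

            p.close();
        }
    }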

Creation of multiplexed media involves combining multiple tracks of data, a process known as multiplexing. For instance, the audio track captured from a microphone would be multiplexed with the video track captured from a video camera in order to create a movie object. Similarly, the processing of existing media might result in further multiplexing as additional tracks (for instance, a text track of subtitles for a movie) are added to the media.

The corollary operation of separating individual tracks from a multitrack media object is known as demultiplexing. This is necessary prior to presentation of the media so that each track can have the appropriate codec applied for decompression and the resulting raw media sent to the correct output device (for example, speakers for audio track, display for the video track).

If processing of a media object is required, the appropriate tracks would need to be demultiplexed so that they could be treated in isolation, processed (such as to add an effect), and then multiplexed back into the media object. This processing can also result in the generation of new tracks, which then need to be multiplexed into the media object. An example of this might be adding subtitles to a movie: the audio track is demultiplexed and processed automatically by a speech recognizer to generate a transcription as a new track. That new track is then multiplexed back in with the original video and audio.

Figures 7.5, 7.6 and 7.7 show the various roles of multiplexing and demultiplexing in media creation, processing, and presentation.

Figure 7.5. Role of demultiplexer in playback of media.


Figure 7.6. Role of multiplexer in capture of media.


Figure 7.7. Role of demultiplexer and multiplexer in the processing of media.


Streaming

The origins of time-based media on the computer lie in applications where media was stored on devices such as a CD-ROM and played from that local source. These forms of application are still commonplace and important. Just as they were enabled by emerging technologies, such as higher-capacity storage devices in the form of CD-ROMs, so Internet technology (combined with increasing computing power) has led to challenging new areas of application for time-based media.

True streaming (also called real-time streaming) of media is the transfer and presentation, as it arrives, of media from a remote site (in the typical case, across the Internet). Examples of such streaming of media can be found in the numerous webcasts that have proliferated on the Web, including numerous radio stations and national TV broadcasters such as the BBC.

A hybrid form of streaming known as progressive streaming also exists; it is less technically challenging than true streaming and quite common on the Web today. Progressive streaming is employed where it is expected or known that the bandwidth requirements of the media (in order to play in real-time) exceed the available bandwidth for transfer. With progressive streaming, the media is downloaded to your system's hard disk. However, the rate of transfer and the portion downloaded are monitored. When the estimated (based on current transfer rate) time to complete the transfer drops below the time required to play the entire media, play is begun. This ensures that play of the media begins as soon as possible while guaranteeing (as long as the transfer rate doesn't drop) that the presentation will be continuous.
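
The rule just described amounts to a single comparison. The following sketch is illustrative only, assuming a steady transfer-rate estimate and a known clip duration; it is not any particular player's actual algorithm.

    // Start playback once the estimated time to finish the download no longer exceeds
    // the time needed to play the whole clip.
    public class ProgressiveStart {
        static boolean safeToStartPlayback(long totalBytes, long downloadedBytes,
                                           double bytesPerSecond, double clipSeconds) {
            double secondsToFinishDownload = (totalBytes - downloadedBytes) / bytesPerSecond;
            return secondsToFinishDownload <= clipSeconds;
        }

        public static void main(String[] args) {
            // A 30MB clip lasting 180 seconds, arriving at 150KB/s, with 5MB already downloaded:
            // (30000000 - 5000000) / 150000 = 166.7 seconds remaining, so playback can begin.
            System.out.println(safeToStartPlayback(30000000L, 5000000L, 150000, 180));
        }
    }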

In this passive reception aspect, streaming, in its end result, is little different from the already familiar forms of radio and TV. The aspect that really empowers the potential of streaming is that media creation (not just reception) is possible for each user. This enables new levels of communication between users when audio and video can be streamed between sites in real-time. The “killer application” of this technology is the video conference: all participants in the conference stream audio and video of themselves to all other participants while simultaneously receiving and playing or viewing the streams received from other participants. Figure 7.8 shows a typical video conferencing scenario.

Figure 7.8. Typical video conferencing scenario.


Streaming poses considerable technical challenges, many of which still haven't been adequately overcome. Not the least of these challenges is the already discussed bandwidth requirement of time-based media. Streaming across the low-bandwidth connections to the Internet possessed by today's typical user—a 56K modem—can be achieved only by applying the most extreme compression codecs, resulting in severe quality loss (typically a few pixelated or blurred frames per second). The situation is less extreme for audio but still not perfect. The situation is only exacerbated by the fact that simultaneous, bi-directional streaming is required for applications such as video conferencing: both sites transmitting and receiving media simultaneously.

The challenges don't stop at bandwidth limitations but stem more generally from data transmission across a network, typically a Wide Area Network (WAN) such as the Internet. The data that forms the media stream, typically in fixed-size packets, suffers a delay, known as latency, between its transmission and receipt. That latency can and typically does vary between packets as network load and other conditions change. Not only does this pose a problem for the timely presentation of the media, but the latency might also vary so much between packets that some are received out of order, whereas others might simply be lost (never received) or corrupted. Both ends of the media stream, the transmitter and receiver (or source and sink), have no control over these conditions when operating across a network such as the Internet. Transmission using appropriate communication protocols such as RTP (Real-time Transport Protocol) and RTCP (RTP Control Protocol) can aid the monitoring, and hence the detection of and possible compensation for, such network-induced problems. However, it cannot fix them.
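
As a small taste of that protocol support, JMF wraps RTP sessions in the javax.media.rtp package. The fragment below merely opens a session and registers a remote target; the host name and port are placeholders, and actually sending or receiving streams involves listener and DataSource plumbing beyond the scope of this sketch.

    import java.net.InetAddress;
    import javax.media.rtp.RTPManager;
    import javax.media.rtp.SessionAddress;

    public class RtpSessionSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder address and port; in practice these come from the application's signalling.
            InetAddress remote = InetAddress.getByName("media.example.org");
            int port = 42050;   // RTP data port (RTCP conventionally uses port + 1)

            RTPManager manager = RTPManager.newInstance();
            manager.initialize(new SessionAddress(InetAddress.getLocalHost(), port));
            manager.addTarget(new SessionAddress(remote, port));

            // From here, a ReceiveStreamListener would be registered to be told when remote
            // streams arrive, or createSendStream() called to transmit a local DataSource.

            manager.dispose();
        }
    }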
