Chapter 2. The Theory Behind Digital Images

Images are an essential part of human history. Film-based photography made the creation of images easy—it captures a moment in time by allowing light to pass through a lens and hit film, where an array of minuscule grains of silver-based compounds darkens in response to light intensity.

With the advent of computers, soon came the digitization of photos, initially through the scanning of printed images to digital formats, and then through digital camera prototypes.

Eventually, commercial digital cameras started showing up alongside film-based ones, ultimately replacing them in the public’s eye (and hand). Camera phones also contributed, with most of us now walking around with high-resolution digital cameras in our pockets.

A digital camera is very similar to a film-based one, except instead of silver grains it has a matrix of light sensors to capture light beams. These photosensors then send electronic signals representing the various colors captured to the camera’s processor, which stores the final image in memory as a bitmap—a matrix of pixels—before (usually) converting it to a more compact image format. This kind of image is referred to as a photographic image, or more commonly, a photo.

But that’s not the only way to produce digital images. Humans wielding computers can create images without capturing any light by manipulating graphic creation software, taking screenshots, or many other means. We usually refer to such images as computer-generated images, or CGI.

This chapter will discuss digital images and the theoretical foundations behind them.

Digital Image Basics

In order to properly understand digital images and the various formats discussed throughout this book, you'll need some familiarity with the basic concepts and vocabulary behind them.

We will discuss sampling, colors, entropy coding, and the different types of image compression and formats. If this sounds daunting, fear not. This is essential vocabulary that we need in order to dig deeper and understand how the different image formats work.

Sampling

We learned earlier that digital photographic images are created by capturing light and transforming it into a matrix of pixels. The size of that pixel matrix is what we refer to when discussing the image's dimensions—the number of pixels that compose it, with each pixel representing the color and brightness of a point in the two-dimensional space that is the image.

If we look at light before it is captured, it is a continuous, analog signal. In contrast, a captured image of that light is a discrete, digital signal (see Figure 2-1). The process of converting the analog signal to a digital one involves sampling, in which the values of the analog signal are sampled at a regular frequency, producing a discrete set of values.

Our sampling rate is a tradeoff between fidelity to the original analog signal and the amount of data we need to store and transmit. Sampling plays a significant role in reducing the amount of data digital images contain, enabling their compression. We'll expand on that later on.
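
As a rough illustration of that tradeoff, the following Python sketch samples a continuous signal (a sine wave standing in for analog light intensity) at two different rates:

    import math

    def sample(signal, duration, rate):
        """Sample a continuous signal at `rate` samples per unit of time."""
        count = int(duration * rate)
        return [signal(i / rate) for i in range(count)]

    # A sine wave stands in for the continuous, analog light signal.
    analog = lambda t: math.sin(2 * math.pi * t)

    coarse = sample(analog, duration=1.0, rate=8)   # 8 samples: less data, lower fidelity
    fine = sample(analog, duration=1.0, rate=64)    # 64 samples: more data, higher fidelity
    print(len(coarse), len(fine))                   # 8 64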

Figure 2-1. To the left, a continuous signal; to the right, a sampled discrete signal

Image Data Representation

The simplest way to represent an image is by using a bitmap—a matrix as large as the image’s width and height, where each cell in the matrix represents a single pixel and can contain its color for a color image or just its brightness for a grayscale image (see Figure 2-2). Images that are represented using a bitmap (or a variant of a bitmap) are often referred to as raster images.

Figure 2-2. Each part of the image is composed of discrete pixels, each with its own color
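
To make the bitmap idea concrete, here is a minimal sketch (using NumPy, an assumed but common choice in Python) of a tiny 3×3 grayscale bitmap and a 3×3 color bitmap:

    import numpy as np

    # A 3x3 grayscale bitmap: one brightness value (0-255) per pixel.
    gray = np.array([
        [  0, 128, 255],
        [ 64, 192,  32],
        [255,   0, 128],
    ], dtype=np.uint8)

    # A 3x3 color bitmap: three values (R, G, B) per pixel.
    color = np.zeros((3, 3, 3), dtype=np.uint8)
    color[0, 0] = (255, 0, 0)       # the top-left pixel is pure red

    print(gray.shape, color.shape)  # (3, 3) (3, 3, 3)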

But how do we digitally represent a color? To answer that we need to get familiar with the following topics.

Color Spaces

We’ve seen that a bitmap is a matrix of pixels, and each pixel represents a color. But how do we represent a color using a numeric value?

In order to dive into that, we'll need to take a short detour to review color theory basics. Our eyes are built similarly to the digital camera we discussed earlier, where the role of the photosensitive electronic cells is performed by light-sensitive pigmented biological cells called rods and cones. Rods operate at very low light levels and are essential for vision in dim lighting, but play almost no part in color vision. Cones, on the other hand, operate only when light levels are sufficient, and are responsible for color vision.

Humans have three different types of cones, each responsible for detecting a different light spectrum, and therefore, for seeing a different color. These three different colors are considered primary colors: red, green, and blue. Our eyes use the colors the cones detect (and the colors they don’t detect) to create the rest of the color spectrum that we see.

Additive Versus Subtractive

There are two types of color creation: additive and subtractive. Additive colors are colors that are created by a light source, such as a screen. When a computer needs a screen’s pixel to represent a different color, it adds the primary color required to the colors emitted by that pixel. So, the “starting” color is black (absence of light) and other colors are added until we reach the full spectrum of light, which is white.

Conversely, printed material, paintings, and non-light-emitting physical objects get their colors through a subtractive process. When light from an external source hits these materials, only some light wavelengths are reflected back from the material and hit our eyes, creating colors. Therefore, for physical materials, we often use other primary subtractive colors, which are then mixed to create the full range of colors. In that model, the “starting” color is white (the printed page), and each color we add subtracts light from that, until we reach black when all color is subtracted (see Figure 2-3).

As you can see, there are multiple ways to re-create a sufficient color range from the values of multiple colors. These various ways are called color spaces. Let’s explore some of the common ones.

Figure 2-3. Additive colors created by light versus subtractive colors created by pigments (image taken from Wikipedia)

RGB (red, green, and blue)

RGB is one of the most popular color spaces (or color space families). The main reason for that is that screens, which are additive by nature (they emit light, rather than reflect light from an external light source), use these three primary pixel colors to create the range of visible colors.

The most commonly used RGB color space is sRGB, which is the standard color space for the W3C (World Wide Web Consortium), among other organizations. In many cases, it is assumed to be the color space used for RGB unless otherwise specified. Its gamut (the range of colors that it can represent, or how saturated the colors that it represents can be) is more limited than other RGB color spaces, but it is considered a baseline that all current color screens can produce (see Figure 2-4).

Figure 2-4. The sRGB gamut (image taken from http://bit.ly/2aOUNt9)

CMYK (cyan, magenta, yellow, and key)

CMYK is a subtractive color space most commonly used for printing. The “key” component is simply black. Instead of having three components per pixel as RGB color spaces do, it has four. The reasons for that are print-related practicalities: while in theory we could achieve black in the subtractive model by combining cyan, magenta, and yellow, in practice the resulting black is not “black enough,” takes too long to dry, and is too expensive. Since black printing is quite common, a dedicated black component was added to the color space.
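
There is no single canonical RGB-to-CMYK conversion (real-world conversions go through ICC profiles), but a naive, purely illustrative sketch shows why the key component exists:

    def rgb_to_cmyk(r, g, b):
        """Naive RGB (0-1 floats) to CMYK; real printing pipelines use ICC profiles instead."""
        k = 1 - max(r, g, b)             # the "key" (black) component
        if k == 1:                       # pure black: avoid dividing by zero
            return 0.0, 0.0, 0.0, 1.0
        c = (1 - r - k) / (1 - k)
        m = (1 - g - k) / (1 - k)
        y = (1 - b - k) / (1 - k)
        return c, m, y, k

    print(rgb_to_cmyk(0.0, 0.0, 0.0))    # (0.0, 0.0, 0.0, 1.0): black uses only the K ink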

YCbCr

YCbCr is actually not a color space on its own, but more of a model that can be used to represent gamma-corrected RGB color spaces. The Y stands for gamma-corrected luminance (the brightness of the sum of all colors), Cb stands for the chroma component of the blue color, and Cr stands for the chroma component of the red color (see Figure 2-6).

RGB color spaces can be converted to YCbCr through a fairly simple mathematical formula, shown in Figure 2-5.

Figure 2-5. Formula to convert from RGB to YCbCr
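
Here is a sketch of one common variant of that conversion in code, using the full-range BT.601 coefficients that JPEG uses (the exact constants in Figure 2-5 may differ slightly between YCbCr variants):

    def rgb_to_ycbcr(r, g, b):
        """Full-range BT.601 RGB -> YCbCr, as used by JPEG; values are in the 0-255 range."""
        y  =  0.299 * r + 0.587 * g + 0.114 * b
        cb = -0.169 * r - 0.331 * g + 0.500 * b + 128
        cr =  0.500 * r - 0.419 * g - 0.081 * b + 128
        return y, cb, cr

    print(rgb_to_ycbcr(255, 255, 255))   # roughly (255, 128, 128): maximal luminance, neutral chroma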

One advantage of the YCbCr model over RGB is that it enables us to easily separate the brightness parts of the image data from the color ones. The human eye is more sensitive to brightness changes than it is to color ones, and the YCbCr color model enables us to harness that to our advantage when compressing images. We will touch on that in depth later in the book.

Figure 2-6. French countryside in winter, top to bottom, left to right: full image, Y component, Cb component, and Cr component

YCgCo

YCgCo is conceptually very similar to YCbCr, only with different colors. Y still stands for gamma-corrected luminance, but Cg stands for the green chroma component, and Co stands for the orange chroma component (see Figure 2-7).

YCgCo has a couple of advantages over YCbCr. The RGB⇔YCgCo transformations (shown in Figure 2-8) are mathematically (and computationally) simpler than RGB⇔YCbCr. On top of that, the YCbCr transformation may lose some data in practice due to rounding errors, whereas the YCgCo transformations do not, since their coefficients are powers of 1/2 and therefore “friendlier” to binary floating-point arithmetic.

Figure 2-7. French countryside in winter, top to bottom, left to right: full image, Y component, Cg component, and Co component
Figure 2-8. Formula to convert from RGB to YCgCo (note the use of powers of 1/2, which makes this transformation easy to compute and float-friendly)
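
As a sketch of that transform in code, assuming the common form in which every coefficient is a power of 1/2, the forward and inverse directions look like this:

    def rgb_to_ycgco(r, g, b):
        """RGB -> YCgCo; every coefficient is a power of 1/2, so binary arithmetic stays exact."""
        y  =  0.25 * r + 0.5 * g + 0.25 * b
        cg = -0.25 * r + 0.5 * g - 0.25 * b
        co =  0.50 * r            - 0.50 * b
        return y, cg, co

    def ycgco_to_rgb(y, cg, co):
        """Inverse transform; recovers the original RGB values without rounding loss."""
        tmp = y - cg
        return tmp + co, y + cg, tmp - co

    print(ycgco_to_rgb(*rgb_to_ycgco(10, 200, 30)))  # (10.0, 200.0, 30.0): an exact round trip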

There are many other color spaces and models, but going over all of them is beyond the scope of this book. The aforementioned color models are all we need to know in order to further discuss images on the Web.

Bit depth

Now that we’ve reviewed different color spaces, which can have a different number of components (three for RGB, four for CMYK), let’s address how precise each of the components should be.

A color space is a continuous space, but in practice we want to be able to define coordinates in that space. The unit measuring the precision of these coordinates for each component is called bit depth—the number of bits that you dedicate to each of your color components.

What should that bit depth be? Like everything in computer science, the correct answer is “it depends.”

For most applications, 8 bits per component is sufficient to represent the colors in a precise enough manner. In other cases, especially for high-fidelity photography, more bits per component may be used in order to maintain color fidelity as close to the original as possible.

One more interesting characteristic of human vision is that its sensitivity to light changes is not linear across the range of various colors. Our eyes are significantly more sensitive when light intensity is low (so in darker environments) than they are when light intensity is high. That means that humans notice changes in darker colors far more than they notice changes in lighter colors. To better grasp that, think about how lighting a candle in complete darkness makes a huge difference in our ability to see what’s around us, while lighting the same candle (emitting the same amount of photons) outside on a sunny day makes almost no difference at all.

Cameras capture light differently. The intensity of light that they capture is linearly proportional to the number of photons they receive in the color range that they capture. So, light intensity changes result in corresponding brightness changes, regardless of the initial brightness.

That means that if we represent all color data as captured by our cameras using the same number of bits per pixel, our representation is likely to have too many bits per pixel for the brighter colors and too few for the darker ones. What we really want is to have the maximum number of meaningful, visibly distinct values representing each pixel, for bright as well as dark colors.

A process called gamma correction is designed to bridge that gap between linear color spaces and “perceptually linear” ones, making sure that light changes of the same magnitude are equally noticeable by humans, regardless of initial brightness (see Figure 2-9).

Figure 2-9. A view of the French countryside in winter, gamma-corrected on the left and uncorrected on the right
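
As a minimal sketch, assuming a simple power-law gamma of 2.2 rather than the full piecewise sRGB curve, gamma encoding and decoding look like this:

    def gamma_encode(linear, gamma=2.2):
        """Linear light (0-1) -> gamma-corrected value (0-1); dark values get more of the range."""
        return linear ** (1 / gamma)

    def gamma_decode(encoded, gamma=2.2):
        """Gamma-corrected value (0-1) -> linear light (0-1)."""
        return encoded ** gamma

    # The same linear-light change takes up far more of the encoded range at the
    # dark end than at the bright end, matching human sensitivity.
    print(gamma_encode(0.02) - gamma_encode(0.01))  # ~0.046
    print(gamma_encode(0.91) - gamma_encode(0.90))  # ~0.005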

Encoders and decoders

Image compression, like other types of compression, requires two pieces of software: an encoder that converts the input image into a compressed stream of bytes and a decoder that takes a compressed stream of bytes and converts it back to an image that the computer can display.

This system is sometimes referred to as a codec, which stands for coder/decoder.

When discussing image compression techniques of such a dual system, the main thing to keep in mind is that each compression technique imposes different constraints and considerations on both the encoder and decoder, and we have to make sure that those constraints are realistic.

For example, a theoretical compression technique that requires a lot of processing to be done on the decoder’s end may not be feasible to implement and use in the context of the Web, since decoding on, e.g., phones, would be too slow to provide any practical value to our users.
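
In practice, both the encoder and the decoder are usually hidden behind an image library. A minimal round trip using the Pillow library (with placeholder file names) might look like this:

    from PIL import Image

    # Decode: the PNG decoder turns a compressed byte stream back into a bitmap.
    bitmap = Image.open("photo.png").convert("RGB")

    # Encode: the JPEG encoder turns that bitmap into a (lossy) compressed byte stream.
    bitmap.save("photo.jpg", quality=85)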

Color Profiles

How does the decoder know which color space we referred to when we wrote down our pixels? That’s where International Color Consortium (ICC) profiles, also known as color profiles, come in.

These profiles can be added to our images as metadata and help the decoder accurately convert the colors of each pixel in our image to the equivalent colors in the local display’s “coordinate system.”

If the color profile is missing, the decoder cannot perform this conversion reliably, and as a result, behavior varies. Some browsers will assume that an image with no color profile is in the sRGB color space and will automatically convert it from that space to the local display’s color space. Other browsers will send the image’s pixels to the screen as they are, effectively assuming that the color space the images were encoded in matches the screen’s. That can result in some color distortion, so where color fidelity is important, color profiles are essential for cross-browser color correctness.

On the other hand, adding a color profile can add a non-negligible number of bytes to your image. A good tradeoff is to make sure your images are in the sRGB color space and add a fairly small sRGB color profile to them.

We will discuss how you can manage and control your images’ color profiles more in Chapter 14.

Alpha Channel

We discussed all the possible options we have to represent colors, but we left something out. How do we represent lack of color?

In some cases we want parts of our image to be transparent or translucent, so that our users will see a nonrectangular image, or otherwise will be able to see through the image onto its background.

The representation of the absence of color is called an alpha channel (see Figure 2-10). It can be considered a fourth channel, where a zero value means that the pixel’s three color components are fully transparent, and the maximal value means that they are fully opaque.

Figure 2-10. An image with an alpha channel over different backgrounds; note the different colors of the dice edges (image taken from Wikipedia)
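
When the image is drawn over a background, each pixel is blended according to its alpha value. A minimal sketch of that “source over” blending for a single color channel:

    def blend(foreground, background, alpha):
        """Source-over blending of one channel; alpha goes from 0.0 (transparent) to 1.0 (opaque)."""
        return alpha * foreground + (1 - alpha) * background

    print(blend(255, 0, 0.0))   # 0.0: fully transparent, only the background shows
    print(blend(255, 0, 0.5))   # 127.5: a translucent mix
    print(blend(255, 0, 1.0))   # 255.0: fully opaque, only the image shows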

Frequency Domain

As we now know, we can break our images into three components: one brightness component and two color ones. We can think of each component as a two-dimensional function that represents the value of each pixel in the spatial domain, where the x- and y-axes correspond to the width and height of the image, and the function’s value is the brightness/color value of each pixel (see Figure 2-11).

Figure 2-11. The Y component of an image, plotted as a 2D function

Therefore, we can apply certain mathematical transforms to these functions in order to convert them from the spatial domain into the frequency domain. A frequency-domain representation gives us the frequency at which each pixel value changes rather than the value itself. Conversion to the frequency domain can be interesting, since it enables us to separate high-frequency brightness changes from low-frequency ones.

It turns out that another characteristic of human vision is that we notice high-frequency brightness and color changes significantly less than we notice low-frequency ones. If brightness or color is changing significantly from one pixel to the next and then back again, our eye will tend to “mush” these neighboring pixels into a single area with an overall brightness value that is somewhere in between.
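
A common way to move a block of pixels into the frequency domain is the discrete cosine transform (DCT), the same family of transforms that JPEG applies to 8×8 pixel blocks. A minimal sketch, assuming SciPy is available:

    import numpy as np
    from scipy.fft import dctn

    # An 8x8 block of Y (brightness) values forming a smooth left-to-right gradient.
    block = np.tile(np.linspace(0, 255, 8), (8, 1))

    coefficients = dctn(block, norm="ortho")

    # Most of the energy lands in the low-frequency coefficients (near the top-left
    # corner); the high-frequency ones are close to zero and compress very well.
    print(np.round(coefficients[:2, :4]))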

We will expand on how this works when we talk about JPEGs in Chapter 4.

Image Formats

In the following chapters we will discuss the various image formats that are in common use today. But before we can dive into the details of each format, let’s explore a slightly philosophical question: what is image compression and why is it needed?

Why Image-Specific Compression?

As you may have guessed, image compression is a compression technique targeted specifically at images. While many generic compression techniques exist—such as Gzip, LZW, LZMA, Bzip2, and others—when it comes to raster images, we can often do better. These generic compression algorithms work by looking for repetitions and finding better (read: shorter) ways to represent them.

While that works remarkably well for text and some other types of documents, for most images, it’s not enough. That kind of compression can reduce the number of bytes used by bitmap images that have lots of pixels of exactly the same color right next to one another. While that’s great, most images—especially those representing real-life photography—don’t exhibit these characteristics.
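
A small experiment with a generic compressor (zlib, which implements the deflate algorithm behind gzip) makes the point: a flat, single-color image compresses extremely well, while noisy, photo-like data barely compresses at all.

    import os
    import zlib

    flat = bytes([200]) * 10_000      # 10,000 identical "pixels"
    noisy = os.urandom(10_000)        # random bytes standing in for noise-like photo data

    print(len(zlib.compress(flat)))   # a few dozen bytes
    print(len(zlib.compress(noisy)))  # roughly 10,000 bytes: almost no gain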

So, pretty early on, various image compression techniques and related formats began to form and eventually a few formats were standardized. Many of these image compression techniques use generic compression techniques internally, but do so as part of a larger scheme that maximizes their benefits.

Raster Versus Vector

The distinction between raster and vector images is the first fundamental divide when discussing image formats and compression techniques.

As previously mentioned, a raster image comprises a rectangular matrix called a bitmap. Each value in that matrix represents the color of a certain pixel that the computer can then copy to its graphics memory to paint to the screen.

Unlike raster images, vector images don’t contain the colors of individual pixels. Instead, they contain mathematical instructions that enable the computer to calculate and draw the image on its own.

While vector images can have many advantages over raster images in various scenarios, raster images are more widely applicable. They can be used for both computer-generated graphics as well as real-life photos, whereas vector images can only be efficiently used for the former.

Therefore, throughout the book, unless otherwise specified, we will mostly be referring to raster images, with the main exception being Chapter 6.

Lossy Versus Lossless Formats

Another characteristic that separates the various formats is whether or not they incur a loss of image information as part of the compression process. Many formats perform various “calculated information loss” in order to reduce the eventual file size.

Quite often that loss in image information (and therefore image precision and fidelity to the original) aims to reduce information that is hardly noticed by the human eye, and is based on studies of human vision and its characteristics. Despite that, it’s not unheard of for precision loss to be noticeable, which may be more critical for some applications than others.

Therefore, there are both lossy and lossless image formats, which can answer those two different use cases: image compression that maintains 100% fidelity to the original versus compression that can endure some information loss while gaining compression ratio.

Lossy Versus Lossless Compression

While the formats themselves can be lossy or lossless, there are various examples where images can undergo lossy as well as lossless compression, regardless of the target format. Metadata that is not relevant to the image’s display (where the image was taken, camera type, etc.) can be removed from images, resulting in arguably lossless compression even if the target format is lossy. Similarly, image information can be removed from the image before it is saved as the target format, resulting in lossy compression of a lossless image format.

One exception to that case is that you cannot save an image losslessly in a format that only has a lossy variant. This is because these formats usually apply some degree of loss as part of their encoding process, and that cannot be circumvented.

We will further discuss lossless and lossy compression in Chapter 14.

Prediction

Often, the encoding and decoding processes both include some guess of what a pixel value is likely to be, based on surrounding pixel values, and then the actual pixel value is calculated as the offset from the “expected” color. That way we can often represent the pixel using smaller, more compressible values.
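
A minimal sketch of that idea, using the simplest possible predictor (each pixel is predicted to equal its left neighbor, similar in spirit to PNG’s “Sub” filter):

    def predict_left(row):
        """Replace each pixel with its offset from the previous (left) pixel."""
        previous = 0
        residuals = []
        for pixel in row:
            residuals.append((pixel - previous) % 256)  # keep the residual in byte range
            previous = pixel
        return residuals

    row = [100, 101, 103, 104, 104, 105]  # a smooth gradient of pixel values
    print(predict_left(row))              # [100, 1, 2, 1, 0, 1]: small, highly compressible values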

Entropy Encoding

Entropy encoding is a very common generic compression technique that is used to give the most frequent symbols the shortest representation, so that the entire message is as compact as possible. Entropy coding is often used in image compression to further compress the data after the main image-specific compression steps have been performed.

Since entropy encoding requires us to know what the most frequent symbols are, it typically involves two steps. The first pass gathers statistics regarding the frequency of words in the data, and then creates a dictionary translating those words into symbols from the frequency data. The second pass translates the words into shorter symbols using the previously created dictionary.
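
As a concrete (if simplified) sketch of those two passes, here is a tiny Huffman-style coder built on Python’s standard library; real image formats use more elaborate variants of the same idea:

    import heapq
    from collections import Counter

    def build_codes(data):
        """Pass 1: count symbol frequencies and derive a variable-length code dictionary."""
        heap = [(count, i, {symbol: ""})
                for i, (symbol, count) in enumerate(Counter(data).items())]
        heapq.heapify(heap)
        while len(heap) > 1:
            count1, _, codes1 = heapq.heappop(heap)
            count2, i, codes2 = heapq.heappop(heap)
            merged = {s: "0" + c for s, c in codes1.items()}
            merged.update({s: "1" + c for s, c in codes2.items()})
            heapq.heappush(heap, (count1 + count2, i, merged))
        return heap[0][2]

    def encode(data, codes):
        """Pass 2: translate each symbol into its code using the dictionary from pass 1."""
        return "".join(codes[symbol] for symbol in data)

    data = "aaaaaabbbc"
    codes = build_codes(data)
    print(codes)                # the most frequent symbol ('a') gets the shortest code
    print(encode(data, codes))  # 14 bits instead of the 80 bits of the raw ASCII text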

In some domains, where word frequency is known in advance with a good enough approximation, the first step is skipped and a ready-made frequency-based dictionary is used instead. The result is a potentially slightly larger data stream, but with the advantage of a single-pass algorithm that is faster and possible to perform on the fly.

When you are compressing content using entropy encoding, the dictionary used for the encoding has to be present in the decoder as well. Sending the dictionary data adds a “cost” to entropy encoding that somewhat reduces its benefits.

Other types of entropy encoding permit adaptive encoding, where a single pass over the data is enough. Such encodings count the frequency and assign codes to symbols as they go, but change the code assigned to each symbol as its frequency changes.

Relationship with Video Formats

One important thing to keep in mind about image formats is that they share many aspects with video formats. In a way, video formats are image formats with extra capabilities that enable them to represent intermediary images based on previous full images, at a relatively low cost. That means that inside every video format, there’s also an image format that is used to compress those full images. Many new efforts in the image compression field come from adapting compression techniques from the video compression world, or from taking the still-image encoding parts of video formats (called I-frame encoding) and building an image format on top of them (e.g., WebP, which we will discuss later on).

Comparing Images

Comparing the quality of an image compressed using different settings, different encoders, or different formats is not a trivial task when it comes to lossy compression. Since the goal of lossy image compression is achieving quality loss that, to some extent, flies under most people’s radar, any comparison has to take both the visual quality of the image and its eventual byte size into account.

If you’re trying to compare the quality and size of a single image, you can probably do so by looking at the image output of different encoding processes and trying to “rank” the variants in your head, but that is hardly scalable when you have many images to compare, and it is impossible to automate.

As it turns out, there are multiple algorithms that try to do just that. They give various “scores” when comparing the compressed images to their originals, enabling you to tune your compression to the visual impact compression would have on the image, rather than to arbitrary “quality” settings.

PSNR and MSE

The Peak Signal-to-Noise Ratio (PSNR) is a metric that estimates the amount of error introduced by the compression algorithm. It often uses the mean squared error (MSE) in order to do that. In a nutshell, MSE is the average squared mathematical distance of the pixels in the compressed image from those in the original. PSNR uses the ratio between the maximum possible pixel value and the MSE in order to estimate the impact of compression on the image.
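
A minimal sketch of both calculations for 8-bit images, assuming they are available as NumPy arrays of identical shape:

    import numpy as np

    def mse(original, compressed):
        """Mean squared error between two images."""
        diff = original.astype(np.float64) - compressed.astype(np.float64)
        return np.mean(diff ** 2)

    def psnr(original, compressed, max_value=255.0):
        """Peak signal-to-noise ratio in decibels; higher means less divergence."""
        error = mse(original, compressed)
        if error == 0:
            return float("inf")  # the images are identical
        return 10 * np.log10(max_value ** 2 / error)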

That method works to estimate divergence from the original, but it’s not necessarily tied to the impact of that divergence on the user’s perception of the compressed image. As we’ll see later on, some formats rely on further compressing parts of the image that are less noticeable by the human eye in order to achieve better compression ratios with little perceived quality loss. Unfortunately, PSNR and MSE don’t take that into account, and therefore may be skewed against such formats and techniques.

SSIM

Structural Similarity (SSIM) is a metric that tries to take the image’s structure into account when calculating the errors in the image. It operates under the assumption that human visual perception is adapted to extract structural information, and that deterioration of the structural contents of an image therefore means that it will be perceived as a lower-quality image.

The algorithm estimates structural changes by comparing the intensity and contrast changes between pixel blocks in both the original and compressed image. The larger the intensity and contrast differences are, the more “structural damage” the compressed image’s pixel blocks have sustained.

The result of the algorithm is an average of those differences, providing a score in the range of 0 to 1.

When the result is 1 the compressed image is a perfect replica of the original image, and when it is close to 0 very little structural data remains.

So when using SSIM for compression tuning, you want to aim at values close to 1 for “barely noticeable” compression, and at lower values if you’re willing to compromise image quality for smaller files.

SSIM also has a multiscale variant (MS-SSIM), which takes multiple scales of both images into account when calculating the final score.

There’s also the Structural Dissimilarity metric (DSSIM), which is very similar to SSIM, but has an inverse range, where 0 is the perfect score and 1 means that the compressed image has no resemblance to the original.
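
In practice you rarely implement SSIM yourself. Here is an example using scikit-image (one of several libraries that provide it) to score a compressed grayscale image against its original:

    import numpy as np
    from skimage.metrics import structural_similarity

    def ssim_score(original, compressed):
        """SSIM between two 8-bit grayscale images; 1.0 means a perfect replica."""
        return structural_similarity(original, compressed, data_range=255)

    # A trivial sanity check: an image compared against itself scores 1.0.
    image = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
    print(ssim_score(image, image))  # 1.0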

Butteraugli

Butteraugli is a recent visual comparison metric from Google that aims to be even more accurate than SSIM in predicting perceived image quality. The metric is based on various anatomic and physiological observations related to the human eye structure.

As a result, the algorithm “suppresses” the importance of some colors based on the differences in location and density of different color receptors, calculates frequency domain image errors (putting more weight on low-frequency errors, as they are more visible than high-frequency ones), and then clusters the errors, as multiple errors in the same area of the image are likely to be more visible than a single one.

It is still early days for that metric, and it is mostly tuned for high-quality images, but initial results look promising.

Summary

In this chapter, we went through the basic terms and concepts we use when discussing digital images and the various image formats. In the following chapters we will make good use of this knowledge by diving into the details of what each format does and how.
