As already mentioned in the introduction, visual saliency tries to describe the visual quality of certain objects or items that allows them to grab our immediate attention. Our brains constantly drive our gaze towards the important regions of the visual scene, as if shining a flashlight on different sub-regions of the visual world. This allows us to quickly scan our surroundings for interesting objects and events while neglecting the less important parts.

It is thought that this is an evolutionary strategy to deal with the constant information overflow that comes with living in a visually rich environment. For example, if you take a casual walk through a jungle, you want to be able to notice the attacking tiger in the bush to your left before admiring the intricate color pattern on the butterfly's wings in front of you. As a result, the visually salient objects have the remarkable quality of popping out of their surroundings, much like the target bars in the following figure:

The visual quality that makes these targets pop out may not always be trivial though. If you are viewing the image on the left in color, you may immediately notice the only red bar in the image. However, if you look at the same image in grayscale, the target bar will be hard to find (it is the fourth bar from the top, fifth bar from the left). Similarly, there is a visually salient bar in the image on the right, this time because of its orientation rather than its color. Although the target bar is of unique color in the left image and of unique orientation in the right image, put the two characteristics together and suddenly the unique target bar does not pop out anymore:

In this preceding display, there is again one bar that is unique and different from all the other ones. However, because of the way the distracting items were designed, there is little salience to guide you towards the target bar. Instead, you find yourself scanning the image, seemingly at random, looking for something interesting. (Hint: The target is the only red and almost-vertical bar in the image, second row from the top, third column from the left.)

What does this have to do with computer vision, you ask? Quite a lot, actually. Artificial vision systems suffer from information overload much like you and me, except that they know even less about the world than we do. What if we could extract some insights from biology and use them to teach our algorithms something about the world? Imagine a dashboard camera in your car that automatically focuses on the most relevant traffic sign. Imagine a surveillance camera that is part of a wildlife observation station that will automatically detect and track the sighting of the notoriously shy platypus but will ignore everything else. How can we teach the algorithm what is important and what is not? How can we make that platypus "pop out"?

Fourier analysis

To find the visually salient sub-regions of an image, we need to look at its frequency spectrum. So far we have treated all our images and video frames in the spatial domain; that is, by analyzing the pixels or studying how the image intensity changes in different sub-regions of the image. However, the images can also be represented in the frequency domain; that is, by analyzing spatial frequencies or studying how rapidly and with what periodicity the pixel values change across the image.

An image can be transformed from the spatial domain into the frequency domain by applying the Fourier transform. In the frequency domain, we no longer think in terms of image coordinates (x,y). Instead, we aim to find the spectrum of an image. Fourier's radical idea basically boils down to the following question: what if any signal or image could be transformed into a series of circular paths (also called harmonics)?

For example, think of a rainbow. Beautiful, isn't it? In a rainbow, white sunlight (composed of many different colors or parts of the spectrum) is spread into its spectrum. Here the color spectrum of the sunlight is exposed when the rays of light pass through raindrops (much like white light passing through a glass prism). The Fourier transform aims to do the same thing: to recover all the different parts of the spectrum that are contained in the sunlight.

A similar thing can be achieved for arbitrary images. In contrast to rainbows, where frequency corresponds to electromagnetic frequency, with images we consider spatial frequency; that is, the spatial periodicity of the pixel values. In an image of a prison cell, you can think of spatial frequency as (the inverse of) the distance between two adjacent prison bars.
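To make the prison-bar picture concrete, here is a minimal sketch that builds a synthetic striped image and recovers its dominant spatial frequency. The image, the variable names, and the 8-pixel bar spacing are assumptions chosen purely for illustration.

```python
import numpy as np

# Hypothetical test image: vertical "prison bars" that repeat every 8 pixels.
width, height, period = 64, 32, 8
x = np.arange(width)
row = (x % period < period // 2).astype(np.float64)  # 4 px bright, 4 px dark
bars = np.tile(row, (height, 1))

# Magnitude spectrum along the horizontal axis, averaged over all rows.
spectrum = np.abs(np.fft.rfft(bars, axis=1)).mean(axis=0)
freqs = np.fft.rfftfreq(width)  # spatial frequency in cycles per pixel

# Skip the zero-frequency bin (mean brightness) and pick the strongest peak.
peak = freqs[1:][np.argmax(spectrum[1:])]
print(peak, 1.0 / period)
```

The strongest non-zero frequency is the inverse of the bar spacing, which is the relationship described above.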

The insights that can be gained from this change of perspective are very powerful. Without going into too much detail, let us just remark that a Fourier spectrum comes with both a magnitude and a phase. While the magnitude describes how strongly each frequency is present in the image, the phase encodes where in the image these frequencies occur. The following image shows a natural image on the left and the corresponding Fourier magnitude spectrum (of the grayscale version) on the right:

The magnitude spectrum on the right tells us which frequency components are the most prominent (bright) in the grayscale version of the image on the left. The spectrum is adjusted so that the center of the image corresponds to zero frequency in x and y. The further you move to the border of the image, the higher the frequency gets. This particular spectrum is telling us that there are a lot of low-frequency components in the image on the left (clustered around the center of the image).
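The roles of magnitude and phase, and the centered layout of the spectrum, can be checked on a toy example. The following sketch uses a random array as a stand-in for a grayscale image (an assumption; any 2D array would do) and relies only on NumPy's fft module.

```python
import numpy as np

rng = np.random.RandomState(0)
img = rng.rand(16, 16)                 # stand-in for a grayscale image
moved = np.roll(img, shift=3, axis=1)  # same content, moved 3 px (wrapping around)

F_img, F_moved = np.fft.fft2(img), np.fft.fft2(moved)

# Moving the content leaves the magnitude untouched and only changes the phase.
same_magnitude = np.allclose(np.abs(F_img), np.abs(F_moved))
same_phase = np.allclose(np.angle(F_img), np.angle(F_moved))

# fftshift moves the zero-frequency term from the corner to the center.
centered = np.fft.fftshift(np.abs(F_img))
dc_row, dc_col = np.unravel_index(np.argmax(centered), centered.shape)
print(same_magnitude, same_phase, (dc_row, dc_col))
```

For this array the brightest point of the shifted spectrum is the zero-frequency term in the middle, which is why low-frequency content shows up clustered around the center.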

In code, this transformation is computed with the Discrete Fourier Transform (DFT). The plot_magnitude method of the Saliency class does this as follows:

  1. Convert the image to grayscale if necessary: Because the method accepts both grayscale and RGB color images, we need to make sure we operate on a single-channel image:
    def plot_magnitude(self):
        # work on a single-channel version of the frame
        if len(self.frame_orig.shape)>2:
            frame = cv2.cvtColor(self.frame_orig, cv2.COLOR_BGR2GRAY)
        else:
            frame = self.frame_orig
  2. Expand the image to an optimal size: It turns out that the performance of a DFT depends on the image size. It tends to be fastest for image sizes that are powers of two, and it is still efficient for sizes whose only prime factors are 2, 3, and 5, which is what cv2.getOptimalDFTSize returns. It is therefore generally a good idea to pad the image with zeros at the bottom and on the right:
    rows, cols = self.frame_orig.shape[:2]
    nrows = cv2.getOptimalDFTSize(rows)
    ncols = cv2.getOptimalDFTSize(cols)
    frame = cv2.copyMakeBorder(frame, 0, nrows-rows, 0, ncols-cols, cv2.BORDER_CONSTANT, value = 0)
  3. Apply the DFT: This is a single function call in NumPy. The result is a 2D matrix of complex numbers:
    img_dft = np.fft.fft2(frame)
  4. Transform the real and imaginary values to magnitude: A complex number has a real (Re) and an imaginary (Im) part. To extract the magnitude, we take the absolute value:
    magn = np.abs(img_dft)
  5. Switch to a logarithmic scale: It turns out that the dynamic range of the Fourier coefficients is usually too large to be displayed on the screen. A few very large values sit next to many very small ones, so on a linear scale the large values would all show up as white points and everything else as black. To make use of the full range of gray values for visualization, we transform our linear scale to a logarithmic one:
    log_magn = np.log10(magn)
  6. Shift quadrants: This centers the spectrum on the image, so that zero frequency sits in the middle, which makes it easier to visually inspect the magnitude spectrum:
    spectrum = np.fft.fftshift(log_magn)
  7. Return the result for plotting:
    return spectrum/np.max(spectrum)*255
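The steps above can be collected into a single standalone function. The following is a minimal NumPy-only sketch, not the Saliency class itself: the names magnitude_spectrum and next_power_of_two are hypothetical, padding to the next power of two is a stand-in for cv2.getOptimalDFTSize, and the output is rescaled with a min-max normalization (an assumption, so that negative log values also map into 0..255).

```python
import numpy as np

def next_power_of_two(n):
    """Smallest power of two >= n (stand-in for cv2.getOptimalDFTSize)."""
    return 1 << (int(n) - 1).bit_length()

def magnitude_spectrum(gray):
    """Centered log-magnitude spectrum of a 2D array, scaled to 0..255."""
    gray = np.asarray(gray, dtype=np.float64)
    rows, cols = gray.shape
    # zero-pad at the bottom and on the right
    padded = np.pad(gray,
                    ((0, next_power_of_two(rows) - rows),
                     (0, next_power_of_two(cols) - cols)),
                    mode="constant")
    magn = np.abs(np.fft.fft2(padded))
    log_magn = np.log10(np.maximum(magn, 1e-9))  # guard against log10(0)
    spectrum = np.fft.fftshift(log_magn)         # zero frequency to the center
    spectrum = spectrum - spectrum.min()
    return spectrum / spectrum.max() * 255

demo = magnitude_spectrum(np.random.RandomState(0).rand(30, 48))
print(demo.shape)  # (32, 64)
```

Note that the returned array has the padded size, not the size of the input.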

Natural scene statistics

The human brain figured out how to focus on visually salient objects a long time ago. The natural world in which we live has some statistical regularities that make it uniquely natural, as opposed to a chessboard pattern or a random company logo. Probably the most commonly known statistical regularity is the 1/f law. It states that the amplitude spectrum of the ensemble of natural images obeys a 1/f distribution, as shown in the figure later in this section. This is sometimes also referred to as scale invariance.

A 1D power spectrum (as a function of frequency) of a 2D image can be visualized with the plot_power_spectrum method of the Saliency class. We can use a similar recipe as for the magnitude spectrum used previously, but we will have to make sure that we correctly collapse the 2D spectrum onto a single axis.

  1. Convert the image to grayscale if necessary (same as earlier):
    def plot_power_spectrum(self):
        # work on a single-channel version of the frame
        if len(self.frame_orig.shape)>2:
            frame = cv2.cvtColor(self.frame_orig, cv2.COLOR_BGR2GRAY)
        else:
            frame = self.frame_orig
  2. Expand the image to optimal size (same as earlier):
    rows, cols = self.frame_orig.shape[:2]
    nrows = cv2.getOptimalDFTSize(rows)
    ncols = cv2.getOptimalDFTSize(cols)
    frame = cv2.copyMakeBorder(frame, 0, nrows-rows, 0, ncols-cols, cv2.BORDER_CONSTANT, value = 0)
  3. Apply the DFT and get the log spectrum: Here we give the user an option (via flag use_numpy_fft) to use either NumPy's or OpenCV's Fourier tools:
    if self.use_numpy_fft:
        img_dft = np.fft.fft2(frame)
        spectrum = np.log10(np.real(np.abs(img_dft))**2)
    else:
        img_dft = cv2.dft(np.float32(frame), flags=cv2.DFT_COMPLEX_OUTPUT)
        spectrum = np.log10(img_dft[:,:,0]**2 + img_dft[:,:,1]**2)
  4. Perform radial averaging: This is the tricky part. It would be wrong to simply average the 2D spectrum in the direction of x or y. What we are interested in is a spectrum as a function of frequency, independent of the exact orientation. This is sometimes also called the radially averaged power spectrum (RAPS), and can be achieved by summing up all the frequency magnitudes, starting at the center of the image, looking into all possible (radial) directions, from some frequency r to r+dr. We use the binning function of NumPy's histogram to sum up the numbers, and accumulate them in the variable histo:
    L = max(frame.shape)
    freqs = np.fft.fftfreq(L)[:L//2]
    dists = np.sqrt(np.fft.fftfreq(frame.shape[0])[:,np.newaxis]**2 + np.fft.fftfreq(frame.shape[1])**2)
    dcount = np.histogram(dists.ravel(), bins=freqs)[0]
    histo, bins = np.histogram(dists.ravel(), bins=freqs,weights=spectrum.ravel())
  5. Plot the result: Finally, we can plot the accumulated numbers in histo, but must not forget to normalize these by the number of samples that fell into each bin (dcount):
    centers = (bins[:-1] + bins[1:]) / 2
    plt.plot(centers, histo/dcount)

The result is a function that is inversely proportional to the frequency. If you want to be absolutely certain of the 1/f property, you could take np.log10 of all the x values and make sure the curve is decreasing roughly linearly. On a linear x axis and logarithmic y axis, the plot looks like the following:

This property is quite remarkable. It states that if we were to average all the spectra of all the images ever taken of natural scenes (neglecting all the ones taken with fancy image filters, of course), we would get a curve that would look remarkably like the one shown in the preceding image.
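The radial averaging recipe from the previous list can also be sketched as a standalone helper. The function name radial_average, the equally spaced bin edges between 0 and 0.5 cycles per pixel, and the random test image are assumptions made for illustration; the helper averages whatever 2D array it is given, laid out on the unshifted FFT frequency grid.

```python
import numpy as np

def radial_average(values, num_bins=None):
    """Average a 2D array on the (unshifted) FFT frequency grid over rings."""
    rows, cols = values.shape
    fy = np.fft.fftfreq(rows)[:, np.newaxis]
    fx = np.fft.fftfreq(cols)[np.newaxis, :]
    dists = np.sqrt(fy ** 2 + fx ** 2)   # radial frequency of every coefficient
    if num_bins is None:
        num_bins = max(rows, cols) // 2
    edges = np.linspace(0.0, 0.5, num_bins + 1)
    # number of coefficients per ring, and the sum of their values
    counts, _ = np.histogram(dists.ravel(), bins=edges)
    sums, _ = np.histogram(dists.ravel(), bins=edges, weights=values.ravel())
    centers = (edges[:-1] + edges[1:]) / 2
    valid = counts > 0                   # skip empty rings
    return centers[valid], sums[valid] / counts[valid]

# Usage on a random stand-in image (not a natural scene):
img = np.random.RandomState(0).rand(64, 64)
power = np.abs(np.fft.fft2(img)) ** 2
freqs, raps = radial_average(power)
print(freqs.shape, raps.shape)
```

Coefficients in the corners of the spectrum, whose radial frequency exceeds 0.5, fall outside the bin range and are ignored in this sketch.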

But going back to the image of a peaceful little boat on the Limmat river, what about single images? We have just looked at the power spectrum of this image and witnessed the 1/f property. How can we use our knowledge of natural image statistics to tell an algorithm not to stare at the tree on the left, but instead focus on the boat that is chugging in the water?

This is where we realize what saliency really means.

Generating a Saliency map with the spectral residual approach

The things that deserve our attention in an image are not the image patches that follow the 1/f law, but the patches that stick out of the smooth curves. In other words, the statistical anomalies. These anomalies are termed the spectral residual of an image, and correspond to the potentially interesting patches of an image (or proto-objects). A map that shows these statistical anomalies as bright spots is called a saliency map.


The spectral residual approach described here is based on the original scientific publication: Xiaodi Hou and Liqing Zhang (2007). Saliency Detection: A Spectral Residual Approach. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p.1-8. doi: 10.1109/CVPR.2007.383267.

In order to generate a saliency map based on the spectral residual approach, we need to process each channel of an input image separately (single channel in the case of a grayscale input image, and three separate channels in the case of an RGB input image).

The saliency map of a single channel can be generated with the private method Saliency._get_channel_sal_magn using the following recipe:

  1. Calculate the (magnitude and phase of the) Fourier spectrum of an image, by again using either the fft module of NumPy or OpenCV functionality:
    def _get_channel_sal_magn(self, channel):
        if self.use_numpy_fft:
            img_dft = np.fft.fft2(channel)
            magnitude, angle = cv2.cartToPolar(np.real(img_dft), np.imag(img_dft))
        else:
            img_dft = cv2.dft(np.float32(channel), flags=cv2.DFT_COMPLEX_OUTPUT)
            magnitude, angle = cv2.cartToPolar(img_dft[:, :, 0], img_dft[:, :, 1])
  2. Calculate the log amplitude of the Fourier spectrum. We will clip the lower bound of magnitudes to 1e-9 in order to avoid taking the log of zero. We use the natural logarithm here, so that it is undone exactly by np.exp in step 4:
    log_ampl = np.log(magnitude.clip(min=1e-9))
  3. Approximate the averaged spectrum of a typical natural image by convolving the log amplitude spectrum with a local averaging filter:
    log_ampl_blur = cv2.blur(log_ampl, (3, 3))
  4. Calculate the spectral residual. The spectral residual primarily contains the nontrivial (or unexpected) parts of a scene:
    residual = np.exp(log_ampl - log_ampl_blur)
  5. Calculate the saliency map by using the inverse Fourier transform, again either via the fft module in NumPy or with OpenCV:
        if self.use_numpy_fft:
            real_part, imag_part = cv2.polarToCart(residual, angle)
            img_combined = np.fft.ifft2(real_part + 1j*imag_part)
            magnitude, _ = cv2.cartToPolar(np.real(img_combined), np.imag(img_combined))
        else:
            img_dft[:, :, 0], img_dft[:, :, 1] = cv2.polarToCart(residual, angle)
            img_combined = cv2.idft(img_dft)
            magnitude, _ = cv2.cartToPolar(img_combined[:, :, 0], img_combined[:, :, 1])
        return magnitude
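For readers who want to try the recipe outside of the Saliency class, the following is a minimal NumPy-only sketch. The names box_blur3 and spectral_residual_saliency are hypothetical, the 3x3 mean filter with reflected borders is a stand-in for cv2.blur, and the natural logarithm is used so that np.exp is its exact inverse.

```python
import numpy as np

def box_blur3(a):
    """3x3 mean filter with reflected borders (stand-in for cv2.blur)."""
    rows, cols = a.shape
    padded = np.pad(a, 1, mode="reflect")
    total = np.zeros((rows, cols), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            total += padded[dy:dy + rows, dx:dx + cols]
    return total / 9.0

def spectral_residual_saliency(channel):
    """Single-channel saliency magnitude following the steps listed above."""
    dft = np.fft.fft2(np.asarray(channel, dtype=np.float64))
    magnitude, angle = np.abs(dft), np.angle(dft)
    log_ampl = np.log(np.clip(magnitude, 1e-9, None))   # avoid log(0)
    residual = np.exp(log_ampl - box_blur3(log_ampl))   # spectral residual
    # recombine the residual with the original phase and transform back
    combined = np.fft.ifft2(residual * np.exp(1j * angle))
    return np.abs(combined)

# Usage on a stand-in image: a smooth gradient with one small bright patch.
demo = np.tile(np.linspace(0.0, 1.0, 32), (32, 1))
demo[10:13, 20:23] += 2.0
sal_map = spectral_residual_saliency(demo)
print(sal_map.shape)
```

A quick sanity check of the idea: for an image that contains a single bright pixel, the amplitude spectrum is perfectly flat, so the residual is one everywhere and the sketch returns the input unchanged.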

The resulting single-channel saliency map (magnitude) is then returned to Saliency.get_saliency_map, where the procedure is repeated for all channels of the input image. If the input image is grayscale, we are pretty much done:

def get_saliency_map(self):
    if self.need_saliency_map:
        # haven't calculated saliency map for this frame yet
        num_channels = 1
        if len(self.frame_orig.shape)==2:
            # single channel
            sal = self._get_channel_sal_magn(self.frame_small)

However, if the input image has multiple channels, as is the case for an RGB color image, we need to consider each channel separately:

        else:
            # consider each channel independently
            sal = np.zeros_like(self.frame_small).astype(np.float32)
            for c in range(self.frame_small.shape[2]):
                sal[:, :, c] = self._get_channel_sal_magn(self.frame_small[:, :, c])

The overall salience of a multi-channel image is then determined by the average over all channels:

            sal = np.mean(sal, 2)

Finally, we need to apply some post-processing, such as an optional blurring stage to make the result appear smoother:

        if self.gauss_kernel is not None:
            sal = cv2.GaussianBlur(sal, self.gauss_kernel, 
                sigmaX=8, sigmaY=0)

Also, we need to square the values in sal in order to highlight the regions of high salience, as outlined by the authors of the original paper. In order to display the image, we scale it back up to its original resolution and normalize the values, so that the largest value is one:

        sal = sal**2
        sal = np.float32(sal)/np.max(sal)
        sal = cv2.resize(sal, self.frame_orig.shape[1::-1])

In order to avoid having to redo all these intense calculations, we store a local copy of the saliency map for further reference and make sure to lower the flag:

        self.saliency_map = sal
        self.need_saliency_map = False

    return self.saliency_map

Then, when the user makes subsequent calls to class methods that rely on the calculation of the saliency map under the hood, we can simply refer to the local copy instead of having to do the calculations all over again.
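The channel averaging, the squaring and normalization, and the caching flag can be illustrated in isolation. The sketch below is a stripped-down stand-in, not the Saliency class: combine_channels, CachedSaliency, num_computations, and the channel_saliency callable are hypothetical names, and the blurring and resizing steps are left out.

```python
import numpy as np

def combine_channels(frame, channel_saliency):
    """Average per-channel maps, square them, and scale the maximum to one."""
    if frame.ndim == 2:
        sal = channel_saliency(frame)  # single channel
    else:
        per_channel = [channel_saliency(frame[:, :, c])
                       for c in range(frame.shape[2])]
        sal = np.mean(per_channel, axis=0)
    sal = sal ** 2                     # emphasize regions of high salience
    return sal / max(sal.max(), 1e-12)

class CachedSaliency(object):
    """Computes the map once per frame and reuses it afterwards."""
    def __init__(self, frame, channel_saliency):
        self.frame = frame
        self.channel_saliency = channel_saliency
        self.need_saliency_map = True
        self.saliency_map = None
        self.num_computations = 0      # only here to make the caching visible

    def get_saliency_map(self):
        if self.need_saliency_map:
            self.saliency_map = combine_channels(self.frame,
                                                 self.channel_saliency)
            self.num_computations += 1
            self.need_saliency_map = False  # lower the flag
        return self.saliency_map

# Usage with a toy per-channel function that simply returns the channel:
frame = np.random.RandomState(0).rand(4, 4, 3)
cache = CachedSaliency(frame, lambda ch: ch.astype(np.float64))
cache.get_saliency_map()
cache.get_saliency_map()
print(cache.num_computations)  # 1
```

In a video setting, the flag would have to be raised again whenever a new frame arrives.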

The resulting saliency map then looks like the following image:

Now we can clearly spot the boat in the water (lower-left corner), which appears as one of the most salient sub-regions of the image. There are other salient regions, too, such as the Grossmünster on the right (have you guessed the city yet?).


By the way, the fact that these two areas are the most salient ones in the image seems to be clear and indisputable evidence that the algorithm is aware of the ridiculous number of church towers in the city center of Zurich, effectively ruling out any chance of them being labeled as "salient".

Detecting proto-objects in a scene

In a sense, the saliency map is already an explicit representation of proto-objects, as it contains only the interesting parts of an image. So now that we have done all the hard work, all that is left to do in order to obtain a proto-object map is to threshold the saliency map.

The only open parameter to consider here is the threshold. Setting the threshold too low will result in labeling a lot of regions as proto-objects, including some that might not contain anything of interest (false alarms). On the other hand, setting the threshold too high will ignore most of the salient regions in the image and might leave us with no proto-objects at all. The authors of the original spectral residual paper chose to label only those regions of the image as proto-objects whose saliency was larger than three times the mean saliency of the image. We give the user the choice to either implement this threshold, or to go with the Otsu threshold by setting the input flag use_otsu to True:

def get_proto_objects_map(self, use_otsu=True):

We then retrieve the saliency map of the current frame and make sure to convert it to uint8 precision, so that it can be passed to cv2.threshold:

    saliency = self.get_saliency_map()
    if use_otsu:
        _, img_objects = cv2.threshold(np.uint8(saliency*255), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

Otherwise, we will use three times the mean saliency as the threshold thresh:

    else:
        thresh = np.mean(saliency)*255*3
        _, img_objects = cv2.threshold(np.uint8(saliency*255), thresh, 255, cv2.THRESH_BINARY)
    return img_objects
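The effect of the mean-based threshold can be explored without OpenCV. The following minimal sketch assumes a saliency map with values in the range 0 to 1; the function name proto_objects_mask and its factor parameter are hypothetical and are only meant to show how the choice of threshold changes the number of labeled pixels.

```python
import numpy as np

def proto_objects_mask(saliency, factor=3.0):
    """Label pixels whose saliency exceeds factor times the mean saliency."""
    saliency = np.asarray(saliency, dtype=np.float64)
    thresh = factor * saliency.mean()
    # same convention as a binary threshold: 255 above thresh, 0 elsewhere
    return np.where(saliency > thresh, 255, 0).astype(np.uint8)

# Usage on a toy map: a horizontal ramp from 0 to 1.
ramp = np.tile(np.linspace(0.0, 1.0, 20), (5, 1))
strict = np.count_nonzero(proto_objects_mask(ramp, factor=1.5))
loose = np.count_nonzero(proto_objects_mask(ramp, factor=0.5))
print(strict, loose)
```

A lower factor labels more pixels as proto-objects, a higher factor fewer, which is the trade-off described above.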

The resulting proto-objects mask looks like the following image:

The proto-objects mask then serves as an input to the tracking algorithm.

