7 Motion Analysis

To analyze changes in image information or moving objects, an image sequence (also known as motion pictures) is needed. An image sequence is composed of a series of consecutive, time-varying 2-D (space) images, and is often regarded as a kind of 3-D image that can be expressed as f(x, y, t). Compared with a still image f(x, y), the time variable t has been added; when t takes a certain value, one image (a frame) of the sequence is obtained. Different from a single image, an image sequence (video is a common kind and will be used in the following) is collected continuously to reflect the changes of moving objects and of the scenery. Objective things are constantly moving and changing: movement is absolute, while being stationary is relative. Scene changes and object motion are therefore quite obvious and clear in an image sequence.

The analysis of motion in an image sequence can be based on the analysis of still images. However, it also requires expanded techniques, different means, and broadened purposes.

The sections of this chapter are arranged as follows:

Section 7.1 overviews the subjects of motion studies, including motion detection, locating and tracking moving objects, moving object segmentation and analysis, as well as 3-D scene reconstruction and motion/scene understanding.

Section 7.2 discusses the motion information detection problem. The principles of motion detection using image differences and of model-based motion detection are introduced.

Section 7.3 focuses on motion detection, especially moving object (target) detection. It introduces the techniques and effects of background modeling, as well as the computation of the optical flow field and its application in motion analysis.

Section 7.4 discusses the segmentation of moving objects in image sequences. First, the relation between moving object segmentation and motion information extraction is analyzed, and then, the dense optical flow algorithm based on brightness gradient is introduced. The ideas of segmentation and practical methods based on parameters and models are described.

Section 7.5 describes the typical motion tracking or moving object tracking technology, including the Kalman filter, particle filter, mean shift, and kernel tracking techniques as well as the tracking strategy using subsequences decision.

7.1 The Purpose and Subject of Motion Analysis

Compared to texture and shape, the concept of motion is relatively straightforward. The objective and tasks of motion analysis may include the following.

7.1.1 Motion Detection

Motion detection means detecting whether there is movement information in the scene (including global motion and local motion). In this case, only a single fixed camera is usually required. A typical example is security surveillance, where any factor leading to changes in the image needs to be taken into account. Of course, since changes in lighting are relatively slow while changes caused by the movement of objects are relatively quick, these changes should be further distinguished.

7.1.2 Locating and Tracking Moving Objects

The concern here is mainly to discover whether there are moving objects in the scene; if an object exists, what its current position is; and further, it may be necessary to find its trajectory and to predict its next moving direction and tendency as well as its future route. For such a situation, a single fixed camera is generally sufficient. In practice, the camera may be still while the object is moving, or the camera may be moving while the object is stationary; if both of them are moving, this is the most complex case.

According to different research purposes, different techniques may be employed. If it is required only to determine the position of moving objects, then some motion segmentation methods can be used; in this case, an initial segmentation of moving objects can be made by means of the motion information. If further determination of the direction of movement and the trajectory of the moving object is required, or even prediction of the motion tendency is wanted, then some matching techniques are used to establish the relationship among the object (target) image data, the object characteristics, or the graph representing a moving object.

Moving object localization is often considered synonymous with moving object detection, but it is more concerned with the object's location than with the object's characteristics. Often some assumptions are made:

1. Maximum speed: if the position of a moving object is known in the previous frame, then its position in the current frame lies inside a circle whose center is at the position in the last frame and whose radius is the maximum speed of object motion;

2. Small acceleration: the change of the object velocity is limited; it is often predictable as it has more or less some regularity;

3. Mutual correspondence: a rigid object maintains a stable pattern in the image sequence, and object points in the scene correspond to points in the image;

4. Joint motion: if there are multiple object points, their modes of motion are related (similar).

7.1.3 Moving Object Segmentation and Analysis

Moving object segmentation goes further than moving object localization. Beyond determining the position of the object, this process needs to extract the object accurately so as to obtain its pose. Moving object analysis then provides the identity and motion parameters of the object, so as to obtain the law of movement and determine the motion type, etc. In this situation, it is often necessary to use a video camera to obtain an image sequence and to distinguish between global motion and local motion. It is also often necessary to obtain the 3-D characteristics of objects, or to further identify the category of the moving object.

7.1.4 Three-Dimensional Scene Reconstruction and Motion Understanding

On the basis of image motion information and object motion information, the depth/distance of 3-D objects from the camera, the surface orientation of objects, and the occlusion among several objects can be computed. On the other hand, by combining motion information with other information extracted from images, the causal relationship of motion can be determined. Further aided by scene knowledge, the interpretation of scene and motion and the understanding of the behavior of moving objects can also be obtained. This will be discussed in Volume III of this book set.

7.2 Motion Detection

To understand the changes in a scene, it is first required to detect motion information, that is, to determine whether there is movement, what is moving, and where changes occur. Second, it is required to estimate the motion, that is, to determine the parameters of movement (magnitude and direction, etc.). The second step is also known as motion estimation, but in many cases these two steps are collectively called motion detection. Motion detection is quite specific to video image processing and is the basis for many tasks that use video as input.

Detection of motion often means detecting the motion information of the whole image. As indicated in the previous section, there are both foreground and background motions in video, so motion detection needs to detect changes caused by the movement of the whole scene as well as changes caused by the movement of specific objects.

7.2.1 Motion Detection Using Image Differences

In a video, the difference between two consecutive (before and after) frames can be found by a pixel-by-pixel comparison. Suppose the lighting conditions do not change substantially across frames; then a nonzero value in the difference image indicates that the pixel at that place has moved or changed (note that pixels may also have moved at places where the difference image is zero). In other words, taking the difference between two adjacent frames can reveal the location and shape change of moving objects in the image.

7.2.1.1 Calculating a Difference Image

Referring to Figure 7.1(a), suppose the object is brighter than the background. With the help of the difference image, one positive region at the front of the movement and one negative region at the rear of the movement are obtained. The motion information of the object, or the shape of some parts of the object, can be further obtained. If the differences of all adjacent image pairs in a series are taken, and the logic AND of the positive regions and of the negative regions is formed, respectively, the shape of the entire object can finally be found. Figure 7.1(b) illustrates this procedure: a rectangular region gradually moves down and sequentially travels across different parts of the oval object; combining the results of all times, a complete elliptical object is produced.

If a series of images has been collected under relative motion between the image acquisition device and the scene being shot, then the pixels that have changed can be determined with the help of the motion information present in the images. Suppose the two images f(x, y, ti) and f(x, y, tj) were collected at times ti and tj, respectively; then the difference image obtained is

d_{ij}(x, y) = \begin{cases} 1 & |f(x, y, t_i) - f(x, y, t_j)| > T_g \\ 0 & \text{otherwise} \end{cases}              (7.1)

where Tg is the gray-level threshold. A pixel with value 0 in the difference image corresponds to a place where no change (arising from motion) occurs between the two times. A pixel with value 1 corresponds to a place where a change occurs between the two times; this change is often caused by the motion of an object. However, a pixel with value 1 may also arise from other circumstances, such as f(x, y, ti) being a pixel belonging to the moving object while f(x, y, tj) belongs to the background, or vice versa. Other examples include f(x, y, ti) belonging to one moving object while f(x, y, tj) belongs to another moving object, or f(x, y, tj) belonging to the same moving object but at another location (so the gray level may be different). Example 3.4 in Volume I of this book set has shown an instance in which the object motion information is detected by using the image difference.

Figure 7.1: Using difference image to extract object.
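As a minimal illustration (not from the original text), the following NumPy sketch builds the binary difference image of eq. (7.1) from two gray-level frames; the threshold value Tg = 25 and the toy frames are assumptions made only for this example.

import numpy as np

def difference_image(f_i, f_j, t_g=25):
    # Eq. (7.1): 1 where |f(x,y,t_i) - f(x,y,t_j)| > Tg, 0 otherwise.
    d = np.abs(f_i.astype(np.int32) - f_j.astype(np.int32))
    return (d > t_g).astype(np.uint8)

# Toy frames (assumed values): a bright 3x3 object moves one pixel to the right.
frame_i = np.zeros((8, 8), dtype=np.uint8)
frame_j = np.zeros((8, 8), dtype=np.uint8)
frame_i[2:5, 2:5] = 200      # object at time t_i
frame_j[2:5, 3:6] = 200      # object at time t_j
print(difference_image(frame_i, frame_j))   # 1s mark the trailing and leading edges of the motion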

The threshold Tg in eq. (7.1) is used to determine whether the gray levels of two consecutive images differ appreciably. Another method is to use the likelihood ratio to decide whether there is a significant difference:

\frac{\left[ \frac{\sigma_i + \sigma_j}{2} + \left( \frac{\mu_i - \mu_j}{2} \right)^2 \right]^2}{\sigma_i \sigma_j} > T_s                       (7.2)

where μi, μj and σi, σj are the means and variances of the two images collected at times ti and tj, respectively, and Ts is the significance threshold.
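A small sketch of eq. (7.2) for two co-located blocks is given below; treating σ as the block variance (as stated above) and the value Ts = 5.0 are assumptions made only for illustration.

import numpy as np

def likelihood_ratio(block_i, block_j):
    # Likelihood ratio of eq. (7.2); sigma is taken as the block variance.
    mu_i, mu_j = block_i.mean(), block_j.mean()
    s_i, s_j = block_i.var(), block_j.var()
    numerator = ((s_i + s_j) / 2.0 + ((mu_i - mu_j) / 2.0) ** 2) ** 2
    return numerator / (s_i * s_j + 1e-12)   # small epsilon guards against zero variance

rng = np.random.default_rng(0)
b1 = rng.normal(100.0, 5.0, (8, 8))   # block at time t_i
b2 = rng.normal(140.0, 5.0, (8, 8))   # brighter block at time t_j (an object edge has moved in)
T_s = 5.0                             # assumed significance threshold
print(likelihood_ratio(b1, b2) > T_s) # True: the change is judged significant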

In real applications, owing to the influence of random noise, places where no pixel has moved can also show nonzero difference values between two images. To distinguish the movement of pixels from the influence of noise, a larger threshold can be used for the difference image; that is, only when the difference is greater than a predetermined threshold is the pixel considered to have moved. In addition, since the value-1 pixels caused by noise are generally isolated, they can also be removed by connectivity analysis. However, such an approach may also remove pixels belonging to small objects and/or slowly moving objects.

7.2.1.2 Calculating a Cumulative Difference Image

To overcome the above-mentioned problem with random noise, multiple images can be used. If the change at one location appears only occasionally, it can be judged as noise. Let a series of images be f(x, y, t1), f(x, y, t2), ..., f(x, y, tn), and let the first image f(x, y, t1) be the reference image. By comparing the first image with each subsequent image, a cumulative (accumulated) difference image (ADI) can be obtained. In this image, the value at each location is the sum of the number of changes over all comparisons.

One example of a cumulative difference image (ADI) is given in Figure 7.2. Figure 7.2(a) shows an image captured at time t1; there is a square object inside, which moves horizontally to the right by one pixel per unit time. Figure 7.2(b, c) shows the images captured at times t2 and t3 (after one and two time units), respectively. Figure 7.2(d, e) shows the corresponding cumulative difference images at times t2 and t3, respectively. Figure 7.2(d) is the common difference image discussed earlier: the left square marked with 1 corresponds to the gray-level difference (one unit) between the trailing edge of the object in Figure 7.2(a) and the background in Figure 7.2(b), and the right square marked with 1 corresponds to the gray-level difference (also one unit) between the background in Figure 7.2(a) and the leading edge of the object in Figure 7.2(b). Figure 7.2(e) is obtained by adding to Figure 7.2(d) the gray-level difference between Figure 7.2(a, c), in which the gray difference between 0 and 1 is two units, and the gray difference between 2 and 3 is also two units.

Figure 7.2: Using accumulated difference image to extract object.

Referring to the example above, it can be seen that the cumulative difference image ADI has three functions:

1. In the ADI, the gradient relationship between the values of adjacent pixels can be used to estimate the velocity vector of the object movement: the gradient direction gives the direction of the velocity, and the gradient magnitude is proportional to the magnitude of the velocity.

2. The pixel values in the ADI help to determine the size and the moving distance of the moving object.

3. The ADI includes all the historical data of the object motion and is helpful for detecting slow motion and the motion of small objects.

In practical applications, three types of ADI can be distinguished (Gonzalez, 2008): the absolute ADI (Ak(x, y)), the positive ADI (Pk(x, y)), and the negative ADI (Nk(x, y)). Assuming the gray level of the moving object is larger than that of the background, then for k > 1 the three types of ADI are defined (taking f(x, y, t1) as the reference and Tg as above) by:

A_k(x, y) = \begin{cases} A_{k-1}(x, y) + 1 & |f(x, y, t_1) - f(x, y, t_k)| > T_g \\ A_{k-1}(x, y) & \text{otherwise} \end{cases}            (7.3)

P_k(x, y) = \begin{cases} P_{k-1}(x, y) + 1 & [f(x, y, t_1) - f(x, y, t_k)] > T_g \\ P_{k-1}(x, y) & \text{otherwise} \end{cases}            (7.4)

N_k(x, y) = \begin{cases} N_{k-1}(x, y) + 1 & [f(x, y, t_1) - f(x, y, t_k)] < -T_g \\ N_{k-1}(x, y) & \text{otherwise} \end{cases}            (7.5)

All three types of ADI values are results of pixel counting, and they are initially zero. The following information can be obtained from them:

1. The nonzero area of the positive ADI is equal to the area of the moving object.

2. The position corresponding to the moving object in the positive ADI is that of the moving object in the reference image.

3. When the moving object has moved to a place that no longer coincides with the moving object in the reference image, the counting of the positive ADI stops.

4. The absolute ADI includes all the object regions of both the positive ADI and the negative ADI.

5. The movement direction and speed of moving objects can be determined on the basis of the absolute ADI and the negative ADI.
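The following NumPy sketch accumulates the three ADIs of eqs. (7.3)–(7.5) over a toy sequence in which a bright square drifts to the right; the threshold Tg = 25, the frame size, and the object gray level are assumed only for illustration.

import numpy as np

def update_adi(adi_a, adi_p, adi_n, ref, frame, t_g=25):
    # One update step of the absolute, positive and negative ADIs, eqs. (7.3)-(7.5).
    d = ref.astype(np.int32) - frame.astype(np.int32)
    adi_a += (np.abs(d) > t_g)    # absolute ADI
    adi_p += (d > t_g)            # positive ADI
    adi_n += (d < -t_g)           # negative ADI
    return adi_a, adi_p, adi_n

# Reference frame plus a toy sequence in which a bright square drifts to the right.
ref = np.zeros((10, 10), dtype=np.uint8)
ref[3:6, 1:4] = 200
adi_a = np.zeros((10, 10), dtype=np.int32)
adi_p = np.zeros_like(adi_a)
adi_n = np.zeros_like(adi_a)
for k in range(1, 5):                     # frames at times t_2 ... t_5
    frame = np.zeros((10, 10), dtype=np.uint8)
    frame[3:6, 1 + k:4 + k] = 200
    adi_a, adi_p, adi_n = update_adi(adi_a, adi_p, adi_n, ref, frame)
print(adi_p)   # the positive ADI grows where the reference object is no longer covered by the moving object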

7.2.2 Model-Based Motion Detection

Motion detection can also be carried out by means of the motion model. In the following, the camera model is considered for global motion detection.

Assume that the global motion vector (MV) [u, v]^T at a point (x, y) in the image can be calculated from its spatial coordinates and a set of model parameters (k0, k1, k2, ...); the common model can then be expressed as

\begin{cases} u = f_u(x, y, k_0, k_1, k_2, \ldots) \\ v = f_v(x, y, k_0, k_1, k_2, \ldots) \end{cases}                 (7.6)

To estimate the model parameters, a sufficient number of observation points should first be selected from adjacent frames. Some matching algorithm is then used to derive the MVs of these points, and finally a parameter fitting method is applied to estimate the model parameters. Many methods have been proposed to estimate the global motion model; each has its own characteristics in the selection of observation points, the matching algorithm, the motion model, and the motion estimation method.

Equation (7.6) represents a general model. In practice, more simplified models are often used. Six types of camera motion are commonly considered:

1. Panning: the camera (axis) rotates horizontally;

2. Tilting: the camera (axis) rotates vertically;

3. Zooming: the camera changes its focal length (longer or shorter);

4. Tracking: the camera moves horizontally (laterally);

5. Booming: the camera moves vertically (transversely);

6. Dollying: the camera moves back and forth (horizontally).

These six types of camera movement can also be combined to form new operations that constitute three categories (Jeannin, 2000):

1. Shifting operation;

2. Rotating operation;

3. Scaling operation.

For general applications, the linear affine model with 6 parameters is:

\begin{cases} u = k_0 x + k_1 y + k_2 \\ v = k_3 x + k_4 y + k_5 \end{cases}                (7.7)

The affine model is a linear polynomial parameter model that is mathematically easy to handle. In order to improve the descriptive ability of the global motion model, some extensions can be made on the basis of the affine model. For example, by adding the quadratic term xy to the polynomial model, a bilinear parameter model is obtained:

\begin{cases} u = k_0 x y + k_1 x + k_2 y + k_3 \\ v = k_4 x y + k_5 x + k_6 y + k_7 \end{cases}                (7.8)

A global MV detection method based on the bilinear model is as follows (Yu, 2001b). To estimate the eight parameters of the bilinear model, a group of observed MVs at no fewer than four points is required (providing at least eight equations). When obtaining the observed MVs, taking into account that global motion values may be relatively large, the whole frame is often divided into a number of square blocks (such as 16 × 16), and the observed MVs are then computed by a block matching method. By selecting a larger matching block size, the offset between the global MV and the matched MV caused by local motion can be reduced, in order to obtain more accurate global motion observations.
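As a sketch of this estimation step, the fragment below fits the eight bilinear parameters of eq. (7.8) to a set of observed block MVs by ordinary least squares; the synthetic observation points and the "true" parameter values are fabricated purely to check the fit, and a real system would obtain the MVs from block matching and add robust estimation against local-motion outliers.

import numpy as np

def fit_bilinear_model(points, mvs):
    # points: (M, 2) block-centre coordinates (x, y), M >= 4
    # mvs:    (M, 2) observed motion vectors (u, v)
    # Returns the eight parameters (k0, ..., k7) of eq. (7.8).
    x, y = points[:, 0], points[:, 1]
    design = np.stack([x * y, x, y, np.ones_like(x)], axis=1)   # [xy, x, y, 1]
    ku = np.linalg.lstsq(design, mvs[:, 0], rcond=None)[0]      # k0..k3 for u
    kv = np.linalg.lstsq(design, mvs[:, 1], rcond=None)[0]      # k4..k7 for v
    return np.concatenate([ku, kv])

# Synthetic check: generate MVs from known parameters and recover them.
rng = np.random.default_rng(1)
pts = rng.uniform(0, 160, size=(20, 2))
k_true = np.array([0.001, 0.05, -0.02, 1.5, -0.002, 0.01, 0.04, -0.8])
u = k_true[0] * pts[:, 0] * pts[:, 1] + k_true[1] * pts[:, 0] + k_true[2] * pts[:, 1] + k_true[3]
v = k_true[4] * pts[:, 0] * pts[:, 1] + k_true[5] * pts[:, 0] + k_true[6] * pts[:, 1] + k_true[7]
print(np.allclose(fit_bilinear_model(pts, np.stack([u, v], axis=1)), k_true))   # True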

Example 7.1 Global motion detection based on the bilinear model.

A real example of global motion detection based on the bilinear model is presented in Figure 7.3, in which the MVs obtained by the block matching algorithm are superimposed on the original image (as short line segments starting from the centers of the blocks) to express the movement of each block.

It can be seen that the right part of this image has some motions with higher velocity. This is because the center of the camera's zooming is located at (or the optical axis points to) the left part of the goalkeeper. On the other hand, there are still some motions caused by local objects (the football players), so at these locations the MVs computed by the block matching algorithm differ from the global MVs (as shown in the vicinity of each player). In addition, the block matching method may generate some random error data in low-texture regions of the image, such as the background (near the grandstand). The presence of these errors can result in abnormal data points in the image.

Figure 7.3: Motion vector values obtained by the direct block matching algorithm.

7.3 Moving Object Detection

In the above section, some basic motion detection methods have been introduced. Here, two further technical categories are presented that are more suitable for the detection of moving objects (with local motion).

7.3.1 Background Modeling

Background modeling is a wide-ranging idea for motion and object detection and can be realized with different techniques, so it is also seen as a general term for a class of motion detection methods.

7.3.1.1 Basic Principle

Motion detection is to find the motion information in the scene. An intuitive approach is to compare the current frame with the original background containing no motion information; the difference resulting from the comparison indicates the movement. First consider a simple case: there is one moving object in a stationary scene (background), so the difference caused by the movement of this object in two adjacent video frames appears at the corresponding places. In this case, computing a difference image (as above) can detect moving objects and locate their positions.

Calculating a difference image is a simple and fast method for motion detection, but the result is not good enough in a number of cases. This is because the calculation of difference images detects all light variations, environmental fluctuations (background noise), camera shake, etc., together with the object motion. This problem is especially serious when the first frame is taken as the reference frame. Therefore, the true motion of objects can only be detected in very tightly controlled situations (such as an unchanging environment and background).

A more reasonable idea for motion detection is not to assume that the background is entirely stationary, but to calculate and maintain a dynamic background frame (satisfying some model). This is the basic idea of background modeling.

A simple background modeling method uses the mean or median of the N frames prior to the current frame to determine and update the background value at each pixel, with a cycle of N frames. One particular algorithm includes the following steps:

1. Acquire the first N frames, determine the N values at each pixel location, and take their average as the current background value;

2. Acquire the (N + 1)th frame, and compute the difference between the current frame and the current background at each pixel (the difference can be thresholded to eliminate or reduce noise);

3. Use smoothing or a combination of morphological operations to remove very small difference regions and to fill holes in large regions; the preserved regions should represent the moving objects in the scene;

4. Update the average value at each pixel location by incorporating the information of the (N + 1)th frame;

5. Return to step (2) and consider the next frame.

This average-based approach to maintaining the background value is relatively simple and requires little computation, but the result is not very good when there are multiple objects or slowly moving objects in the scene.
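A minimal sketch of steps (1)–(5) is given below, assuming gray-level frames. The running-average update in step (4), the threshold Tg = 25, and the learning rate alpha = 0.05 are simplifying assumptions (the exact N-frame mean or median could be maintained instead, and step (3) is only indicated by a comment).

import numpy as np

def moving_object_masks(frames, n_train=10, t_g=25, alpha=0.05):
    # frames: iterable of gray-level frames (H, W); yields a foreground mask per frame after the first n_train.
    frames = iter(frames)
    # Step 1: average the first N frames as the initial background.
    background = np.mean([next(frames).astype(np.float32) for _ in range(n_train)], axis=0)
    for frame in frames:
        f = frame.astype(np.float32)
        # Step 2: threshold the difference between the current frame and the background.
        mask = np.abs(f - background) > t_g
        # Step 3 (omitted here): clean the mask with smoothing/morphology and fill holes.
        yield mask
        # Step 4: update the background; a running average replaces the exact N-frame mean.
        background = (1 - alpha) * background + alpha * f

# Usage sketch: masks = list(moving_object_masks(sequence_of_frames, n_train=10))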

7.3.1.2 Typical Methods in Applications

Here, some typical and basic background modeling methods used in real situations are introduced. They divide the foreground extraction process into two steps: model training and actual detection. A mathematical model of the background is established through training, and this model is used during detection to eliminate the background and obtain the foreground.

Approach Based on Single Gaussian Model The single-Gaussian-model-based approach considers that the values of a pixel follow a Gaussian distribution over the video sequence. Concretely, for each fixed pixel location, the mean μ and variance σ² of the pixel values at that position over the N frames of the training sequence are calculated, from which a unique single Gaussian background model is identified. During motion detection, background subtraction is carried out to calculate the difference between the pixel value in the current frame and the pixel value of the background model; the difference is then compared with a threshold T (often taken as three times the standard deviation), i.e., according to whether |f − μ| ≤ 3σ, the pixel is determined as belonging to the background or the foreground.
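A compact sketch of this per-pixel model is shown below; it assumes a foreground-free training sequence, and the factor k = 3 as well as the lower bound on σ are the only added assumptions.

import numpy as np

def train_single_gaussian(training_frames):
    # Per-pixel mean and standard deviation over the N training frames (no foreground assumed).
    stack = np.stack([f.astype(np.float32) for f in training_frames], axis=0)
    return stack.mean(axis=0), stack.std(axis=0)

def foreground_mask(frame, mu, sigma, k=3.0):
    # Pixels farther than k*sigma from the background mean are labelled foreground.
    return np.abs(frame.astype(np.float32) - mu) > k * np.maximum(sigma, 1.0)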

This model is relatively simple but imposes rather stringent application conditions. For example, it requires that the light intensity does not change significantly over a long time and that the moving foreground casts only small shadows on the background during the detection period. Another disadvantage is its high sensitivity to changes of light (light intensity), which can cause the model to become invalid (both the mean and the variance change). When a moving foreground exists in the scene, a large false alarm rate may result, because with only one model the moving foreground cannot be separated from the stationary background.

Approach Based on Video Initialization In the case where the background of the training sequence is stationary but a moving foreground exists, if the background value of each pixel can first be extracted and the moving foreground separated from the background, then background modeling can be carried out and the foregoing problem can be overcome. This process can also be seen as initializing the training video before background modeling, so that the influence of the moving foreground on the background modeling is filtered out.

In practice, a minimum length threshold T1 is first set for the N training frames containing moving foreground; then, for each pixel location, the temporal sequence of length N is split into several sub-sequences {Lk}, k = 1, 2, ..., and by means of this threshold the sub-sequence with longer length and smaller variance is selected as the background sequence.

By this initialization, the case in which the background is stationary while a moving foreground exists in the training sequence is transformed into the case in which the background is stationary and there is no moving foreground in the training sequence. In this situation, the approach based on the single Gaussian model (as discussed above) can still be used for background modeling.
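One possible realization of this per-pixel sub-sequence selection is sketched below; the splitting rule (consecutive samples must stay within a tolerance tol) and the ranking by variance with length as a tie-break are assumptions of this sketch, not details prescribed by the text.

import numpy as np

def background_subsequence(values, tol=10.0, t1=20):
    # values: 1-D array of one pixel's gray levels over the N training frames.
    # Returns the chosen background sub-sequence, or None if no candidate reaches length t1.
    subs, current = [], [values[0]]
    for v in values[1:]:
        if abs(float(v) - float(current[-1])) <= tol:
            current.append(v)        # still within the same stable sub-sequence
        else:
            subs.append(current)     # start a new sub-sequence
            current = [v]
    subs.append(current)
    candidates = [np.asarray(s, dtype=np.float32) for s in subs if len(s) >= t1]
    if not candidates:
        return None
    # Prefer sub-sequences with small variance, breaking ties by longer length.
    return min(candidates, key=lambda s: (s.var(), -len(s)))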

Approach Based on Gaussian Mixture Model When movement also exists in the background of the training sequence, the single Gaussian model no longer gives satisfactory results. In this case, a more robust and effective approach is to model each pixel by a mixture of Gaussian distributions, namely a Gaussian mixture model (GMM). In other words, modeling is performed for each state of the background. The model parameters of a state are updated according to the data belonging to that state, in order to handle a moving background in background modeling. Depending on the local properties, some Gaussian distributions represent the foreground while others represent the background. The following algorithm can distinguish these Gaussian distributions.

The basic GMM method reads the N training frames sequentially, each time carrying out iterative modeling for every pixel. Suppose a pixel has gray level f(t) at time t, t = 1, 2, ...; then f(t) can be modeled by K Gaussian distributions N(μk, σk²), k = 1, ..., K, where K is the maximum number of Gaussian distributions allowed for each pixel. A Gaussian distribution changes over time with the change of the scene, so it is a function of time and can be written as

N_k(t) = N[\mu_k(t), \sigma_k^2(t)],      k = 1, \ldots, K                 (7.9)

The main concern in the choice of K is computational efficiency; K often takes values of 3–7.

At the beginning of training, an initial standard deviation is set. When a new image is read, its pixel values are used to update the original pixel values of the background model. Each Gaussian distribution is given a weight wk(t) (the sum of all weights is 1), so the probability of observing f(t) is

P[f(t)] = \sum_{k=1}^{K} w_k(t) \frac{1}{\sqrt{2\pi}\,\sigma_k(t)} \exp\left[ -\frac{[f(t) - \mu_k(t)]^2}{2\sigma_k^2(t)} \right]                          (7.10)

The EM algorithm can be used to update the parameters of the Gaussian distributions, but it is computationally expensive. An easier way is to compare each pixel value with each Gaussian distribution: if it falls within 2.5 standard deviations of the mean, it is considered a match, that is, it fits the model and can be used to update the mean and variance of that model. If the number of models for the current pixel is less than K, a new model is established for this pixel. If more than one match appears, the best one can be chosen.

If a match is found with the Gaussian distribution l, then:

w_k(t) = \begin{cases} (1 - \alpha) w_k(t-1) & k \neq l \\ (1 - \alpha) w_k(t-1) + \alpha & k = l \end{cases}                     (7.11)

Then the weights are renormalized. In eq. (7.11), α is a learning constant and 1/α determines the rate of parameter change. The parameters of the matched Gaussian distribution l can be updated as follows:

\mu_l(t) = (1 - b) \mu_l(t-1) + b f(t)               (7.12)

\sigma_l^2(t) = (1 - b) \sigma_l^2(t-1) + b [f(t) - \mu_l(t)]^2                 (7.13)

where

b = \alpha P[f(t) \mid \mu_l, \sigma_l^2]                    (7.14)

If no match is found, the Gaussian distribution with the minimum weight can be replaced by a new Gaussian distribution whose mean is f(t). Compared with the other K − 1 Gaussian distributions, it has a higher variance and a lower weight, so it may later become part of the local background. If all K models have been checked and none meets the matching condition, the model with the smallest weight is replaced by a new model whose mean is the pixel value, and an initial standard deviation is set. Continue in this way until all the training images have been processed.
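The following sketch performs one such per-pixel update, combining the 2.5σ match test with eqs. (7.11)–(7.14); the values of α, K, the initial variance, and the use of Python lists for the per-pixel state are assumptions of this sketch, not prescriptions of the text.

import numpy as np

def update_pixel_gmm(f, w, mu, var, alpha=0.01, big_k=5, match_sigma=2.5, init_var=225.0):
    # One training update for a single pixel value f; w, mu, var are lists of the current Gaussians.
    matched = None
    for k in range(len(w)):
        if abs(f - mu[k]) <= match_sigma * np.sqrt(var[k]):
            matched = k
            break
    if matched is not None:
        l = matched
        # Eq. (7.11): raise the matched weight, decay the others (renormalized below).
        for k in range(len(w)):
            w[k] = (1 - alpha) * w[k] + (alpha if k == l else 0.0)
        # Eq. (7.14): learning rate for the matched Gaussian.
        p = np.exp(-(f - mu[l]) ** 2 / (2 * var[l])) / np.sqrt(2 * np.pi * var[l])
        b = alpha * p
        mu[l] = (1 - b) * mu[l] + b * f                       # eq. (7.12)
        var[l] = (1 - b) * var[l] + b * (f - mu[l]) ** 2      # eq. (7.13)
    elif len(w) < big_k:
        # No match and room left: start a new Gaussian at the observed value.
        w.append(alpha); mu.append(float(f)); var.append(init_var)
    else:
        # No match and K models in use: replace the one with the smallest weight.
        l = int(np.argmin(w))
        w[l], mu[l], var[l] = alpha, float(f), init_var
    s = sum(w)
    return [x / s for x in w], mu, var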

At this point, the Gaussian distribution most likely to account for the gray value of the current pixel can be determined. It is then necessary to decide whether that distribution belongs to the foreground or the background. This may be determined by means of a constant B corresponding to the whole observation process: supposing that the proportion of background pixels in all frames is greater than B, all Gaussian distributions can be ranked according to wk(t)/σk(t); a higher value indicates a large weight, a small variance, or both, and these cases correspond to situations in which the given pixel is likely to belong to the background.

Approach Based on Codebook In the codebook-based method, each pixel is represented by a codebook; a codebook may comprise one or more code words, and each code word represents a state (Kim, 2004). The initial codebook is learned from a group of training frames; there is no restriction on their content, which may contain moving foreground or a moving background. Next, a time-domain filter is used to filter out the code words representing foreground motion while keeping the code words representing the background, and a spatial filter is used to recover code words that were wrongly filtered out, so as to reduce the false alarms caused by sporadic foreground regions appearing in the background. Such a codebook represents a compressed form of the background model of a video sequence.
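A greatly simplified, gray-level-only sketch of the codebook idea is given below; the matching tolerance eps, the running-mean update, and the crude temporal filter based on frequency and last access are stand-ins for the brightness bounds, colour distortion measure, and negative-run-length filter of the original method, and are assumptions of this sketch only.

class PixelCodebook:
    # One pixel's codebook; each code word stores [mean, frequency, last_access_frame].
    def __init__(self, eps=10.0):
        self.eps = eps
        self.words = []

    def train(self, f, t):
        # Match the value against existing code words, or create a new one.
        for word in self.words:
            if abs(f - word[0]) <= self.eps:
                word[0] = (word[0] * word[1] + f) / (word[1] + 1)   # running mean
                word[1] += 1
                word[2] = t
                return
        self.words.append([float(f), 1, t])

    def keep_background_words(self, n_frames, max_gap):
        # Crude temporal filtering: drop rarely refreshed words (likely moving foreground).
        self.words = [w for w in self.words if (n_frames - w[2]) <= max_gap and w[1] > 1]

    def is_background(self, f):
        return any(abs(f - w[0]) <= self.eps for w in self.words)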

7.3.1.3 Some Experimental Results

Background modeling is a training-testing process. It uses some frames at the beginning of a sequence to train a background model; this model is then applied to the remaining frames, and motion is detected according to the difference between the current frame and the background model. In the simplest case, the background is stationary in the training sequence and there is no moving foreground. More complicated situations include: the background is stationary but there is moving foreground in the training sequence; the background is not static but there is no moving foreground in the training sequence. The most complicated situation is that the background is not static and there is also moving foreground in the training sequence. In the following, some experimental results obtained by background modeling for the first three cases are illustrated (Li, 2006).

The experimental data come from three sequences of an open-access, general-purpose video library (Toyama, 1999). Each sequence contains a total of 150 color images, each with a resolution of 160 × 120. In the experiments, image editing software was used to produce binary reference results; each background modeling method was then used to detect the moving object in the sequence, giving a binary test result. For each sequence, 10 images were selected for testing. The test results are compared with the reference results, and the average detection rate (the ratio of the number of foreground pixels detected to the real number of foreground pixels) and the average false alarm rate (the ratio of the number of non-foreground pixels detected as foreground to the number of pixels detected as foreground) are collected.

Results for Stationary Background with No Moving Foreground A set of experimental results is shown in Figure 7.4. In the sequence used, the initial scene consists only of a stationary background, and the goal is to detect the person who subsequently enters the scene. Figure 7.4(a) shows the scene after a person has entered. Figure 7.4(b) shows the reference result. Figure 7.4(c) gives the test result obtained with the method based on a single Gaussian model. The detection rate of this method is only 0.473, while the false alarm rate is 0.0569. It can be seen from Figure 7.4(c) that many pixels have not been detected (in regions of low gray-level pixels), and some erroneously detected pixels are also found on the background.

Results for Stationary Background with Moving Foreground A set of experimental results is shown in Figure 7.5. In the sequence used, the initial scene contains a person who leaves later, and the goal is to detect this person. Figure 7.5(a) shows the scene before the person has left. Figure 7.5(b) shows the reference result. Figure 7.5(c) gives the test result obtained with the method based on video initialization. Figure 7.5(d) gives the test result obtained with the method based on the codebook.

Figure 7.4: Results for stationary background with no moving foreground.
Figure 7.5: Results for stationary background with moving foreground.

Table 7.1: Statistical results for stationary background with moving foreground.

Method                           Detection rate    False alarm rate
Based on video initialization    0.676             0.051
Based on codebook                0.880             0.025

The comparison of the two methods shows that the detection rate of the codebook-based method is higher than that of the method based on video initialization, and its false alarm rate is lower. This is because the codebook method constructs multiple code words for each pixel, which improves the detection rate, while the spatial filter used in the detection process reduces the false alarm rate. Some statistical results are given in Table 7.1.

Results for Moving Background Without Moving Foreground A set of experimental results is shown in Figure 7.6. In the sequence used, the initial scene has a swaying tree in the background, and the goal is to detect the person who enters afterward. Figure 7.6(a) shows the scene after the person has entered. Figure 7.6(b) shows the reference result. Figure 7.6(c) gives the test result obtained with the method based on the Gaussian mixture model. Figure 7.6(d) gives the test result obtained with the method based on the codebook.

The comparison of the two methods shows that both have models specifically designed for a moving background, so higher detection rates are achieved (the former has a slightly higher rate than the latter). Because the former does not have processing steps corresponding to the spatial filter of the latter method, its false alarm rate is a little higher than that of the latter. Specific statistical data are given in Table 7.2.

Figure 7.6: Results for moving background with no moving foreground.

Table 7.2: Statistical results for moving background without moving foreground.

Method                             Detection rate    False alarm rate
Based on Gaussian mixture model    0.951             0.017
Based on codebook                  0.939             0.006

Finally, it should be noted that the method based on the single Gaussian model is relatively simple, but the situations in which it can be used are limited, as it only applies to a stationary background without moving foreground. The other methods try to overcome the limitations of the single-Gaussian-model-based method, but their common problem is that if the background needs to be updated, the entire background model has to be recalculated, rather than just updating the parameters with a simple iteration.

7.3.2 Optical Flow

The movement of objects in a scene makes the objects appear at different relative positions in images captured during the movement. This difference in position may be called parallax, and it corresponds to a displacement vector (with magnitude and direction) in the image. If the parallax is divided by the time difference, the velocity vector (also called the instantaneous displacement vector) is obtained. All the velocity vectors together (they may differ from one another) constitute a vector field, which in many cases is also referred to as the optical flow field (finer distinctions will be discussed in Volume III of this book set).

7.3.2.1 Optical Flow Equation

Let a particular image point be at (x, y) at time t; this point moves to (x + dx, y + dy) at time t + dt. If the time interval dt is small, it may be assumed that the gray level of this image point remains unchanged. In other words:

f(x,y,t)=f(x+dx,y+dy,t+dt)                (7.15)

The right-hand side may be expanded as a Taylor series; letting dt → 0, taking the limit, and omitting the higher-order terms gives

\frac{\partial f}{\partial t} + \frac{\partial f}{\partial x}\frac{\partial x}{\partial t} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial t} = \frac{\partial f}{\partial t} + \frac{\partial f}{\partial x} u + \frac{\partial f}{\partial y} v = 0               (7.16)

where u and v are the moving speeds of the image point in the X and Y directions, respectively, which together constitute the velocity vector. Let

f_x = \partial f / \partial x        f_y = \partial f / \partial y      f_t = \partial f / \partial t              (7.17)

Then, the optical flow equation is obtained:

[f_x, f_y][u, v]^T = -f_t             (7.18)

The optical flow equation shows that the rate of change of the gray level in time at a point is determined by the product of the rate of change of the gray level in space at this point and the moving velocity of this point.

In practice, the gray-level changing rate in time can be estimated with the first-order differential average along the time axis:

f_t \approx \frac{1}{4}[f(x, y, t+1) + f(x+1, y, t+1) + f(x, y+1, t+1) + f(x+1, y+1, t+1)] - \frac{1}{4}[f(x, y, t) + f(x+1, y, t) + f(x, y+1, t) + f(x+1, y+1, t)]                    (7.19)

The gray-level changing rate in space can be estimated with the first-order differential average along X and Y directions:

f_x \approx \frac{1}{4}[f(x+1, y, t) + f(x+1, y+1, t) + f(x+1, y, t+1) + f(x+1, y+1, t+1)] - \frac{1}{4}[f(x, y, t) + f(x, y+1, t) + f(x, y, t+1) + f(x, y+1, t+1)]                    (7.20)

f_y \approx \frac{1}{4}[f(x, y+1, t) + f(x+1, y+1, t) + f(x, y+1, t+1) + f(x+1, y+1, t+1)] - \frac{1}{4}[f(x, y, t) + f(x+1, y, t) + f(x, y, t+1) + f(x+1, y, t+1)]                    (7.21)

7.3.2.2 Optical Flow Estimation Using Least Squares

After substituting eqs. (7.19)–(7.21) into eq. (7.18), the optical flow components u and v can be estimated with the help of least squares. In two adjacent images f(x, y, t) and f(x, y, t + 1), pixels with the same u and v are selected at N different positions. Let f̂_t(k), f̂_x(k), and f̂_y(k) represent the estimates of f_t, f_x, and f_y at the kth position (k = 1, 2, ..., N), respectively:

\mathbf{f}_t = [\hat{f}_t(1)\ \hat{f}_t(2)\ \cdots\ \hat{f}_t(N)]^T    \qquad F_{xy} = \begin{bmatrix} \hat{f}_x(1) & \hat{f}_y(1) \\ \hat{f}_x(2) & \hat{f}_y(2) \\ \vdots & \vdots \\ \hat{f}_x(N) & \hat{f}_y(N) \end{bmatrix}                 (7.22)

Then the least squares estimates of u and v are:

[u\ v]^T = -\left( F_{xy}^T F_{xy} \right)^{-1} F_{xy}^T \mathbf{f}_t               (7.23)
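The sketch below estimates f_x, f_y, f_t with the 2 × 2 × 2 averages of eqs. (7.19)–(7.21) and then solves eq. (7.23) for a single (u, v) shared by all supplied positions; treating every pixel of the two input frames as the N observation positions is an assumption made only to keep the example short.

import numpy as np

def flow_derivative_estimates(f0, f1):
    # Averages of eqs. (7.19)-(7.21) over 2x2x2 neighbourhoods; arrays are indexed [y, x].
    a = f0.astype(np.float32)
    b = f1.astype(np.float32)
    ft = 0.25 * ((b[:-1, :-1] + b[:-1, 1:] + b[1:, :-1] + b[1:, 1:])
                 - (a[:-1, :-1] + a[:-1, 1:] + a[1:, :-1] + a[1:, 1:]))
    fx = 0.25 * ((a[:-1, 1:] + a[1:, 1:] + b[:-1, 1:] + b[1:, 1:])
                 - (a[:-1, :-1] + a[1:, :-1] + b[:-1, :-1] + b[1:, :-1]))
    fy = 0.25 * ((a[1:, :-1] + a[1:, 1:] + b[1:, :-1] + b[1:, 1:])
                 - (a[:-1, :-1] + a[:-1, 1:] + b[:-1, :-1] + b[:-1, 1:]))
    return fx, fy, ft

def flow_least_squares(fx, fy, ft):
    # Eq. (7.23): one (u, v) shared by all positions, solving F_xy [u v]^T = -f_t in the least squares sense.
    design = np.stack([fx.ravel(), fy.ravel()], axis=1)   # N x 2 matrix F_xy
    return -np.linalg.lstsq(design, ft.ravel(), rcond=None)[0]

# Usage sketch: fx, fy, ft = flow_derivative_estimates(frame_t, frame_t_plus_1); u, v = flow_least_squares(fx, fy, ft)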
