0 INNER PRODUCT SPACES

0.1 MOTIVATION

For two vectors X = (x1, x2, x3), Y = (y1, y2, y3) in R3, the standard (Euclidean) inner product of X and Y is defined as

c00e001

This definition is partly motivated by the desire to measure the length of a vector, which is given by the Pythagorean Theorem:

c00e002

The goal of this chapter is to define the concept of an inner product in a more general setting that includes a wide variety of vector spaces. We are especially interested in the inner product defined on vector spaces whose elements are signals (i.e., functions of time).

0.2 DEFINITION OF INNER PRODUCT

The definition of an inner product in R3 naturally generalizes to Rn for any dimension n. For two vectors X = (x1, x2, …, xn), Y = (y1, y2,… , yn) in Rn, the Euclidean inner product is

c00e003

When we study Fourier series and the Fourier transform, we will use the complex exponential. Thus, we must consider complex vector spaces as well as real ones. The preceding definition can be modified for vectors in Cn by conjugating the second factor. Recall that the conjugate of a complex number z = x + iy is defined as c00ie006. Note that c00ie007, which by definition is |z|2 [the square of the length of z = x + iy regarded as a vector in the plane from (0, 0) to (x, y)].

If Z = (z1, z2,…, zn), W = (w1, w2,..., wn) are vectors in Cn, then

c00e004

The purpose of the conjugate is to ensure that the length of a vector in Cn is real:

c00e005
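
As a quick numerical illustration (added here, not part of the original text), the following Python/NumPy sketch computes this inner product on Cn and checks that 〈Z, Z〉 is real and equals the squared length of Z.

```python
import numpy as np

def cn_inner(z, w):
    """Inner product on C^n: sum of z_k * conjugate(w_k)."""
    return np.sum(z * np.conj(w))

Z = np.array([1 + 2j, 3 - 1j])
W = np.array([2j, 1 + 1j])

print(cn_inner(Z, W))          # in general a complex number
print(cn_inner(Z, Z))          # (1+4) + (9+1) = 15, purely real
print(np.linalg.norm(Z)**2)    # agrees with ||Z||^2
```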

The inner products just defined share certain properties. For example, the inner product is bilinear, which means

c00e006

The rest of the properties satisfied by the aforementioned inner products are set down as axioms in the following definition. We leave the verification of these axioms for the inner products for Rn and Cn as exercises.

Definition 0.1 An inner product on a complex vector space V is a function 〈·, ·〉 : V × V → C that satisfies the following properties.

  • Positivity: 〈v, v〉 > 0 for each nonzero v ∈ V.
  • Conjugate symmetry: c00ie001 for all vectors v and w in V.
  • Homogeneity: 〈cv, w〉 = c〈v, w〉 for all vectors v and w in V and scalars c ∈ C.
  • Linearity: 〈u + v, w〉 = 〈u, w〉 + 〈v, w〉 for all u, v, w ∈ V.

A vector space with an inner product is called an inner product space.

To emphasize the underlying space V, we sometimes denote the inner product on V by

c00e007

The preceding definition also serves to define a real inner product on a real vector space except that the scalar c in the homogeneity property is real and there is no conjugate in the statement of conjugate symmetry.

Note that the second and fourth properties imply linearity in the second factor: 〈u, v + w〉 = 〈u, v〉 + 〈u, w〉. The second and third properties imply that scalars factor out of the second factor with a conjugate:

c00e008

The positivity condition means that we can assign the nonzero number, ||v|| = c00ie002, as the length or norm of the vector v. The notion of length gives meaning to the distance between two vectors in V, by declaring that

c00e009

Note that the positivity property of the inner product implies that the only way ||v – w|| = 0 is when v = w. This notion of distance also gives meaning to the idea of a convergent sequence {vk; k = 1, 2,...}; namely, we say that

c00e010

In words, vk → v if the distance between vk and v gets small as k gets large.

Here are some further examples of inner products.

Example 0.2 Let V be the space of polynomials p = anxn +…+ a1x + a0, with aj ∈ C. An inner product on V is given as follows: If p = a0 + a1x +… + anxn and q = b0 + b1x +… + bnxn, then

c00e011

Note that this inner product space looks very much like Cn+1 where we identify a point (a0,..., an) ∈ Cn+1 with a0 + a1x + … + anxn.

Example 0.3 Different inner products can be imposed on the same vector space. This example defines an inner product on C2 that is different from the standard Euclidean inner product. For vectors v = (v1, v2) and w = (w1, w2) in C2, define

c00e012

There is nothing special about the particular choice of matrix. We can replace the matrix in the preceding equation with any matrix A as long as it is Hermitian symmetric (meaning that c00ie003, which is needed for conjugate symmetry) and positive definite (meaning that all eigenvalues are positive, which is needed for the positivity axiom). Verification of these statements will be left as exercises.
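
The specific matrix used in the preceding equation is not reproduced here; the following sketch (an added illustration) uses a sample Hermitian, positive definite matrix A and the assumed form 〈v, w〉 = w*Av to check conjugate symmetry and positivity numerically.

```python
import numpy as np

# A sample Hermitian, positive definite matrix (not the one in the text).
A = np.array([[2.0, 1j],
              [-1j, 3.0]])

def inner_A(v, w):
    """<v, w> = w^* A v  (conjugate-transpose of w, times A, times v)."""
    return np.conj(w) @ A @ v

v = np.array([1.0 + 1j, 2.0])
w = np.array([0.5, -1j])

# Conjugate symmetry: <v, w> = conjugate(<w, v>)
print(np.isclose(inner_A(v, w), np.conj(inner_A(w, v))))   # True

# Positivity: <v, v> is real and positive because the eigenvalues of A are positive.
print(np.linalg.eigvalsh(A))      # both eigenvalues > 0
print(inner_A(v, v))              # real and positive (here exactly 20)
```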

0.3 THE SPACES L2 AND l2

0.3.1 Definitions

The examples in the last section are all finite-dimensional. In this section, we discuss a class of infinite-dimensional vector spaces which is particularly useful for analyzing signals. A signal (for example, a sound signal) can be viewed as a function, f(t), which indicates the intensity of the signal at time t. Here t varies in an interval a ≤ t ≤ b which represents the time duration of the signal. Here, a could be –∞ or b could be +∞.

We will need to impose a growth restriction on the functions defined on the interval a ≤ t ≤ b. This leads to the following definition.

Definition 0.4 For an interval a ≤ t ≤ b, the space L2([a, b]) is the set of all square integrable functions defined on a ≤ t ≤ b. In other words,

c00e013

Functions that are discontinuous are allowed as members of this space. All the examples considered in this book are either continuous or discontinuous at a finite set of points. In this context, the preceding integral can be interpreted in the elementary Riemann sense (the one introduced in freshman calculus). The definition of L2 allows functions whose set of discontinuities is quite large, in which case the Lebesgue integral must be used. The condition c00ie004 physically means that the total energy of the signal is finite (which is a reasonable class of signals to consider).

The space L2[a, b] is infinite-dimensional. For example, if a = 0 and b = 1, then the set of functions {1, t, t2, t3...} is linearly independent and belongs to L2[0, 1]. The function f(t) = 1/t is an example of a function that does not belong to L2[0, 1] since c00ie005.

L2 Inner Product. We now turn our attention to constructing an appropriate inner product on L2[a, b]. To motivate the L2 inner product, we discretize the interval [a, b]. To simplify matters, let a = 0 and b = 1. Let N be a large positive integer and let tj = j/N for 1 ≤ j ≤ N. If f is continuous, then the values of f on the interval [tj, tj+1) can be approximated by f(tj). Therefore, f can be approximated by the vector

c00e014

as illustrated in Figure 0.1. As N gets larger, fN becomes a better approximation to f.

If f and g are two signals in L2[0, 1], then both signals can be discretized as fN and gN. One possible definition of 〈f, g〉L2 is to examine the ordinary RN inner product of fN and gN as N gets large:

c00e015

Figure 0.1. Approximating a continuous function by discretization.

c00f001

The trouble with this approach is that as N gets large, the sum on the right typically gets large. A better choice is to consider the averaged inner product:

c00e016

Since fN and gN approach f and g as N gets large, a reasonable definition of 〈f, g〉L2 is to take the limit of this averaged inner product as N → ∞.

The preceding equation can be written as

c00e017

The sum on the right is a Riemann sum approximation to c00ie008 over the partition 0 < t1 < t2 < ... < tN = 1 of [0, 1]. This approximation gets better as N gets larger. Thus, a reasonable definition of an inner product on L2[0, 1] is c00ie009. This motivation provides the basis for the following definition.
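
As a numerical check of this motivation (an added illustration), the averaged discrete inner product converges to the integral; here with f(t) = t and g(t) = t2, whose L2[0, 1] inner product is 1/4.

```python
import numpy as np

f = lambda t: t
g = lambda t: t**2          # real-valued, so the conjugate has no effect

exact = 0.25                # integral of t * t^2 over [0, 1]
for N in (10, 100, 1000, 10000):
    t = np.arange(N) / N                       # t_j = j/N for j = 0,...,N-1
    approx = np.sum(f(t) * np.conj(g(t))) / N  # averaged discrete inner product
    print(N, approx, abs(approx - exact))      # error shrinks as N grows
```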

Definition 0.5 The L2 inner product on L2([a, b]) is defined as

c00e018

The conjugate symmetry, homogeneity, and bilinearity properties are all easily established for this inner product and we leave them as exercises.

For the positivity condition, if c00ie010 and if f is continuous, then f(t) = 0 for all t (see Exercise 4). If f(t) is allowed to be discontinuous at a finite number of points, then we can only conclude that f(t) = 0 at all but a finite number of t values. For example, the function

c00e019

is not the zero function, yet ∫ |f(t)|2 dt = 0, where the integral is taken over –1 ≤ t ≤ 1. However, we stipulate that two elements f and g in L2([a, b]) are equal if f(t) = g(t) for all values of t except for a finite number of t values (or, more generally, a set of measure zero if the Lebesgue integral is used). This is a reasonable definition for the purposes of integration, since c00ie011 for such functions. With this convention, the positivity condition holds.

This notion of equivalence is reasonable from the point of view of signal analysis. The behavior of a signal at one instant in time (say t = 0) is rarely important. The behavior of a signal over a time interval of positive length is important. Although measure theory and the Lebesgue integral are not used in this text, we digress to discuss this topic just long enough to put the notion of equivalence discussed in the previous paragraph in a broader context. The concept of measure of a set generalizes the concept of length of an interval. The measure of an interval {a < t < b} is defined to be b – a. The measure of a disjoint union of intervals is the sum of their lengths. So the measure of a finite (or countably infinite) set of points is zero. The measure of a more complicated set can be determined by decomposing it into a limit of sets that are disjoint unions of intervals. Since intervals of length zero have no effect on integration, it is reasonable to expect that if a function f is zero on a ≤ t ≤ b except on a set of measure zero, then ∫ f(t) dt = 0 over this interval. The converse is also true: If

c00e020

then f(t) = 0 on a ≤ t ≤ b except possibly on a set of measure zero. For this reason, it is reasonable to declare that two functions, f and g in L2[a, b], are equivalent on [a, b] if f(t) = g(t) for all t in [a, b] except possibly for a set of measure zero. This general notion of equivalence includes the definition stated in the previous paragraph (that two functions are equivalent if they agree except at a finite number of points). For more details, consult a text on real analysis [e.g., Folland (1992)].

The Space l2. For many applications, the signal is already discrete. For example, the signal from a compact disc player can be represented by a discrete set of numbers that represent the intensity of its sound signal at regular (small) time intervals. In such cases, we represent the signal as a sequence X = ..., x–1, x0, x1,..., where each xj is the numerical value of the signal at the jth time interval [tj, tj+1]. Theoretically, the sequence could continue indefinitely (either as j → ∞ or as j → –∞ or both). In reality, the signal usually stops after some point, which mathematically can be represented by xj = 0 for |j| > N for some integer N.

The following definition describes a discrete analogue of L2.

Definition 0.6 The space l2 is the set of all sequences X = ..., x–1, x0, x1,..., xi ∈ C, with c00ie012 The inner product on this space is defined as

c00e021

for X = ..., x–1, x0, x1,... and Y = ..., y–1, y0, y1,....

Verifying that 〈·, ·〉 is an inner product for l2 is relatively easy and will be left to the exercises.
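
For sequences with only finitely many nonzero terms, the l2 inner product reduces to a finite sum. A small sketch (an added illustration, with made-up sequences):

```python
import numpy as np

# Two "sequences" with finite support, stored as arrays indexed j = -2,...,2.
x = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
y = np.array([1.0, 0.0, 1.0, 0.0, 1.0])

inner = np.sum(x * np.conj(y))           # <X, Y> = sum of x_j * conjugate(y_j)
norm_x = np.sqrt(np.sum(np.abs(x)**2))   # l2 norm of X
print(inner, norm_x)                     # 2.0 and sqrt(6)
```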

Relative Error. For two signals, f and g, the L2 norm of their difference, ||f – g||L2, provides one way of measuring how f differs from g. However, often the relative error is more meaningful:

c00e022

(the denominator could also be || g || L2). The relative error measures the L2 norm of the difference between f and g in relation to the size of || f || L2. For discrete signals, the l2 norm is used.

0.3.2 Convergence in L2 Versus Uniform Convergence

As defined in Section 0.2, a sequence of vectors {vn; n = 1, 2,...} in an inner product space V is said to converge to the vector v ∈ V provided that vn is close to v when n is large. Closeness here means that ||vn – v|| is small. To be more mathematically precise, vn converges to v if ||vn – v|| → 0 as n → ∞.

In this text, we will often deal with the inner product space L2[a,b] and therefore we discuss convergence in this space in more detail.

Definition 0.7 A sequence fn converges to f in L2[a, b] if ||fn – f||L2 → 0 as n → ∞. More precisely, given any tolerance ε > 0, there exists a positive integer N such that if n ≥ N, then ||f – fn||L2 < ε.

Convergence in L2 is sometimes called convergence in the mean. There are two other types of convergence often used with functions.

Definition 0.8

1. A sequence fn converges to f pointwise on the interval a ≤ t ≤ b if for each t ∈ [a, b] and each small tolerance ε > 0, there is a positive integer N such that if n ≥ N, then |fn(t) – f(t)| < ε.

2. A sequence fn converges to f uniformly on the interval a ≤ t ≤ b if for each small tolerance ε > 0, there is a positive integer N such that if n ≥ N, then |fn(t) – f(t)| < ε for all a ≤ t ≤ b.

For uniform convergence, the N only depends on the size of the tolerance and not on the point t, whereas for pointwise convergence, the N is allowed to also depend on the point t.

How do these three types of convergence compare? If fn uniformly converges to f on [a,b], then the values of fn are close to f over the entire interval [a,b]. For example, Figure 0.2 illustrates the graphs of two functions which are uniformly close to each other. By contrast, if fn converges to f pointwise, then for each fixed t, fn(t) is close to f(t) for large n. However, the rate at which fn(t) approaches f(t) may depend on the point t. Thus, a sequence that converges uniformly also must converge pointwise, but not conversely.

Figure 0.2. Uniform approximation.

c00f002

Figure 0.3. L2 approximation.

c00f003

If fn converges to f in L2[a, b], then, on average, fn is close to f; but for some values, fn(t) may be far away from f(t). For example, Figure 0.3 illustrates two functions that are close in L2 even though some of their function values are not close.

Example 0.9 The sequence of functions fn(t) = tn, n = 1, 2, 3,..., converges pointwise to f(t) = 0 on the interval 0 ≤ t < 1 because for any number 0 ≤ t < 1, tn → 0 as n → ∞. However, the convergence is not uniform. The rate at which tn approaches zero becomes slower as t approaches 1. For example, if t = 1/2 and ε = 0.001, then |tn| < ε provided that n ≥ 10. However, if t = 0.9, then |tn| is not less than ε until n ≥ 66.

For any fixed number r < 1, fn converges uniformly to f = 0 on the interval [0, r]. Indeed, if 0 ≤ t ≤ r, then |tn| ≤ rn. Therefore, as long as rn is less than ε, |fn(t)| will be less than ε for all 0 ≤ t ≤ r. In other words, the rate at which fn approaches zero for all points on the interval [0, r] is no worse than the rate at which rn approaches zero.

We also note that fn → 0 in L2[0, 1] because

c00e023
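
A short numerical companion to Example 0.9 (added here as an illustration): the L2[0, 1] norm of fn(t) = tn is 1/√(2n + 1), which tends to zero, while the supremum of |fn| over [0, 1) equals 1 for every n, so the convergence cannot be uniform.

```python
import numpy as np

t = np.linspace(0, 1, 100001)
for n in (1, 5, 20, 100):
    fn = t**n
    l2_norm = np.sqrt(np.sum(fn**2) / len(t))   # Riemann-sum estimate of ||f_n|| in L2[0,1]
    print(n, l2_norm, 1 / np.sqrt(2 * n + 1))   # estimate vs. exact value 1/sqrt(2n+1)
# The L2 norms tend to 0, yet sup |f_n(t)| over 0 <= t < 1 equals 1 for every n,
# so f_n -> 0 in L2[0,1] but not uniformly.
```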

As the following theorem shows, uniform convergence on a finite interval [a, b] is a stronger type of convergence than L2 convergence.

Theorem 0.10 If a sequence fn converges uniformly to f as n → ∞ on a finite interval a ≤ t ≤ b, then this sequence also converges to f in L2[a, b]. The converse of this statement is not true.

Proof. Using the definition of uniform convergence, we can choose, for a given tolerance ε > 0, an integer N such that

c00e024

This inequality implies

c00e025

Therefore, if n ≥ N, we have c00ie013. Since ε can be chosen as small as desired, this inequality implies that fn converges to f in L2.

To show that the converse is false, consider the following sequence of functions on 0 ≤ t ≤ 1.

c00e026

We leave it to the reader (see Exercise 6) to show that this sequence converges to the zero function in L2[0, 1] but does not converge to zero uniformly on 0 ≤ t ≤ 1 (in fact, fn does not even converge to zero pointwise).

In general, a sequence that converges pointwise does not necessarily converge in L2. However, if the sequence is uniformly bounded by a fixed function in L2, then pointwise convergence is enough to guarantee convergence in L2 (this is the Lebesgue Dominated Convergence Theorem; see Folland (1999)). Further examples illustrating the relationships between these three types of convergence are developed in the Exercises.

0.4 SCHWARZ AND TRIANGLE INEQUALITIES

The two most important properties of inner products are the Schwarz and triangle inequalities. The Schwarz inequality states |〈X, Y〉| ≤ ||X|| ||Y||. In R3, this inequality follows from the law of cosines:

c00e027

where θ is the angle between X and Y. The triangle inequality states ||X + Y|| ≤ ||X|| + || Y ||. In R3, this inequality follows from Figure 0.4, which expresses the fact that the shortest distance between two points is a straight line.
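
Before proving these inequalities in general, here is a quick numerical sanity check for random vectors in R3 with the Euclidean inner product (an added illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    X, Y = rng.standard_normal(3), rng.standard_normal(3)
    schwarz_ok = abs(np.dot(X, Y)) <= np.linalg.norm(X) * np.linalg.norm(Y)
    triangle_ok = np.linalg.norm(X + Y) <= np.linalg.norm(X) + np.linalg.norm(Y)
    print(schwarz_ok, triangle_ok)   # True True for every trial
```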

The following theorem states that the Schwarz and Triangle inequalities hold for general inner product spaces.

Theorem 0.11 Suppose V, 〈·, ·〉 is an inner product space (either real or complex). Then for all X, Y ∈ V we have the following:

  • Schwarz Inequality: |〈X, Y〉| ≤ ||X|| ||Y||. Equality holds if and only if X and Y are linearly dependent. Moreover, 〈X, Y〉 = ||X|| ||Y|| if and only if X or Y is a nonnegative multiple of the other.
  • Triangle Inequality: ||X + Y|| ≤ ||X|| + ||Y||. Equality holds if and only if X or Y is a nonnegative multiple of the other.

Figure 0.4. Triangle inequality.

c00f004

Proof.

Proof for Real Inner Product Spaces. Assume that one of the vectors, say Y, is nonzero, for otherwise there is nothing to show. Let t be a real variable and consider the following inequality:

(0.1)c00e028

(0.2)c00e029

The right side is a nonnegative quadratic polynomial in t, and so it cannot have two distinct real roots. Therefore, its discriminant (from the quadratic formula) must be nonpositive. In our case, this means

c00e030

Schwarz’s inequality follows by rearranging this inequality.

If 〈X, Y〉 = ||X|| ||Y||, then the preceding discriminant is zero, which means that the equation ||X – tY||2 = 0 has a double real root, c00ie014. In particular, c00ie015 0 or c00ie016, which implies that c00ie017. On the other hand, 〈X, Y〉 = ||X|| ||Y|| is nonnegative and therefore c00ie018. Thus c00ie019 is a nonnegative multiple of Y, as claimed. The converse (i.e., if X is a nonnegative multiple of Y, then 〈X, Y〉 = ||X|| ||Y||) is easy and left to the reader.

Proof for a Complex Inner Product Space. If V is a complex inner product space, the proof is similar. We let ϕ be an argument of 〈X, Y〉, which means

c00e031

Then we consider the following inequality:

c00e032

where “Re” stands for “the real part,” that is, c00ie020. In view of the choice of ϕ, the middle term is just –2t|〈X, Y〉| and so the term on the right equals the expression on the right side of (0.2). The rest of the argument is now the same as the argument given for the case of a real inner product space.

Proof of the Triangle Inequality. The proof of the triangle inequality now follows from the Schwarz inequality:

c00e033

Taking square roots of both sides of this inequality establishes the triangle inequality.

If the preceding inequality becomes an equality, then 〈X, Y〉 = ||X|| ||Y|| and the first part of the theorem implies that either X or Y is a nonnegative multiple of the other, as claimed.

0.5 ORTHOGONALITY

0.5.1 Definitions and Examples

For the standard inner product in R3, the law of cosines is

c00e034

which implies that X and Y are orthogonal (perpendicular) if and only if 〈X, Y〉 = 0. We make this equation the definition of orthogonality in general.

Definition 0.12 Suppose V is an inner product space.

  • The vectors X and Y in V are said to be orthogonal if 〈X, Y〉 = 0.
  • The collection of vectors ei, i = 1,..., N, is said to be orthonormal if each ei has unit length, ||ei|| = 1, and ei and ej are orthogonal for i ≠ j.
  • Two subspaces V1 and V2 of V are said to be orthogonal if each vector in V1 is orthogonal to every vector in V2.

An orthonormal basis or orthonormal system for V is a basis of vectors for V which is orthonormal.

Example 0.13 The line y = x generated by the vector (1, 1) is orthogonal to the line y = –x generated by (1, –1).

Example 0.14 The line x/2 = –y = z/3 in R3, which points in the direction of the vector (2, –1, 3), is orthogonal to the plane 2x – y + 3z = 0.

Example 0.15 For the space L2([0, 1]), any two functions such that the first is zero on the set where the second is nonzero are orthogonal.

For example, if f(t) is nonzero only on the interval 0 ≤ t ≤ 1/2 and g(t) is nonzero only on the interval 1/2 ≤ t < 1, then c00ie021 is always zero. Therefore c00ie022.

Example 0.16 Let

c00e035

Then ϕ and ψ are orthogonal in L2[0, 1] because

c00e036

In contrast to the previous example, note that ϕ and ψ are orthogonal and yet ϕ and ψ are nonzero on the same set, namely the interval 0 ≤ t ≤ 1. The function ϕ is called the scaling function and the function ψ is called the wavelet function for the Haar system. We shall revisit these functions in the chapter on Haar wavelets.

Example 0.17 The functions f(t) = sin t and g(t) = cos t are orthogonal in L2([–π, π]), because

c00e037

Since c00ie023, the functions c00ie024 and c00ie025 are orthonormal in L2([–π, π]). More generally, we shall show in the chapter on Fourier Series that the functions

c00e038

are orthonormal. This fact will be very important in our development of Fourier series.

Vectors can be easily expanded in terms of an orthonormal basis, as the following theorem shows.

Theorem 0.18 Suppose V0 is a subspace of an inner product space V. Suppose {e1,..., eN} is an orthonormal basis for V0. If v ∈ V0, then

c00e039

Proof. Since {e1,..., eN} is a basis for V0, any vector v ∈ V0 can be uniquely expressed as a linear combination of the ej:

c00e040

To evaluate the constant αk, take the inner product of both sides with ek:

c00e041

The only nonzero term on the right occurs when j = k since the ej are orthonormal. Therefore

c00e042

Thus, αk = 〈v, ek〉, as desired.

0.5.2 Orthogonal Projections

Suppose {e1,..., eN} is an orthonormal collection of vectors in an inner product space V. If v lies in the span of {e1,..., eN}, then as Theorem 0.18 demonstrates, the equation

(0.3)c00e043

is satisfied with αj = 〈v, ej〉. If v does not lie in the linear span of {e1,..., eN}, then solving Eq. (0.3) for αj is impossible. In this case, the best we can do is to determine the vector v0 belonging to the linear span of {e1,..., eN} that comes as close as possible to v. More generally, suppose V0 is a subspace of the inner product space V and suppose v ∈ V is a vector that is not in V0 (see Figure 0.5). How can we determine the vector v0 ∈ V0 that is closest to v? This vector (v0) has a special name given in the following definition.

Figure 0.5. Orthogonal projection of v onto V0.

c00f005

Definition 0.19 Suppose V0 is a finite-dimensional subspace of an inner product space V. For any vector v ∈ V, the orthogonal projection of v onto V0 is the unique vector v0 ∈ V0 that is closest to v, that is,

c00e044

As Figure 0.5 indicates, the vector v0, which is closest to v, must be chosen so that v – v0 is orthogonal to V0. Of course, figures are easily drawn when the underlying vector space is R2 or R3. In a more complicated inner product space, such as L2, figures are an abstraction, which may or may not be accurate (e.g., an element in L2 is not really a point in the plane as in Figure 0.5). The following theorem states that our intuition in R2 regarding orthogonality is accurate in a general inner product space.

Theorem 0.20 Suppose V0 is a finite-dimensional subspace of an inner product space V. Let v be any element in V. Then its orthogonal projection, v0, has the following property: v – v0 is orthogonal to every vector in V0.

Proof. We first show that if v0 is the closest vector to v, then v – v0 is orthogonal to every vector w ∈ V0. Consider the function

c00e045

which describes the square of the distance between v0 + tw ∈ V0 and v. If v0 is the closest element of V0 to v, then f is minimized when t = 0. For simplicity, we will consider the case where the underlying inner product space V is real. Expanding f, we have

c00e046

Since f is minimized when t = 0, its derivative at t = 0 must be zero. We have

c00e047

So

(0.4)c00e048

and we conclude that v0 – v is orthogonal to w.

The converse also holds: If v0 – v is orthogonal to w, then from (0.4) we have f′(0) = 0. Since f(t) is a nonnegative quadratic polynomial in t, this critical point t = 0 must correspond to a minimum. Therefore, ||v0 + tw – v||2 is minimized when t = 0. Since w is an arbitrarily chosen vector in V0, we conclude that v0 is the closest vector in V0 to v.

In terms of an orthonormal basis for V0, the projection of a vector v onto V0 is easy to compute, as the following theorem states.

Theorem 0.21 Suppose V is an inner product space and V0 is an N-dimensional subspace with orthonormal basis {e1, e2, …, eN}. The orthogonal projection of a vector v ∈ V onto V0 is given by

c00e049

Note. In the special case that v belongs to V0, v equals its orthogonal projection, v0. In this case, the preceding formula for v0 = v is the same as the one given in Theorem 0.18.

Proof. Let c00ie026 with αj = 〈v, ej〉. In view of Theorem 0.20, we must show that v – v0 is orthogonal to any vector w ∈ V0. Since e1,..., eN is a basis for V0, it suffices to show that v – v0 is orthogonal to each ek, k = 1,..., N. We have

c00e050

Since e1,..., eN are orthonormal, the only contributing term to the sum is when j = k:

c00e051

Thus, v – v0 is orthogonal to each ek and hence to all of V0, as desired.

Example 0.22 Let V0 be the space spanned by cos x and sin x in L2([–π, π]). As computed in Example 0.17, the functions

c00e052

are orthonormal in L2([–π, π]). Let f(x) = x. The projection of f onto V0 is given by

c00e053

Now, f(x) cos(x) = x cos(x) is odd and so c00ie027 0. For the other term,

c00e054

Therefore the projection of f (x) = x onto V0 is given by

c00e055
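
A numerical check of Example 0.22 (an added illustration): approximating the coefficients 〈f, e〉 by Riemann sums shows that the cosine coefficient vanishes and the sine coefficient is 2√π, so the projection works out to 2 sin x.

```python
import numpy as np

x = np.linspace(-np.pi, np.pi, 200001)
dx = x[1] - x[0]
f = x                                    # f(x) = x

e_cos = np.cos(x) / np.sqrt(np.pi)       # orthonormal basis of V0 from Example 0.17
e_sin = np.sin(x) / np.sqrt(np.pi)

a_cos = np.sum(f * e_cos) * dx           # <f, e_cos> ~ 0 (odd integrand)
a_sin = np.sum(f * e_sin) * dx           # <f, e_sin> ~ 2*sqrt(pi)
print(a_cos, a_sin, 2 * np.sqrt(np.pi))
# projection of f onto V0:  a_cos*e_cos + a_sin*e_sin  ~  2*sin(x)
```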

Example 0.23 Consider the space V1 which is spanned by ϕ(x) = 1 on 0 ≤ x < 1 and

c00e056

The functions ϕ and ψ are the Haar scaling function and wavelet function mentioned earlier. These two functions are orthonormal in L2([0, 1]). Let f(x) = x. As you can check,

c00e057

and

c00e058

So the orthogonal projection of the function f onto V1 is given by

c00e059
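
A numerical check of Example 0.23 (an added illustration), assuming the standard Haar wavelet ψ (equal to 1 on [0, 1/2) and –1 on [1/2, 1)): the coefficients come out to 〈f, ϕ〉 = 1/2 and 〈f, ψ〉 = –1/4, so the projection equals 1/4 on [0, 1/2) and 3/4 on [1/2, 1), the average of f on each half.

```python
import numpy as np

x = np.linspace(0, 1, 200000, endpoint=False)
dx = 1.0 / len(x)
f = x

phi = np.ones_like(x)                    # Haar scaling function on [0, 1)
psi = np.where(x < 0.5, 1.0, -1.0)       # standard Haar wavelet on [0, 1) (assumed form)

c_phi = np.sum(f * phi) * dx             # <f, phi> ~ 1/2
c_psi = np.sum(f * psi) * dx             # <f, psi> ~ -1/4
proj = c_phi * phi + c_psi * psi         # piecewise constant: 1/4 on [0,1/2), 3/4 on [1/2,1)
print(c_phi, c_psi, proj[0], proj[-1])
```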

The set of vectors that are orthogonal to a given subspace has a special name.

Definition 0.24 Suppose V0 is a subspace of an inner product space V. The orthogonal complement of V0, denoted c00ie029, is the set of all vectors in V which are orthogonal to V0, that is,

c00e060

As Figure 0.6 indicates, each vector can be written as a sum of a vector in V0 and a vector in c00ie029. The intuition from this Euclidean figure is accurate for more general inner product spaces as the following theorem demonstrates.

Theorem 0.25 Suppose V0 is a finite-dimensional subspace of an inner product space V. Each vector v ∈ V can be written uniquely as v = v0 + v1, where v0 belongs to V0 and v1 belongs to c00ie029, that is,

c00e061

Proof. Suppose v belongs to V and let v0 be its orthogonal projection onto V0. Let v1 = v – v0; then

c00e062

By Theorem 0.20, v1 is orthogonal to every vector in V0. Therefore, v1 belongs to c00ie029.

Example 0.26 Consider the plane V0 = {2x – y + 3z = 0}. The set of vectors

c00e063

forms an orthonormal basis. So given v = (x, y, z) ∈ R3, the vector

c00e064

is the orthogonal projection of v onto the plane V0.

The vector c00ie030 is a unit vector that is perpendicular to this plane. So

c00e065

is the orthogonal projection of v = (x, y, z) onto c00ie029.

The theorems in this section are valid for certain infinite-dimensional subspaces, but a discussion of infinite dimensions involves more advanced ideas from functional analysis (see Rudin (1973)).

0.5.3 Gram-Schmidt Orthogonalization

Theorems 0.18 and 0.21 indicate the importance of finding an orthonormal basis. Without an orthonormal basis, the computation of an orthogonal projection onto a subspace is more difficult. If an orthonormal basis is not readily available, then the Gram-Schmidt orthogonalization procedure describes a way to construct one for a given subspace.

Theorem 0.27 Suppose V0 is a subspace of dimension N of an inner product space V. Let vj, j = 1,..., N be a basis for V0. Then there is an orthonormal basis {e1,..., eN} for V0 such that each ej is a linear combination of v1,..., vj.

Proof. We first define e1 = v1/||v1||. Clearly, e1 has unit length. Let v0 be the orthogonal projection of v2 onto the line spanned by e1. From Theorem 0.21,

c00e066

Figure 0.7 suggests that the vector from v0 to v2 is orthogonal to e1. So let

c00e067

Figure 0.7. Gram–Schmidt orthogonalization.

c00f007

and note that

c00e068

which confirms our intuition from Figure 0.7.

Note that E2 cannot equal zero because otherwise v2 and e1 (and hence v2 and v1) would be linearly dependent. To get a vector of unit length, we define e2 = E2/||E2||. The vectors e1 and e2 are orthogonal to each other; and since e1 is a multiple of v1, the vector e2 is a linear combination of v1 and v2.

If N > 2, then we continue the process. We consider the orthogonal projection of v3 onto the space spanned by e1 and e2:

c00e069

Then let

c00e070

and set e3 = E3/||E3||. The same argument as before (for E2) shows that E3 is orthogonal to both e1 and e2. Thus, {e1, e2, e3} is an orthonormal set of vectors. The pattern is now clear.
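
A minimal sketch of the Gram–Schmidt procedure just described, written for vectors in Cn with the standard inner product (an added illustration; for a subspace of L2 the inner product would be an integral instead of np.vdot):

```python
import numpy as np

def gram_schmidt(vs):
    """Orthonormalize a list of linearly independent vectors (standard C^n inner product)."""
    es = []
    for v in vs:
        E = v.astype(complex)
        for e in es:
            E = E - np.vdot(e, v) * e     # subtract the projection <v, e> e
        es.append(E / np.linalg.norm(E))  # normalize; E != 0 since the vs are independent
    return es

vs = [np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0])]
es = gram_schmidt(vs)
print(np.round([[np.vdot(a, b) for b in es] for a in es], 10))   # identity matrix
```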

0.6 LINEAR OPERATORS AND THEIR ADJOINTS

0.6.1 Linear Operators

First, we recall the definition of a linear map.

Definition 0.28 A linear operator (or map) between a vector space V and a vector space W is a function T : V → W that satisfies

c00e071

If V and W are finite-dimensional, then T is often identified with its matrix representation with respect to a given choice of bases, say {v1,..., vn} for V and {w1,..., wm} for W. For each 1 ≤ j ≤ n, T(vj) belongs to W and therefore it can be expanded in terms of w1,..., wm:

(0.5)c00e072

where aij are complex numbers. The value of T(v) for any vector c00ie031c00ie032 can be computed by

c00e073

The coefficient of wi is c00ie033, which can be identified with the ith entry in the following matrix product:

c00e074

Thus, the matrix (aij) is determined by how the basis vectors vj ∈ V are mapped into the basis vectors wi of W [see Eq. (0.5)]. The matrix then determines how an arbitrary vector v maps into W.

Orthogonal projections also define linear operators. If V0 is a finite-dimensional subspace of an inner product space V, then by Theorem 0.25 we can define the orthogonal projection onto V0 to be P, where P(v) = v0. By Theorem 0.21, it is easy to show that P is a linear operator.

A linear operator T : V → W is said to be bounded if there is a number 0 ≤ M < ∞ such that

c00e075

In this case, the norm of T is defined to be the smallest such M. In words, a bounded operator takes the unit ball in V to a bounded set in W. As a result, a bounded operator takes bounded sets in V to bounded sets in W. All linear maps between finite-dimensional inner product spaces are bounded. As another example, the orthogonal projection map from any inner product space onto any of its subspaces (finite- or infinite-dimensional) is a bounded linear operator.

0.6.2 Adjoints

If V and W are inner product spaces, then we will sometimes need to compute 〈T(v), w〉W by shifting the operator T to the other side of the inner product. In other words, we want to write

c00e076

for some operator T* : W → V. We formalize the definition of such a map as follows.

Definition 0.29 Suppose T : V → W is a linear operator between two inner product spaces. The adjoint of T is the linear operator T* : W → V that satisfies

c00e077

Every bounded linear operator between two inner product spaces has an adjoint. Here are two examples of adjoints of linear operators.

Example 0.30 Let V = Cn and W = Cm with the standard inner products. Suppose T : Cn → Cm is a linear operator with matrix aij ∈ C with respect to the standard basis

c00e078

If X = (x1,..., xn) ∈ Cn and Y = (y1,..., ym) ∈ Cm, then

c00e079

where c00ie034 (the conjugate of the transpose). The right side is 〈X, T*(Y)〉, where the jth component of T*(Y) is c00ie035. Thus the matrix for the adjoint of T is the conjugate of the transpose of the matrix for T.
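
A numerical check of this example (an added illustration): for a random complex matrix A representing T, the identity 〈T(X), Y〉 = 〈X, T*(Y)〉 holds with T* given by the conjugate transpose.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 4
A = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))   # matrix of T : C^n -> C^m
X = rng.standard_normal(n) + 1j * rng.standard_normal(n)
Y = rng.standard_normal(m) + 1j * rng.standard_normal(m)

A_star = A.conj().T                       # adjoint: conjugate of the transpose
lhs = np.sum((A @ X) * np.conj(Y))        # <T(X), Y> in C^m
rhs = np.sum(X * np.conj(A_star @ Y))     # <X, T*(Y)> in C^n
print(np.isclose(lhs, rhs))               # True
```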

Example 0.31 Suppose g is a bounded function on the interval a ≤ x ≤ b. Let Tg : L2([a, b]) → L2([a, b]) be defined by

c00e080

The adjoint of Tg is just

c00e081

because

c00e082

The next theorem computes the adjoint of the composition of two operators.

Theorem 0.32 Suppose T1 : V → W and T2 : W → U are bounded linear operators between inner product spaces. Then c00ie036.

Proof. For v ∈ V and u ∈ U, we have

c00e083

On the other hand, the definition of adjoint implies

c00e084

Therefore

(0.6)c00e085

for all v ∈ V. By Exercise 17, if u0 and u1 are vectors in V with 〈v, u0〉 = 〈v, u1〉 for all v ∈ V, then u0 = u1. Therefore, from Eq. (0.6), we conclude that

c00e086

as desired.

In the next theorem, we compute the adjoint of an orthogonal projection.

Theorem 0.33 Suppose V0 is a subspace of an inner product space V. Let P be the orthogonal projection onto V0. Then P* = P.

Proof. From Theorem 0.25, each vector v ∈ V can be written as v = v0 + v1 with v0V0 and v1 orthogonal to V0. Note that P(v) = v0. Similarly, we can write any u ∈ V as u = u0 + u1 with u0 = P(u) ∈ V0 and c00ie037. We have

c00e087

since v0 ∈ V0 and c00ie037.

Similarly,

c00e088

Therefore 〈v, P(u)〉 = 〈P(v), u〉 and so P* = P.

0.7 LEAST SQUARES AND LINEAR PREDICTIVE CODING

In this section, we apply some of the basic facts about linear algebra and inner product spaces to the topic of least squares analysis. As motivation, we first describe how to find the best-fit line to a collection of data. Then, the general idea behind the least squares algorithm is presented. As a further application of least squares, we present the ideas behind linear predictive coding, which is a data compression algorithm. In the next chapter, least squares will be used to develop a procedure for approximating signals (or functions) by trigonometric polynomials.

0.7.1 Best-Fit Line for Data

Consider the following problem. Suppose real data points xi and yi for i = 1,..., N are given with xi ≠ xj for i ≠ j. We wish to find the equation of the line y = mx + b which comes closest to fitting all the data. Figure 0.8 gives an example of four data points (indicated by the four small circles) as well as the graph of the line which comes closest to passing through these four points.

The word closest here means that the sum of the squares of the errors (between the data and the line) is smaller than the corresponding error with any other line. Suppose the line we seek is y = mx + b. As Figure 0.9 demonstrates, the error between this line at x = xi and the data point (xi, yi) is |yi – (mxi + b)|. Therefore, we seek numbers m and b which minimize the quantity

c00e089

Figure 0.8. Least squares approximation.

c00f008

Figure 0.9. Error at xi is |yi – (mxi + b)| (length of dashed line).

c00f009

The quantity E can be viewed as the square of the distance (in RN) from the vector

c00e090

to the vector mX + bU, where

(0.7)c00e091

As m and b vary over all possible real numbers, the expression mX + bU sweeps out a two-dimensional plane M in RN. Thus our problem of least squares has the following geometric interpretation: Find the point P = mX + bU on M which is closest to the point Y (see Figure 0.10). The point P must be the orthogonal projection of Y onto M. In particular, Y – P must be orthogonal to M. Since M is generated by the vectors X and U, Y – P must be orthogonal to both X and U. Therefore, we seek the point P = mX + bU that satisfies the following two equations:

c00e092

or

c00e093

These equations can be rewritten in matrix form as

c00e094

Figure 0.10. Closest point on the plane to the point Y .

c00f010

Keep in mind that the xi and yi are the known data points. The solution to our least squares problem is obtained by solving the preceding linear system for the unknowns m and b. This discussion is summarized in the following theorem.

Theorem 0.34 Suppose X = {x1, x2,..., xN} and Y = {y1, y2,..., yN} are two sets of real data points. The equation of the line y = mx + b which most closely approximates the data (x1, y1),..., (xN, yN) in the sense of least squares is obtained by solving the linear equation

c00e095

for m and b where

c00e096

If the xi are distinct, then this system of equations has the following unique solution for m and b:

c00e097

where c00ie038 and c00ie039.

Proof. We leave the computation of the formulas for m and b as an exercise (see Exercise 24). The statement regarding the unique solution for m and b will follow once we show that the matrix ZTZ is nonsingular (i.e., invertible). If the xi are not all the same (in particular, if they are distinct), then the vectors X and U are linearly independent and so the matrix Z has rank two. In addition, for any V ∈ R2

c00e098

Since Z is a matrix of maximal rank (i.e., rank two), the only way ZV can be zero is if V is zero. Therefore, 〈(ZTZ)V, V〉 > 0 for all nonzero V, which means that ZTZ is positive definite. In addition, the matrix ZTZ is symmetric because its transpose, (ZTZ)T, equals itself. By a standard fact from linear algebra, this positive definite symmetric matrix must be nonsingular. Thus, the equation

c00e099

has a unique solution for m and b.
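
A short sketch of Theorem 0.34 in Python/NumPy (an added illustration with made-up data): form Z with columns X and U, solve the 2 × 2 normal equations ZTZ V = ZTY, and compare with a library least squares solver.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.2, 2.9])               # hypothetical data points

Z = np.column_stack([x, np.ones_like(x)])        # columns X and U
m, b = np.linalg.solve(Z.T @ Z, Z.T @ y)         # normal equations (Z^T Z) V = Z^T Y
print(m, b)
print(np.linalg.lstsq(Z, y, rcond=None)[0])      # same answer from a library solver
```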

0.7.2 General Least Squares Algorithm

Suppose Z is an N × q matrix (with possibly complex entries) and let Y be a vector in RN (or CN). Linear algebra is, in part, the study of the equation ZV = Y, which when written out in detail is

c00e100

If N > q, then the equation ZV = Y does not usually have a solution for V ∈ Cq because there are more equations (N) than there are unknowns (v1,..., vq). If there is no solution, the problem of least squares asks for the next best quantity: Find the vector V ∈ Cq such that ZV is as close as possible to Y.

In the case of finding the best-fit line to a set of data points (xi, yi), i = 1,..., N, the matrix Z is

(0.8)c00e101

and the vectors Y and V are

c00e102

In this case, the matrix product ZV is

c00e103

where X and U are the vectors given in Eq. (0.7). Thus finding the V = (m, b) so that ZV is closest to Y is equivalent to finding the slope and y-intercept of the best fit line to the data (xi, yi), i = 1 ,..., N, as in the last section.

The solution to the general least squares problem is given in the following theorem.

Theorem 0.35 Suppose Z is an N × q matrix (with possibly complex entries) of maximal rank and with N ≥ q. Let Y be a vector in RN (or CN). There is a unique vector V ∈ Cq such that ZV is closest to Y. Moreover, the vector V is the unique solution to the matrix equation

c00e104

Figure 0.11. Y – ZV must be orthogonal to M = span{Z1,..., Zq}.

c00f011

If Z is a matrix with real entries, then the preceding equation becomes

c00e105

Note that in the case of the best-fit line, the matrix Z in Eq. (0.8) and the equation ZTY = ZTZV are the same as those given in Theorem 0.34.

Proof. The proof of this theorem is similar to the proof given in the construction of the best-fit line. We let Z1,..., Zq be the columns of the matrix Z. Then ZV = v1Z1 + … + vqZq is a point that lies in the subspace M ⊂ CN generated by Z1,..., Zq. We wish to find the point ZV that is closest to Y. As in Figure 0.11, Y – ZV must be orthogonal to M; or equivalently, Y – ZV must be orthogonal to Z1,..., Zq which generate M. Thus

c00e106

These equations can be written succinctly as

c00e107

because the ith component of this (vector) equation is the inner product of Y – ZV with Zi. This equation can be rearranged to read

c00e108

as claimed in the theorem.

The matrix Z*Z has dimension q × q and by the same arguments used in the proof of Theorem 0.34, you can show that this matrix is nonsingular (using the fact that Z has maximal rank). Therefore, the equation

c00e109

has a unique solution V ∈ Cq as claimed.

Example 0.36 If a set of real data points {(xi, yi), i = 1,..., N} behaves in a quadratic rather than a linear fashion, then a best-fit quadratic equation of the form y = ax2 + bx + c can be found. In this case, we seek a, b, and c which minimize the quantity

c00e110

We can apply Theorem 0.35 with

c00e111

From Theorem 0.35, the solution V = (a, b, c) to this least squares problem is the solution to ZT ZV = ZTY. Exercise 28 asks you to solve this system with specific numerical data.
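
A sketch of this example with made-up data (the numerical data of Exercise 28 are not reproduced here): build Z with columns xi2, xi, 1 and solve the normal equations.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([4.1, 0.9, 0.2, 1.1, 3.8])          # roughly quadratic, hypothetical values

Z = np.column_stack([x**2, x, np.ones_like(x)])
a, b, c = np.linalg.solve(Z.T @ Z, Z.T @ y)      # least squares coefficients of y = a x^2 + b x + c
print(a, b, c)
```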

0.7.3 Linear Predictive Coding

Here, we will apply the least squares analysis procedure to the problem of efficiently transmitting a signal. As mentioned earlier, computers can process millions—and in some cases, billions—of instructions per second. However, if the output must be transmitted from one location to another (say a picture downloaded from the web), the signal must often be sent over telephone lines or some other medium that can only transmit thousands of bytes per second (in the case of telephone lines, this rate is currently about 60 kilobytes per second). Therefore instead of transmitting all the data points of a signal, some sort of coding algorithm (data compression) is applied so that only the essential parts of the signal are transmitted.

Let us suppose we are transmitting a signal, which after some discretization process can be thought of as a long string of numbers

c00e112

(zeros and ones, perhaps). For simplicity, we will assume that each xi is real. Often, there is a repetition of some pattern of the signal (redundancy). If the repetition is perfect (say 1,1,0, 1,1,0, 1,1,0, etc.), then there would be no need to send all the digits. We would only need to transmit the pattern 1,1,0 and the number of times this pattern is repeated. Usually, however, there is not a perfect repetition of a pattern, but there may be some pattern that is nearly repetitive. For example, the rhythm of a beating heart is nearly, but not exactly, repetitive (if it is a healthy heart). If there is a near repetition of some pattern, then the following linear predictive coding procedure can achieve significant compression of data.

Main Idea. The idea behind linear predictive coding is to divide up the data into blocks of length N, where N is a large number.

c00e113

Let’s consider the first block of data x1,... , xN. We choose a number p that should be small compared to N. The linear predictive coding scheme will provide the best results (best compression) if p is chosen close to the number of digits in the near repetitive pattern of this block of data. Next, we try to find numbers a1, a2,... , ap that minimize the terms

(0.9)c00e114

in the sense of least squares. Once this is done (the details of which will be presented later), then the idea is to transmit x1... xp as well as a1,... , ap. Instead of transmitting xp+1, xp+2,... , we use the following scheme starting with n = p + 1. If e(p + 1) is smaller than some specified tolerance, then we can treat e(p + 1) as zero. By letting n = p + 1 and e(p + 1) = 0 in Eq. (0.9), we have

c00e115

There is no need to transmit xp+1 because the data x1... xp as well as a1... ap have already been transmitted and so the receiver can reconstruct xp+1 according to the preceding formula. If e(p + 1) is larger than the specified tolerance, then xp+1 (or equivalently e(p + 1) ) needs to be transmitted.

Once the receiver has reconstructed (or received) xp+1, n can be incremented to p + 2 in Eq. (0.9). If e(p + 2) is smaller than the tolerance, then xp+2 does not need to be transmitted and the receiver can reconstruct xp+2 by setting e(p + 2) = 0 in Eq. (0.9), giving

c00e116

The rest of the xp+3,... , xN can be reconstructed by the receiver in a similar fashion.

The hope is that if the ai have been chosen to minimize {e(p + 1),... , e(N)} in the sense of least squares, then most of the |e(n)| will be less than the specified tolerance and therefore most of the xn can be reconstructed by the receiver and not actually transmitted. The result is that instead of transmitting N pieces of data (i.e., x1,... , xN), we only need to transmit 2p pieces of data (i.e., a1,... , ap and x1,... , xp) and those (hopefully few) values of xn where |e(n)| is larger than the tolerance. Since 2p is typically much less than N, significant data compression can be achieved. The other blocks of data can be handled similarly, with possibly different values of p.

Role of Least Squares. To find the coefficients a1,..., ap, we use Theorem 0.35. We start by putting Eq. (0.9), for n = p + 1,..., N, in matrix form:

c00e117

where

c00e118

and

c00e119

We want to choose V = (a1,..., ap)T so that ||E|| is as small as possible—or in other words, so that ZV is as close as possible to Y. From Theorem 0.35, V = (a1,..., ap)T is found by solving the following (real) matrix equation:

c00e120

Written out in detail, this equation is

(0.10)c00e121

where we have labeled the columns of the matrix Z by Zp,..., Z1 (reverse order). The horizontal dots on either side of the c00ie040 indicate that these entries are row vectors. Likewise, the vertical dots above and below the Zi indicate that these entries are column vectors.

Equation (0.10) is a p × p system of equations for the a1,... , ap which can be solved in terms of the Z-vectors (i.e., the original signal points, x) via Gaussian elimination.
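
A minimal sketch of this computation (an added illustration, assuming the prediction model xn ≈ a1xn–1 + … + apxn–p implicit in Eq. (0.9), with a made-up signal): build Y and Z from one block, solve the normal equations for a1,..., ap, and count how many residuals e(n) exceed a tolerance.

```python
import numpy as np

# Hypothetical signal: a period-6 pattern plus a small perturbation (cf. Exercise 29).
# The perturbation also keeps Z of maximal rank.
N, p = 60, 6
x = np.sin(np.pi * np.arange(1, N + 1) / 3) + 0.01 * np.random.default_rng(2).standard_normal(N)

Y = x[p:]                                                        # (x_{p+1}, ..., x_N)
Z = np.column_stack([x[p - i:N - i] for i in range(1, p + 1)])   # row for n holds (x_{n-1},...,x_{n-p})

a = np.linalg.solve(Z.T @ Z, Z.T @ Y)                # normal equations Z^T Z V = Z^T Y
e = Y - Z @ a                                        # residual errors e(p+1), ..., e(N)
print(a)
print(np.sum(np.abs(e) > 0.1), "of", len(e), "samples exceed the tolerance 0.1")
```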

Summary of Linear Predictive Coding. Linear predictive coding involves the following procedure.

1. The sender cuts the data into blocks

c00e122

where each block has some near repetitive pattern. The sender then chooses p close to the length of the repetitive pattern for the first block.

2. For 1 ≤ i ≤ p, form the vectors

c00e123

3. The sender solves the system of equations (0.10) for the coefficients a1,..., ap and transmits to the receiver both a1,..., ap and x1,..., xp.

4. The receiver then reconstructs xp+1,... , xN (in this order) via the equation

c00e124

for those xn where the corresponding least squares errors, e(n), are smaller than some specified tolerance. If e(n) is larger than the tolerance, then the sender must transmit xn.

Certainly, some work is required for the sender to solve the preceding equations for the a1,... , ap and for the receiver to reconstruct the xn. One may wonder whether this work is more than the energy required to transmit all the xi. However, it should be kept in mind that the work required to solve for the ai and reconstruct the xn is done by the sender and receiver with computers that can do millions or billions of operations per second, whereas the transmission lines may only handle thousands of data bits per second. So the goal is to shift, as much as possible, the burden from the relatively slow process of transmitting the data to the much faster process of performing computations by the computers located at either the sender or the receiver.

EXERCISES

1. Verify that the function

c00e125

for Z = (Z1,... , Zn), W = (W1 ,..., Wn) ∈ Cn defines an inner product on Cn (i.e., satisfies Definition 0.1).

2. Verify that the functions 〈 , 〉 defined in Examples 0.2 and 0.3 are inner products.

3. Define 〈V, W〉 for V = (v1, v2) and W = (w1, w2) ∈ C2 as

c00e126

Show that 〈V, V〉 = 0 for all vectors V = (v1, v2) with v1 + 2v2 = 0. Does 〈·, ·〉 define an inner product?

4. Show that the L2[a, b] inner product satisfies the following properties.

  • The L2 inner product is conjugate-symmetric (i.e., c00ie041), homogeneous, and bilinear (these properties are listed in Definition 0.1).
  • Show that the L2 inner product satisfies positivity on the space of continuous functions on [a, b] by using the following outline.

(a) We want to show that if c00ie042, then f(t) = 0 for all a ≤ t ≤ b.

(b) Suppose, by contradiction, that |f(t0)| > 0; then use the definition of continuity to show that |f(t)| > |f(t0)|/2 on an interval of the form [t0 – δ, t0 + δ].

(c) Then show

c00e127

which contradicts the assumption that c00ie043.

5. Show that c00ie044 defines an inner product on l2.

6. For n > 0, let

c00e128

Show that fn → 0 in L2[0, 1]. Show that fn does not converge to zero uniformly on [0, 1].

7. For n > 0, let

c00e129

Show that fn → 0 in L2[0, 1] but that fn(0) does not converge to zero.

8. Is Theorem 0.10 true on an infinite interval such as [0, ∞)?

9. Compute the orthogonal complement of the space in R3 spanned by the vector (1, –2, 1).

10. Let f(t) = 1 on 0 ≤ t ≤ 1. Show that the orthogonal complement of f in L2[0, 1] is the set of all functions whose average value is zero.

11. Show that if a differentiable function, f, is orthogonal to cos(t) on L2[0, π] then f′ is orthogonal to sin(t) in L2[0, π]. Hint: Integrate by parts.

12. By using the Gram-Schmidt Orthogonalization, find an orthonormal basis for the subspace of L2[0, 1] spanned by 1, x, x2, x3.

13. Find the L2[0, 1] projection of the function cos x onto the space spanned by 1, x, x2, x3.

14. Find the L2[–π, π] projection of the function f(x) = x2 onto the space Vn ⊂ L2[–π, π] spanned by

c00e130

for n = 1. Repeat this exercise for n = 2 and n = 3. Plot these projections along with f using a computer algebra system. Repeat for g(x) = x3.

15. Project the function f(x) = x onto the space spanned by ϕ(x), ψ(x), ψ(2x), ψ(2x – 1) ∈ L2[0, 1], where

c00e131

16. Let D = {(x, y) ∈ R2; x2 + y2 ≤ 1}. Let

c00e132

Define an inner product on L2(D) by

c00e133

Let ϕn(x, y) = (x + iy)n, n = 0, 1, 2,.... Show that this collection of functions is orthogonal in L2(D) and compute ||ϕn||. Hint: Use polar coordinates.

17. Suppose u0 and u1 are vectors in the inner product space V with 〈u0, v〉 = 〈u1, v〉 for all v ∈ V. Show that u0 = u1. Hint: Let v = u0 – u1.

18. Suppose A is an n × n matrix with complex entries. Show that the following are equivalent.

(a) The rows of A form an orthonormal basis in Cn.

(b) AA* = I (the identity matrix).

(c) ||Ax|| = ||x|| for all vectors x ∈ Cn.

19. Suppose K(x, y) is a continuous function which vanishes outside a bounded set in R × R. Define T : L2(R) →L2(R) by

c00e134

Show c00ie045. Note the parallel with the adjoint of a matrix c00ie046.

20. Suppose A : V → W is a linear map between two inner product spaces. Show that c00ie047. Note: Ker stands for Kernel; Ker(A*) is the set of all vectors in W which are sent to zero by A*.

21. Prove the following theorem (Fredholm’s Alternative). Suppose A : V → W is a linear map between two inner product spaces. Let b be any element in W. Then either

  • Ax = b has a solution for some x ∈ V or
  • There is a vector w ∈ W with A*w = 0 and 〈b, w〉W ≠ 0.

22. Suppose V0 is a finite-dimensional subspace of an inner product space, V. Show that c00ie048. Hint: The inclusion ⊂ is easy; for the reverse inclusion, take any element c00ie049 and then use Theorem 0.25 to decompose w into its components in V0 and c00ie029. Show that its c00ie029 component is zero.

23. Show that a set of orthonormal vectors is linearly independent.

24. Verify the formulas for m and b given in Theorem 0.34.

25. Prove the uniqueness part of Theorem 0.35. Hint: See the proof of the uniqueness part of Theorem 0.34.

26. Obtain an alternative proof (using calculus) of Theorem 0.34 by using the following outline.

(a) Show that the least squares problem is equivalent to finding m and b to minimize the error quantity

c00e135

(b) From calculus, show that this minimum occurs when

c00e136

(c) Solve these two equations for m and b.

27. Obtain the best-fit least squares line for these data: c00ie050

28. Repeat the previous problem with the best-fit least squares parabola.

29. This exercise is best done with Matlab (or something equivalent). The goal of this exercise is to use linear predictive coding to compress strings of numbers. Choose X = (x1,..., xN), where xj is a periodic sequence of period p and length N. For example, try xj = sin(jπ/3) for 1 ≤ j ≤ N = 60, which is a periodic sequence of period p = 6. Apply the linear predictive coding scheme to compute a1,..., ap. Compute the residual E = Y – ZV. If done correctly, this residual should be theoretically zero (although the use of a computer will introduce a small round-off error). Now perturb X by a small randomly generated sequence (in Matlab, add rand(1,60) to X). Then re-apply linear predictive coding and see how many terms in the residual E are small (say less than 0.1). Repeat with other sequences X on your own.
