0 INNER PRODUCT SPACES

0.1 MOTIVATION

For two vectors X = (x1, x2, x3), Y = (y1, y2, y3) in R3, the standard (Euclidean) inner product of X and Y is defined as

c00e001

This definition is partly motivated by the desire to measure the length of a vector, which is given by the Pythagorean Theorem:

c00e002

The goal of this chapter is to define the concept of an inner product in a more general setting that includes a wide variety of vector spaces. We are especially interested in the inner product defined on vector spaces whose elements are signals (i.e., functions of time).

0.2 DEFINITION OF INNER PRODUCT

The definition of an inner product in R3 naturally generalizes to Rn for any dimension n. For two vectors X = (x1, x2, …, xn), Y = (y1, y2,… , yn) in Rn, the Euclidean inner product is

c00e003

When we study Fourier series and the Fourier transform, we will use the complex exponential. Thus, we must consider complex vector spaces as well as real ones. The preceding definition can be modified for vectors in Cn by conjugating the second factor. Recall that the conjugate of a complex number z = x + iy is defined as c00ie006. Note that c00ie007, which by definition is |z|2 [the square of the length of z = x + iy regarded as a vector in the plane from (0, 0) to (x, y)].

If Z = (z1, z2,…, zn), W = (w1, w2,..., wn) are vectors in Cn, then

c00e004

The purpose of the conjugate is to ensure that the length of a vector in Cn is real:

c00e005
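
As a quick numerical illustration (added here, not part of the original text), the following Python/NumPy sketch computes this inner product on Cn and checks that 〈Z, Z〉 is real and equals the squared length of Z.

```python
import numpy as np

def cn_inner(z, w):
    """Inner product on C^n: sum of z_k * conjugate(w_k)."""
    return np.sum(z * np.conj(w))

Z = np.array([1 + 2j, 3 - 1j])
W = np.array([2j, 1 + 1j])

print(cn_inner(Z, W))          # in general a complex number
print(cn_inner(Z, Z))          # (1+4) + (9+1) = 15, purely real
print(np.linalg.norm(Z)**2)    # agrees with ||Z||^2
```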

The inner products just defined share certain properties. For example, the inner product is bilinear, which means

c00e006

The rest of the properties satisfied by the aforementioned inner products are set down as axioms in the following definition. We leave the verification of these axioms for the inner products for Rn and Cn as exercises.

Definition 0.1 An inner product on a complex vector space V is a function 〈·, ·〉 : V × V → C that satisfies the following properties.

  • Positivity: 〈v, v〉 > 0 for each nonzero v ∈ V.
  • Conjugate symmetry: c00ie001 for all vectors v and w in V.
  • Homogeneity: 〈cv, w〉 = c〈v, w〉 for all vectors v and w in V and scalars c ∈ C.
  • Linearity: 〈u + v, w〉 = 〈u, w〉 + 〈v, w〉 for all u, v, w ∈ V.

A vector space with an inner product is called an inner product space.

To emphasize the underlying space V, we sometimes denote the inner product on V by

c00e007

The preceding definition also serves to define a real inner product on a real vector space except that the scalar c in the homogeneity property is real and there is no conjugate in the statement of conjugate symmetry.

Note that the second and fourth properties imply linearity in the second factor: 〈u, v + w〉 = 〈u, v〉 + 〈u, w〉. The second and third properties imply that scalars factor out of the second factor with a conjugate:

c00e008

The positivity condition means that we can assign the nonzero number, ||v|| = c00ie002, as the length or norm of the vector v. The notion of length gives meaning to the distance between two vectors in V, by declaring that

c00e009

Note that the positivity property of the inner product implies that the only way ||v – w|| = 0 is when v = w. This notion of distance also gives meaning to the idea of a convergent sequence {vk; k = 1, 2,...}; namely, we say that

c00e010

In words, vk → v if the distance between vk and v gets small as k gets large.

Here are some further examples of inner products.

Example 0.2 Let V be the space of polynomials p = anxn +…+ a1x + a0, with aj ∈ C. An inner product on V is given as follows: If p = a0 + a1x +… + anxn and q = b0 + b1x +… + bnxn, then

c00e011

Note that this inner product space looks very much like Cn+1 where we identify a point (a0,..., an) ∈ Cn+1 with a0 + a1x + … + anxn.

Example 0.3 Different inner products can be imposed on the same vector space. This example defines an inner product on C2 that is different from the standard Euclidean inner product. For vectors v = (v1, v2) and w = (w1, w2) in C2, define

c00e012

There is nothing special about the particular choice of matrix. We can replace the matrix in the preceding equation with any matrix A as long as it is Hermitian symmetric (meaning that c00ie003, which is needed for conjugate symmetry) and positive definite (meaning that all eigenvalues are positive, which is needed for the positivity axiom). Verification of these statements will be left as exercises.
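
The specific matrix used in the preceding equation is not reproduced here; the following sketch (an added illustration) uses a sample Hermitian, positive definite matrix A and the assumed form 〈v, w〉 = w*Av to check conjugate symmetry and positivity numerically.

```python
import numpy as np

# A sample Hermitian, positive definite matrix (not the one in the text).
A = np.array([[2.0, 1j],
              [-1j, 3.0]])

def inner_A(v, w):
    """<v, w> = w^* A v  (conjugate-transpose of w, times A, times v)."""
    return np.conj(w) @ A @ v

v = np.array([1.0 + 1j, 2.0])
w = np.array([0.5, -1j])

# Conjugate symmetry: <v, w> = conjugate(<w, v>)
print(np.isclose(inner_A(v, w), np.conj(inner_A(w, v))))   # True

# Positivity: <v, v> is real and positive because the eigenvalues of A are positive.
print(np.linalg.eigvalsh(A))      # both eigenvalues > 0
print(inner_A(v, v))              # real and positive (here exactly 20)
```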

0.3 THE SPACES L2 AND l2

0.3.1 Definitions

The examples in the last section are all finite-dimensional. In this section, we discuss a class of infinite-dimensional vector spaces which is particularly useful for analyzing signals. A signal (for example, a sound signal) can be viewed as a function, f(t), which indicates the intensity of the signal at time t. Here t varies in an interval a ≤ t ≤ b which represents the time duration of the signal. Here, a could be –∞ or b could be +∞.

We will need to impose a growth restriction on the functions defined on the interval a ≤ t ≤ b. This leads to the following definition.

Definition 0.4 For an interval a ≤ t ≤ b, the space L2([a, b]) is the set of all square integrable functions defined on a ≤ t ≤ b. In other words,

c00e013

Functions that are discontinuous are allowed as members of this space. All the examples considered in this book are either continuous or discontinuous at a finite set of points. In this context, the preceding integral can be interpreted in the elementary Riemann sense (the one introduced in freshman calculus). The definition of L2 allows functions whose set of discontinuities is quite large, in which case the Lebesgue integral must be used. The condition c00ie004 physically means that the total energy of the signal is finite (which is a reasonable class of signals to consider).

The space L2[a, b] is infinite-dimensional. For example, if a = 0 and b = 1, then the set of functions {1, t, t2, t3...} is linearly independent and belongs to L2[0, 1]. The function f(t) = 1/t is an example of a function that does not belong to L2[0, 1] since c00ie005.

L2 Inner Product. We now turn our attention to constructing an appropriate inner product on L2[a, b]. To motivate the L2 inner product, we discretize the interval [a, b]. To simplify matters, let a = 0 and b = 1. Let N be a large positive integer and let tj = j/N for 1 ≤ j ≤ N. If f is continuous, then the values of f on the interval [tj, tj+1) can be approximated by f(tj). Therefore, f can be approximated by the vector

c00e014

as illustrated in Figure 0.1. As N gets larger, fN becomes a better approximation to f.

If f and g are two signals in L2[0, 1], then both signals can be discretized as fN and gN. One possible definition of 〈f, g〉L2 is to examine the ordinary RN inner product of fN and gN as N gets large:

c00e015

Figure 0.1. Approximating a continuous function by discretization.

c00f001

The trouble with this approach is that as N gets large, the sum on the right typically gets large. A better choice is to consider the averaged inner product:

c00e016

Since fN and gN approach f and g as N gets large, a reasonable definition of 〈f, g〉L2 is to take the limit of this averaged inner product as N → ∞.

The preceding equation can be written as

c00e017

The sum on the right is a Riemann sum approximation to c00ie008 over the partition 0 < t1 < t2 < ... < tN = 1 of [0, 1]. This approximation gets better as N gets larger. Thus, a reasonable definition of an inner product on L2[0, 1] is c00ie009. This motivation provides the basis for the following definition.
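
As a numerical check of this motivation (an added illustration), the averaged discrete inner product converges to the integral; here with f(t) = t and g(t) = t2, whose L2[0, 1] inner product is 1/4.

```python
import numpy as np

f = lambda t: t
g = lambda t: t**2          # real-valued, so the conjugate has no effect

exact = 0.25                # integral of t * t^2 over [0, 1]
for N in (10, 100, 1000, 10000):
    t = np.arange(N) / N                       # t_j = j/N for j = 0,...,N-1
    approx = np.sum(f(t) * np.conj(g(t))) / N  # averaged discrete inner product
    print(N, approx, abs(approx - exact))      # error shrinks as N grows
```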

Definition 0.5 The L2 inner product on L2([a, b]) is defined as

c00e018

The conjugate symmetry, homogeneity, and bilinearity properties are all easily established for this inner product and we leave them as exercises.

For the positivity condition, if c00ie010 and if f is continuous, then f(t) = 0 for all t (see Exercise 4). If f(t) is allowed to be discontinuous at a finite number of points, then we can only conclude that f(t) = 0 at all but a finite number of t values. For example, the function

c00e019

is not the zero function, yet ∫ |f(t)|2 dt = 0, where the integral is taken over –1 ≤ t ≤ 1. However, we stipulate that two elements f and g in L2([a, b]) are equal if f(t) = g(t) for all values of t except for a finite number of t values (or, more generally, a set of measure zero if the Lebesgue integral is used). This is a reasonable definition for the purposes of integration, since c00ie011 for such functions. With this convention, the positivity condition holds.

This notion of equivalence is reasonable from the point of view of signal analysis. The behavior of a signal at one instant in time (say t = 0) is rarely important. The behavior of a signal over a time interval of positive length is important. Although measure theory and the Lebesgue integral are not used in this text, we digress to discuss this topic just long enough to put the notion of equivalence discussed in the previous paragraph in a broader context. The concept of measure of a set generalizes the concept of length of an interval. The measure of an interval {a < t < b} is defined to be b – a. The measure of a disjoint union of intervals is the sum of their lengths. So the measure of a finite (or countably infinite) set of points is zero. The measure of a more complicated set can be determined by decomposing it into a limit of sets that are disjoint unions of intervals. Since intervals of length zero have no effect on integration, it is reasonable to expect that if a function f is zero on a ≤ t ≤ b except on a set of measure zero, then ∫ f(t) dt = 0 over this interval. The converse is also true: If

c00e020

then f(t) = 0 on a ≤ t ≤ b except possibly on a set of measure zero. For this reason, it is reasonable to declare that two functions, f and g in L2[a, b], are equivalent on [a, b] if f(t) = g(t) for all t in [a, b] except possibly for a set of measure zero. This general notion of equivalence includes the definition stated in the previous paragraph (that two functions are equivalent if they agree except at a finite number of points). For more details, consult a text on real analysis [e.g., Folland (1992)].

The Space l2. For many applications, the signal is already discrete. For example, the signal from a compact disc player can be represented by a discrete set of numbers that represent the intensity of its sound signal at regular (small) time intervals. In such cases, we represent the signal as a sequence X = ..., x–1, x0, x1,..., where each xj is the numerical value of the signal at the jth time interval [tj, tj+1]. Theoretically, the sequence could continue indefinitely (either as j → ∞ or as j → –∞ or both). In reality, the signal usually stops after some point, which mathematically can be represented by xj = 0 for |j| > N for some integer N.

The following definition describes a discrete analogue of L2.

Definition 0.6 The space l2 is the set of all sequences X = ..., x–1, x0, x1,..., xi ∈ C, with c00ie012 The inner product on this space is defined as

c00e021

for X = ..., x–1, x0, x1,... and Y = ..., y–1, y0, y1,....

Verifying that 〈·, ·〉 is an inner product for l2 is relatively easy and will be left to the exercises.
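
For sequences with only finitely many nonzero terms, the l2 inner product reduces to a finite sum. A small sketch (an added illustration, with made-up sequences):

```python
import numpy as np

# Two "sequences" with finite support, stored as arrays indexed j = -2,...,2.
x = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
y = np.array([1.0, 0.0, 1.0, 0.0, 1.0])

inner = np.sum(x * np.conj(y))           # <X, Y> = sum of x_j * conjugate(y_j)
norm_x = np.sqrt(np.sum(np.abs(x)**2))   # l2 norm of X
print(inner, norm_x)                     # 2.0 and sqrt(6)
```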

Relative Error. For two signals, f and g, the L2 norm of their difference, ||f – g||L2, provides one way of measuring how f differs from g. However, often the relative error is more meaningful:

c00e022

(the denominator could also be || g || L2). The relative error measures the L2 norm of the difference between f and g in relation to the size of || f || L2. For discrete signals, the l2 norm is used.

0.3.2 Convergence in L2 Versus Uniform Convergence

As defined in Section 0.2, a sequence of vectors {vn; n = 1, 2,...} in an inner product space V is said to converge to the vector v ∈ V provided that vn is close to v when n is large. Closeness here means that ||vn – v|| is small. To be more mathematically precise, vn converges to v if ||vn – v|| → 0 as n → ∞.

In this text, we will often deal with the inner product space L2[a,b] and therefore we discuss convergence in this space in more detail.

Definition 0.7 A sequence fn converges to f in L2[a, b] if ||fn – f||L2 → 0 as n → ∞. More precisely, given any tolerance ε > 0, there exists a positive integer N such that if n ≥ N, then ||f – fn||L2 < ε.

Convergence in L2 is sometimes called convergence in the mean. There are two other types of convergence often used with functions.

Definition 0.8

1. A sequence fn converges to f pointwise on the interval a ≤ t ≤ b if for each t ∈ [a, b] and each small tolerance ε > 0, there is a positive integer N such that if n ≥ N, then |fn(t) – f(t)| < ε.

2. A sequence fn converges to f uniformly on the interval a ≤ t ≤ b if for each small tolerance ε > 0, there is a positive integer N such that if n ≥ N, then |fn(t) – f(t)| < ε for all a ≤ t ≤ b.

For uniform convergence, the N only depends on the size of the tolerance and not on the point t, whereas for pointwise convergence, the N is allowed to also depend on the point t.

How do these three types of convergence compare? If fn uniformly converges to f on [a,b], then the values of fn are close to f over the entire interval [a,b]. For example, Figure 0.2 illustrates the graphs of two functions which are uniformly close to each other. By contrast, if fn converges to f pointwise, then for each fixed t, fn(t) is close to f(t) for large n. However, the rate at which fn(t) approaches f(t) may depend on the point t. Thus, a sequence that converges uniformly also must converge pointwise, but not conversely.

Figure 0.2. Uniform approximation.

c00f002

Figure 0.3. L2 approximation.

c00f003

If fn converges to f in L2[a, b], then, on average, fn is close to f; but for some values, fn(t) may be far away from f(t). For example, Figure 0.3 illustrates two functions that are close in L2 even though some of their function values are not close.

Example 0.9 The sequence of functions fn(t) = tn, n = 1, 2, 3,..., converges pointwise to f(t) = 0 on the interval 0 ≤ t < 1 because for any number 0 ≤ t < 1, tn → 0 as n → ∞. However, the convergence is not uniform. The rate at which tn approaches zero becomes slower as t approaches 1. For example, if t = 1/2 and ε = 0.001, then |tn| < ε provided that n ≥ 10. However, if t = 0.9, then |tn| is not less than ε until n ≥ 66.

For any fixed number r < 1, fn converges uniformly to f = 0 on the interval [0, r]. Indeed, if 0 ≤ t ≤ r, then |tn| ≤ rn. Therefore, as long as rn is less than ε, |fn(t)| will be less than ε for all 0 ≤ t ≤ r. In other words, the rate at which fn approaches zero for all points on the interval [0, r] is no worse than the rate at which rn approaches zero.

We also note that fn → 0 in L2[0, 1] because

c00e023
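
A short numerical companion to Example 0.9 (added here as an illustration): the L2[0, 1] norm of fn(t) = tn is 1/√(2n + 1), which tends to zero, while the supremum of |fn| over [0, 1) equals 1 for every n, so the convergence cannot be uniform.

```python
import numpy as np

t = np.linspace(0, 1, 100001)
for n in (1, 5, 20, 100):
    fn = t**n
    l2_norm = np.sqrt(np.sum(fn**2) / len(t))   # Riemann-sum estimate of ||f_n|| in L2[0,1]
    print(n, l2_norm, 1 / np.sqrt(2 * n + 1))   # estimate vs. exact value 1/sqrt(2n+1)
# The L2 norms tend to 0, yet sup |f_n(t)| over 0 <= t < 1 equals 1 for every n,
# so f_n -> 0 in L2[0,1] but not uniformly.
```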

As the following theorem shows, uniform convergence on a finite interval [a, b] is a stronger type of convergence than L2 convergence.

Theorem 0.10 If a sequence fn converges uniformly to f as n → ∞ on a finite interval a ≤ t ≤ b, then this sequence also converges to f in L2[a, b]. The converse of this statement is not true.

Proof. Using the definition of uniform convergence, we can choose, for a given tolerance ε > 0, an integer N such that

c00e024

This inequality implies

c00e025

Therefore, if n ≥ N, we have c00ie013. Since ε can be chosen as small as desired, this inequality implies that fn converges to f in L2.

To show that the converse is false, consider the following sequence of functions on 0 ≤ t ≤ 1.

c00e026

We leave it to the reader (see Exercise 6) to show that this sequence converges to the zero function in L2[0, 1] but does not converge to zero uniformly on 0 ≤ t ≤ 1 (in fact, fn does not even converge to zero pointwise).

In general, a sequence that converges pointwise does not necessarily converge in L2. However, if the sequence is uniformly bounded by a fixed function in L2, then pointwise convergence is enough to guarantee convergence in L2 (this is the Lebesgue Dominated Convergence Theorem; see Folland (1999)). Further examples illustrating the relationships between these three types of convergence are developed in the Exercises.

0.4 SCHWARZ AND TRIANGLE INEQUALITIES

The two most important properties of inner products are the Schwarz and triangle inequalities. The Schwarz inequality states |〈X, Y〉| ≤ ||X|| ||Y||. In R3, this inequality follows from the law of cosines:

c00e027

where θ is the angle between X and Y. The triangle inequality states ||X + Y|| ≤ ||X|| + || Y ||. In R3, this inequality follows from Figure 0.4, which expresses the fact that the shortest distance between two points is a straight line.
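
Before proving these inequalities in general, here is a quick numerical sanity check for random vectors in R3 with the Euclidean inner product (an added illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    X, Y = rng.standard_normal(3), rng.standard_normal(3)
    schwarz_ok = abs(np.dot(X, Y)) <= np.linalg.norm(X) * np.linalg.norm(Y)
    triangle_ok = np.linalg.norm(X + Y) <= np.linalg.norm(X) + np.linalg.norm(Y)
    print(schwarz_ok, triangle_ok)   # True True for every trial
```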

The following theorem states that the Schwarz and Triangle inequalities hold for general inner product spaces.

Theorem 0.11 Suppose V, 〈·, ·〉 is an inner product space (either real or complex). Then for all X, Y ∈ V we have the following:

  • Schwarz Inequality: |〈X, Y〉| ≤ ||X|| ||Y||. Equality holds if and only if X and Y are linearly dependent. Moreover, 〈X, Y〉 = ||X|| ||Y|| if and only if X or Y is a nonnegative multiple of the other.
  • Triangle Inequality: ||X + Y|| ≤ ||X|| + ||Y||. Equality holds if and only if X or Y is a nonnegative multiple of the other.

Figure 0.4. Triangle inequality.

c00f004

Proof.

Proof for Real Inner Product Spaces. Assume that one of the vectors, say Y, is nonzero, for otherwise there is nothing to show. Let t be a real variable and consider the following inequality:

(0.1)c00e028

(0.2)c00e029

The right side is a nonnegative quadratic polynomial in t, and so it cannot have two distinct real roots. Therefore, its discriminant (from the quadratic formula) must be nonpositive. In our case, this means

c00e030

Schwarz’s inequality follows by rearranging this inequality.

If 〈X, Y〉 = ||X|| ||Y||, then the preceding discriminant is zero, which means that the equation ||X – tY||2 = 0 has a double real root, c00ie014. In particular, c00ie015 0 or c00ie016, which implies that c00ie017. On the other hand, 〈X, Y〉 = ||X|| ||Y|| is nonnegative and therefore c00ie018. Thus c00ie019 is a nonnegative multiple of Y, as claimed. The converse (i.e., if X is a nonnegative multiple of Y, then 〈X, Y〉 = ||X|| ||Y||) is easy and left to the reader.

Proof for a Complex Inner Product Space. If V is a complex inner product space, the proof is similar. We let ϕ be an argument of 〈X, Y〉, which means

c00e031

Then we consider the following inequality:

c00e032

where “Re” stands for “the real part,” that is, c00ie020. In view of the choice of ϕ, the middle term is just –2t|〈X, Y〉| and so the term on the right equals the expression on the right side of (0.2). The rest of the argument is now the same as the argument given for the case of a real inner product space.

Proof of the Triangle Inequality. The proof of the triangle inequality now follows from the Schwarz inequality:

c00e033

Taking square roots of both sides of this inequality establishes the triangle inequality.

If the preceding inequality becomes an equality, then 〈X, Y〉 = ||X|| ||Y|| and the first part of the theorem implies that either X or Y is a nonnegative multiple of the other, as claimed.

0.5 ORTHOGONALITY

0.5.1 Definitions and Examples

For the standard inner product in R3, the law of cosines is

c00e034

which implies that X and Y are orthogonal (perpendicular) if and only if 〈X, Y〉 = 0. We make this equation the definition of orthogonality in general.

Definition 0.12 Suppose V is an inner product space.

  • The vectors X and Y in V are said to be orthogonal if 〈X, Y〉 = 0.
  • The collection of vectors ei, i = 1,..., N, is said to be orthonormal if each ei has unit length, ||ei|| = 1, and ei and ej are orthogonal for i ≠ j.
  • Two subspaces V1 and V2 of V are said to be orthogonal if each vector in V1 is orthogonal to every vector in V2.

An orthonormal basis or orthonormal system for V is a basis of vectors for V which is orthonormal.

Example 0.13 The line y = x generated by the vector (1, 1) is orthogonal to the line y = –x generated by (1, –1).

Example 0.14 The line x/2 = –y = z/3 in R3, which points in the direction of the vector (2, –1, 3), is orthogonal to the plane 2x – y + 3z = 0.

Example 0.15 For the space L2([0, 1]), any two functions such that the first is zero on the set where the second is nonzero are orthogonal.

For example, if f(t) is nonzero only on the interval 0 ≤ t ≤ 1/2 and g(t) is nonzero only on the interval 1/2 ≤ t < 1, then c00ie021 is always zero. Therefore c00ie022.

Example 0.16 Let

c00e035

Then ϕ and ψ are orthogonal in L2[0, 1] because

c00e036

In contrast to the previous example, note that ϕ and ψ are orthogonal and yet ϕ and ψ are nonzero on the same set, namely the interval 0 ≤ t ≤ 1. The function ϕ is called the scaling function and the function ψ is called the wavelet function for the Haar system. We shall revisit these functions in the chapter on Haar wavelets.

Example 0.17 The functions f(t) = sin t and g(t) = cos t are orthogonal in L2([–π, π]), because

c00e037

Since c00ie023, the functions c00ie024 and c00ie025 are orthonormal in L2([–π, π]). More generally, we shall show in the chapter on Fourier Series that the functions

c00e038

are orthonormal. This fact will be very important in our development of Fourier series.

Vectors can be easily expanded in terms of an orthonormal basis, as the following theorem shows.

Theorem 0.18 Suppose V0 is a subspace of an inner product space V. Suppose {e1,..., eN} is an orthonormal basis for V0. If v ∈ V0, then

c00e039

Proof. Since {e1,..., eN} is a basis for V0, any vector v ∈ V0 can be uniquely expressed as a linear combination of the ej:

c00e040

To evaluate the constant αk, take the inner product of both sides with ek:

c00e041

The only nonzero term on the right occurs when j = k since the ej are orthonormal. Therefore

c00e042

Thus, αk = 〈v, ek〉, as desired.

0.5.2 Orthogonal Projections

Suppose {e1,..., eN} is an orthonormal collection of vectors in an inner product space V. If v lies in the span of {e1,..., eN}, then as Theorem 0.18 demonstrates, the equation

(0.3)c00e043

is satisfied with αj = 〈v, ej〉. If v does not lie in the linear span of {e1,..., eN}, then solving Eq. (0.3) for αj is impossible. In this case, the best we can do is to determine the vector v0 belonging to the linear span of {e1,..., eN} that comes as close as possible to v. More generally, suppose V0 is a subspace of the inner product space V and suppose v ∈ V is a vector that is not in V0 (see Figure 0.5). How can we determine the vector v0 ∈ V0 that is closest to v? This vector (v0) has a special name given in the following definition.

Figure 0.5. Orthogonal projection of v onto V0.

c00f005

Definition 0.19 Suppose V0 is a finite-dimensional subspace of an inner product space V. For any vector v ∈ V, the orthogonal projection of v onto V0 is the unique vector v0 ∈ V0 that is closest to v, that is,

c00e044

As Figure 0.5 indicates, the vector v0, which is closest to v, must be chosen so that v – v0 is orthogonal to V0. Of course, figures are easily drawn when the underlying vector space is R2 or R3. In a more complicated inner product space, such as L2, figures are an abstraction, which may or may not be accurate (e.g., an element in L2 is not really a point in the plane as in Figure 0.5). The following theorem states that our intuition in R2 regarding orthogonality is accurate in a general inner product space.

Theorem 0.20 Suppose V0 is a finite-dimensional subspace of an inner product space V. Let v be any element in V. Then its orthogonal projection, v0, has the following property: v – v0 is orthogonal to every vector in V0.

Proof. We first show that if v0 is the closest vector to v, then v – v0 is orthogonal to every vector w ∈ V0. Consider the function

c00e045

which describes the square of the distance between v0 + tw ∈ V0 and v. If v0 is the closest element of V0 to v, then f is minimized when t = 0. For simplicity, we will consider the case where the underlying inner product space V is real. Expanding f, we have

c00e046

Since f is minimized when t = 0, its derivative at t = 0 must be zero. We have

c00e047

So

(0.4)c00e048

and we conclude that v0 – v is orthogonal to w.

The converse also holds: If v0 – v is orthogonal to w, then from (0.4) we have f′(0) = 0. Since f(t) is a nonnegative quadratic polynomial in t, this critical point t = 0 must correspond to a minimum. Therefore, ||v0 + tw – v||2 is minimized when t = 0. Since w is an arbitrarily chosen vector in V0, we conclude that v0 is the closest vector in V0 to v.

In terms of an orthonormal basis for V0, the projection of a vector v onto V0 is easy to compute, as the following theorem states.

Theorem 0.21 Suppose V is an inner product space and V0 is an N-dimensional subspace with orthonormal basis {e1, e2, …, eN}. The orthogonal projection of a vector v ∈ V onto V0 is given by

c00e049

Note. In the special case that v belongs to V0, v equals its orthogonal projection, v0. In this case, the preceding formula for v0 = v is the same as the one given in Theorem 0.18.

Proof. Let c00ie026 with αj = 〈v, ej〉. In view of Theorem 0.20, we must show that v – v0 is orthogonal to any vector w ∈ V0. Since e1,..., eN is a basis for V0, it suffices to show that v – v0 is orthogonal to each ek, k = 1,..., N. We have

c00e050

Since e1,..., eN are orthonormal, the only contributing term to the sum is when j = k:

c00e051

Thus, v – v0 is orthogonal to each ek and hence to all of V0, as desired.

Example 0.22 Let V0 be the space spanned by cos x and sin x in L2([–π, π]). As computed in Example 0.17, the functions

c00e052

are orthonormal in L2([–π, π]). Let f(x) = x. The projection of f onto V0 is given by

c00e053

Now, f(x) cos(x) = x cos(x) is odd and so c00ie027 0. For the other term,

c00e054

Therefore the projection of f (x) = x onto V0 is given by

c00e055
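
A numerical check of Example 0.22 (an added illustration): approximating the coefficients 〈f, e〉 by Riemann sums shows that the cosine coefficient vanishes and the sine coefficient is 2√π, so the projection works out to 2 sin x.

```python
import numpy as np

x = np.linspace(-np.pi, np.pi, 200001)
dx = x[1] - x[0]
f = x                                    # f(x) = x

e_cos = np.cos(x) / np.sqrt(np.pi)       # orthonormal basis of V0 from Example 0.17
e_sin = np.sin(x) / np.sqrt(np.pi)

a_cos = np.sum(f * e_cos) * dx           # <f, e_cos> ~ 0 (odd integrand)
a_sin = np.sum(f * e_sin) * dx           # <f, e_sin> ~ 2*sqrt(pi)
print(a_cos, a_sin, 2 * np.sqrt(np.pi))
# projection of f onto V0:  a_cos*e_cos + a_sin*e_sin  ~  2*sin(x)
```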

Example 0.23 Consider the space V1 which is spanned by ϕ(x) = 1 on 0 ≤ x < 1 and

c00e056

The functions ϕ and ψ are the Haar scaling function and wavelet function mentioned earlier. These two functions are orthonormal in L2([0, 1]). Let f(x) = x. As you can check,

c00e057

and

c00e058

So the orthogonal projection of the function f onto V1 is given by

c00e059
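
A numerical check of Example 0.23 (an added illustration), assuming the standard Haar wavelet ψ (equal to 1 on [0, 1/2) and –1 on [1/2, 1)): the coefficients come out to 〈f, ϕ〉 = 1/2 and 〈f, ψ〉 = –1/4, so the projection equals 1/4 on [0, 1/2) and 3/4 on [1/2, 1), the average of f on each half.

```python
import numpy as np

x = np.linspace(0, 1, 200000, endpoint=False)
dx = 1.0 / len(x)
f = x

phi = np.ones_like(x)                    # Haar scaling function on [0, 1)
psi = np.where(x < 0.5, 1.0, -1.0)       # standard Haar wavelet on [0, 1) (assumed form)

c_phi = np.sum(f * phi) * dx             # <f, phi> ~ 1/2
c_psi = np.sum(f * psi) * dx             # <f, psi> ~ -1/4
proj = c_phi * phi + c_psi * psi         # piecewise constant: 1/4 on [0,1/2), 3/4 on [1/2,1)
print(c_phi, c_psi, proj[0], proj[-1])
```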

The set of vectors that are orthogonal to a given subspace has a special name.

Definition 0.24 Suppose V0 is a subspace of an inner product space V. The orthogonal complement of V0, denoted c00ie029, is the set of all vectors in V which are orthogonal to V0, that is,

c00e060

As Figure 0.6 indicates, each vector can be written as a sum of a vector in V0 and a vector in c00ie029. The intuition from this Euclidean figure is accurate for more general inner product spaces as the following theorem demonstrates.

Theorem 0.25 Suppose V0 is a finite-dimensional subspace of an inner product space V. Each vector v ∈ V can be written uniquely as v = v0 + v1, where v0 belongs to V0 and v1 belongs to c00ie029, that is,

c00e061

Proof. Suppose v belongs to V and let v0 be its orthogonal projection onto V0. Let v1 = v – v0; then

c00e062

By Theorem 0.20, v1 is orthogonal to every vector in V0. Therefore, v1 belongs to c00ie029.

Example 0.26 Consider the plane V0 = {2x – y + 3z = 0}. The set of vectors

c00e063

forms an orthonormal basis. So given v = (x, y, z) ∈ R3, the vector

c00e064

is the orthogonal projection of v onto the plane V0.

The vector c00ie030 is a unit vector that is perpendicular to this plane. So

c00e065

is the orthogonal projection of v = (x, y, z) onto c00ie029.

The theorems in this section are valid for certain infinite-dimensional subspaces, but a discussion of infinite dimensions involves more advanced ideas from functional analysis (see Rudin (1973)).

0.5.3 Gram-Schmidt Orthogonalization

Theorems 0.18 and 0.21 indicate the importance of finding an orthonormal basis. Without an orthonormal basis, the computation of an orthogonal projection onto a subspace is more difficult. If an orthonormal basis is not readily available, then the Gram-Schmidt orthogonalization procedure describes a way to construct one for a given subspace.

Theorem 0.27 Suppose V0 is a subspace of dimension N of an inner product space V. Let vj, j = 1,..., N be a basis for V0. Then there is an orthonormal basis {e1,..., eN} for V0 such that each ej is a linear combination of v1,..., vj.

Proof. We first define e1 = v1/||v1||. Clearly, e1 has unit length. Let v0 be the orthogonal projection of v2 onto the line spanned by e1. From Theorem 0.21,

c00e066

Figure 0.7 suggests that the vector from v0 to v2 is orthogonal to e1. So let

c00e067

Figure 0.7. Gram–Schmidt orthogonalization.

c00f007

and note that

c00e068

which confirms our intuition from Figure 0.7.

Note that E2 cannot equal zero because otherwise v2 and e1 (and hence v2 and v1) would be linearly dependent. To get a vector of unit length, we define e2 = E2/||E2||. The vectors e1 and e2 are orthogonal to each other; and since e1 is a multiple of v1, the vector e2 is a linear combination of v1 and v2.

If N > 2, then we continue the process. We consider the orthogonal projection of v3 onto the space spanned by e1 and e2:

c00e069

Then let

c00e070

and set e3 = E3/||E3||. The same argument as before (for E2) shows that E3 is orthogonal to both e1 and e2. Thus, {e1, e2, e3} is an orthonormal set of vectors. The pattern is now clear.
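
A minimal sketch of the Gram–Schmidt procedure just described, written for vectors in Cn with the standard inner product (an added illustration; for a subspace of L2 the inner product would be an integral instead of np.vdot):

```python
import numpy as np

def gram_schmidt(vs):
    """Orthonormalize a list of linearly independent vectors (standard C^n inner product)."""
    es = []
    for v in vs:
        E = v.astype(complex)
        for e in es:
            E = E - np.vdot(e, v) * e     # subtract the projection <v, e> e
        es.append(E / np.linalg.norm(E))  # normalize; E != 0 since the vs are independent
    return es

vs = [np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0])]
es = gram_schmidt(vs)
print(np.round([[np.vdot(a, b) for b in es] for a in es], 10))   # identity matrix
```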

0.6 LINEAR OPERATORS AND THEIR ADJOINTS

0.6.1 Linear Operators

First, we recall the definition of a linear map.

Definition 0.28 A linear operator (or map) between a vector space V and a vector space W is a function T : V → W that satisfies

c00e071

If V and W are finite-dimensional, then T is often identified with its matrix representation with respect to a given choice of bases, say {v1,..., vn} for V and {w1,..., wm} for W. For each 1 ≤ j ≤ n, T(vj) belongs to W and therefore it can be expanded in terms of w1,..., wm:

(0.5)c00e072

where aij are complex numbers. The value of T(v) for any vector c00ie031c00ie032 can be computed by

c00e073

The coefficient of wi is c00ie033, which can be identified with the ith entry in the following matrix product:

c00e074

Thus, the matrix (aij) is determined by how the basis vectors vj ∈ V are mapped into the basis vectors wi of W [see Eq. (0.5)]. The matrix then determines how an arbitrary vector v maps into W.

Orthogonal projections also define linear operators. If V0 is a finite-dimensional subspace of an inner product space V, then by Theorem 0.25 we can define the orthogonal projection onto V0 to be P, where P(v) = v0. By Theorem 0.21, it is easy to show that P is a linear operator.

A linear operator T : V → W is said to be bounded if there is a number 0 ≤ M < ∞ such that

c00e075

In this case, the norm of T is defined to be the smallest such M. In words, a bounded operator takes the unit ball in V to a bounded set in W. As a result, a bounded operator takes bounded sets in V to bounded sets in W. All linear maps between finite-dimensional inner product spaces are bounded. As another example, the orthogonal projection map from any inner product space onto any of its subspaces (finite- or infinite-dimensional) is a bounded linear operator.

0.6.2 Adjoints

If V and W are inner product spaces, then we will sometimes need to compute 〈T(v), w〉W by shifting the operator T to the other side of the inner product. In other words, we want to write

c00e076

for some operator T* : W → V. We formalize the definition of such a map as follows.

Definition 0.29 Suppose T : V → W is a linear operator between two inner product spaces. The adjoint of T is the linear operator T* : W → V that satisfies

c00e077

Every bounded linear operator between two inner product spaces has an adjoint. Here are two examples of adjoints of linear operators.

Example 0.30 Let V = Cn and W = Cm with the standard inner products. Suppose T : Cn → Cm is a linear operator with matrix aij ∈ C with respect to the standard basis

c00e078

If X = (x1,..., xn) ∈ Cn and Y = (y1,..., ym) ∈ Cm, then

c00e079

where c00ie034 (the conjugate of the transpose). The right side is 〈X, T*(Y)〉, where the jth component of T*(Y) is c00ie035. Thus the matrix for the adjoint of T is the conjugate of the transpose of the matrix for T.
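
A numerical check of this example (an added illustration): for a random complex matrix A representing T, the identity 〈T(X), Y〉 = 〈X, T*(Y)〉 holds with T* given by the conjugate transpose.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 4
A = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))   # matrix of T : C^n -> C^m
X = rng.standard_normal(n) + 1j * rng.standard_normal(n)
Y = rng.standard_normal(m) + 1j * rng.standard_normal(m)

A_star = A.conj().T                       # adjoint: conjugate of the transpose
lhs = np.sum((A @ X) * np.conj(Y))        # <T(X), Y> in C^m
rhs = np.sum(X * np.conj(A_star @ Y))     # <X, T*(Y)> in C^n
print(np.isclose(lhs, rhs))               # True
```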

Example 0.31 Suppose g is a bounded function on the interval a ≤ x ≤ b. Let Tg : L2([a, b]) → L2([a, b]) be defined by

c00e080

The adjoint of Tg is just

c00e081

because

c00e082

The next theorem computes the adjoint of the composition of two operators.

Theorem 0.32 Suppose T1 : V → W and T2 : W → U are bounded linear operators between inner product spaces. Then c00ie036.

Proof. For v ∈ V and u ∈ U, we have

c00e083

On the other hand, the definition of adjoint implies

c00e084

Therefore

(0.6)c00e085

for all v ∈ V. By Exercise 17, if u0 and u1 are vectors in V with 〈v, u0〉 = 〈v, u1〉 for all v ∈ V, then u0 = u1. Therefore, from Eq. (0.6), we conclude that

c00e086

as desired.

In the next theorem, we compute the adjoint of an orthogonal projection.

Theorem 0.33 Suppose V0 is a subspace of an inner product space V. Let P be the orthogonal projection onto V0. Then P* = P.

Proof. From Theorem 0.25, each vector v ∈ V can be written as v = v0 + v1 with v0V0 and v1 orthogonal to V0. Note that P(v) = v0. Similarly, we can write any u ∈ V as u = u0 + u1 with u0 = P(u) ∈ V0 and c00ie037. We have

c00e087

since v0 ∈ V0 and c00ie037.

Similarly,

c00e088

Therefore 〈v, P(u)〉 = 〈P(v), u〉 and so P* = P.

0.7 LEAST SQUARES AND LINEAR PREDICTIVE CODING

In this section, we apply some of the basic facts about linear algebra and inner product spaces to the topic of least squares analysis. As motivation, we first describe how to find the best-fit line to a collection of data. Then, the general idea behind the least squares algorithm is presented. As a further application of least squares, we present the ideas behind linear predictive coding, which is a data compression algorithm. In the next chapter, least squares will be used to develop a procedure for approximating signals (or functions) by trigonometric polynomials.

0.7.1 Best-Fit Line for Data

Consider the following problem. Suppose real data points xi and yi for i = 1,..., N are given with xi ≠ xj for i ≠ j. We wish to find the equation of the line y = mx + b which comes closest to fitting all the data. Figure 0.8 gives an example of four data points (indicated by the four small circles) as well as the graph of the line which comes closest to passing through these four points.

The word closest here means that the sum of the squares of the errors (between the data and the line) is smaller than the corresponding error with any other line. Suppose the line we seek is y = mx + b. As Figure 0.9 demonstrates, the error between this line at x = xi and the data point (xi, yi) is |yi – (mxi + b)|. Therefore, we seek numbers m and b which minimize the quantity

c00e089

Figure 0.8. Least squares approximation.

c00f008

Figure 0.9. Error at xi is |yi – (mxi + b)| (length of dashed line).

c00f009

The quantity E can be viewed as the square of the distance (in RN) from the vector

c00e090

to the vector mX + bU, where

(0.7)c00e091

As m and b vary over all possible real numbers, the expression mX + bU sweeps out a two-dimensional plane M in RN. Thus our problem of least squares has the following geometric interpretation: Find the point P = mX + bU on M which is closest to the point Y (see Figure 0.10). The point P must be the orthogonal projection of Y onto M. In particular, Y – P must be orthogonal to M. Since M is generated by the vectors X and U, Y – P must be orthogonal to both X and U. Therefore, we seek the point P = mX + bU that satisfies the following two equations:

c00e092

or

c00e093

These equations can be rewritten in matrix form as

c00e094

Figure 0.10. Closest point on the plane to the point Y .

c00f010

Keep in mind that the xi and yi are the known data points. The solution to our least squares problem is obtained by solving the preceding linear system for the unknowns m and b. This discussion is summarized in the following theorem.

Theorem 0.34 Suppose X = {x1, x2,..., xN} and Y = {y1, y2,..., yN} are two sets of real data points. The equation of the line y = mx + b which most closely approximates the data (x1, y1),..., (xN, yN) in the sense of least squares is obtained by solving the linear equation

c00e095

for m and b where

c00e096

If the xi are distinct, then this system of equations has the following unique solution for m and b:

c00e097

where c00ie038 and c00ie039.

Proof. We leave the computation of the formulas for m and b as an exercise (see Exercise 24). The statement regarding the unique solution for m and b will follow once we show that the matrix ZTZ is nonsingular (i.e., invertible). If the xi are not all the same (in particular, if they are distinct), then the vectors X and U are linearly independent and so the matrix Z has rank two. In addition, for any V ∈ R2

c00e098

Since Z is a matrix of maximal rank (i.e., rank two), the only way ZV can be zero is if V is zero. Therefore, 〈(ZTZ)V, V〉 > 0 for all nonzero V, which means that ZTZ is positive definite. In addition, the matrix ZTZ is symmetric because its transpose, (ZTZ)T, equals itself. By a standard fact from linear algebra, this positive definite symmetric matrix must be nonsingular. Thus, the equation

c00e099

has a unique solution for m and b.
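
A short sketch of Theorem 0.34 in Python/NumPy (an added illustration with made-up data): form Z with columns X and U, solve the 2 × 2 normal equations ZTZ V = ZTY, and compare with a library least squares solver.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.2, 2.9])               # hypothetical data points

Z = np.column_stack([x, np.ones_like(x)])        # columns X and U
m, b = np.linalg.solve(Z.T @ Z, Z.T @ y)         # normal equations (Z^T Z) V = Z^T Y
print(m, b)
print(np.linalg.lstsq(Z, y, rcond=None)[0])      # same answer from a library solver
```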

0.7.2 General Least Squares Algorithm

Suppose Z is an N × q matrix (with possibly complex entries) and let Y be a vector in RN (or CN). Linear algebra is, in part, the study of the equation ZV = Y, which when written out in detail is

c00e100

If N > q, then the equation ZV = Y does not usually have a solution for V ∈ Cq because there are more equations (N) than there are unknowns (v1,..., vq). If there is no solution, the problem of least squares asks for the next best quantity: Find the vector V ∈ Cq such that ZV is as close as possible to Y.

In the case of finding the best-fit line to a set of data points (xi, yi), i = 1,..., N, the matrix Z is

(0.8)c00e101

and the vectors Y and V are

c00e102

In this case, the matrix product ZV is

c00e103

where X and U are the vectors given in Eq. (0.7). Thus finding the V = (m, b) so that ZV is closest to Y is equivalent to finding the slope and y-intercept of the best fit line to the data (xi, yi), i = 1 ,..., N, as in the last section.

The solution to the general least squares problem is given in the following theorem.

Theorem 0.35 Suppose Z is an N × q matrix (with possibly complex entries) of maximal rank and with N ≥ q. Let Y be a vector in RN (or CN). There is a unique vector V ∈ Cq such that ZV is closest to Y. Moreover, the vector V is the unique solution to the matrix equation

c00e104

Figure 0.11. Y – ZV must be orthogonal to M = span{Z1,..., Zq}.

c00f011

If Z is a matrix with real entries, then the preceding equation becomes

c00e105

Note that in the case of the best-fit line, the matrix Z in Eq. (0.8) and the equation ZTY = ZTZV are the same as those given in Theorem 0.34.

Proof. The proof of this theorem is similar to the proof given in the construction of the best-fit line. We let Z1,..., Zq be the columns of the matrix Z. Then ZV = v1Z1 + … + vqZq is a point that lies in the subspace M ⊂ CN generated by Z1,..., Zq. We wish to find the point ZV that is closest to Y. As in Figure 0.11, Y – ZV must be orthogonal to M; or equivalently, Y – ZV must be orthogonal to Z1,..., Zq which generate M. Thus

c00e106

These equations can be written succinctly as

c00e107

because the ith component of this (vector) equation is the inner product of Y – ZV with Zi. This equation can be rearranged to read

c00e108

as claimed in the theorem.

The matrix Z*Z has dimension q × q and by the same arguments used in the proof of Theorem 0.34, you can show that this matrix is nonsingular (using the fact that Z has maximal rank). Therefore, the equation

c00e109

has a unique solution V ∈ Cq as claimed.

Example 0.36 If a set of real data points {(xi, yi), i = 1,..., N} behaves in a quadratic rather than a linear fashion, then a best-fit quadratic equation of the form y = ax2 + bx + c can be found. In this case, we seek a, b, and c which minimize the quantity

c00e110

We can apply Theorem 0.35 with

c00e111

From Theorem 0.35, the solution V = (a, b, c) to this least squares problem is the solution to ZT ZV = ZTY. Exercise 28 asks you to solve this system with specific numerical data.
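
A sketch of this example with made-up data (the numerical data of Exercise 28 are not reproduced here): build Z with columns xi2, xi, 1 and solve the normal equations.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([4.1, 0.9, 0.2, 1.1, 3.8])          # roughly quadratic, hypothetical values

Z = np.column_stack([x**2, x, np.ones_like(x)])
a, b, c = np.linalg.solve(Z.T @ Z, Z.T @ y)      # least squares coefficients of y = a x^2 + b x + c
print(a, b, c)
```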

0.7.3 Linear Predictive Coding

Here, we will apply the least squares analysis procedure to the problem of efficiently transmitting a signal. As mentioned earlier, computers can process millions—and in some cases, billions—of instructions per second. However, if the output must be transmitted from one location to another (say a picture downloaded from the web), the signal must often be sent over telephone lines or some other medium that can only transmit thousands of bytes per second (in the case of telephone lines, this rate is currently about 60 kilobytes per second). Therefore instead of transmitting all the data points of a signal, some sort of coding algorithm (data compression) is applied so that only the essential parts of the signal are transmitted.

Let us suppose we are transmitting a signal, which after some discretization process can be thought of as a long string of numbers

c00e112

(zeros and ones, perhaps). For simplicity, we will assume that each xi is real. Often, there is a repetition of some pattern of the signal (redundancy). If the repetition is perfect (say 1,1,0, 1,1,0, 1,1,0, etc.), then there would be no need to send all the digits. We would only need to transmit the pattern 1,1,0 and the number of times this pattern is repeated. Usually, however, there is not a perfect repetition of a pattern, but there may be some pattern that is nearly repetitive. For example, the rhythm of a beating heart is nearly, but not exactly, repetitive (if it is a healthy heart). If there is a near repetition of some pattern, then the following linear predictive coding procedure can achieve significant compression of data.

Main Idea. The idea behind linear predictive coding is to divide up the data into blocks of length N, where N is a large number.

c00e113

Let’s consider the first block of data x1,... , xN. We choose a number p that should be small compared to N. The linear predictive coding scheme will provide the best results (best compression) if p is chosen close to the number of digits in the near repetitive pattern of this block of data. Next, we try to find numbers a1, a2,... , ap that minimize the terms

(0.9)c00e114

in the sense of least squares. Once this is done (the details of which will be presented later), then the idea is to transmit x1... xp as well as a1,... , ap. Instead of transmitting xp+1, xp+2,... , we use the following scheme starting with n = p + 1. If e(p + 1) is smaller than some specified tolerance, then we can treat e(p + 1) as zero. By letting n = p + 1 and e(p + 1) = 0 in Eq. (0.9), we have

c00e115

There is no need to transmit xp+1 because the data x1... xp as well as a1... ap have already been transmitted and so the receiver can reconstruct xp+1 according to the preceding formula. If e(p + 1) is larger than the specified tolerance, then xp+1 (or equivalently e(p + 1) ) needs to be transmitted.

Once the receiver has reconstructed (or received) xp+1, n can be incremented to p + 2 in Eq. (0.9). If e(p + 2) is smaller than the tolerance, then xp+2 does not need to be transmitted and the receiver can reconstruct xp+2 by setting e(p + 2) = 0 in Eq. (0.9), giving

c00e116

The rest of the xp+3,... , xN can be reconstructed by the receiver in a similar fashion.

The hope is that if the ai have been chosen to minimize {e(p + 1),... , e(N)} in the sense of least squares, then most of the |e(n)| will be less than the specified tolerance and therefore most of the xn can be reconstructed by the receiver and not actually transmitted. The result is that instead of transmitting N pieces of data (i.e., x1,... , xN), we only need to transmit 2p pieces of data (i.e., a1,... , ap and x1,... , xp) and those (hopefully few) values of xn where |e(n)| is larger than the tolerance. Since 2p is typically much less than N, significant data compression can be achieved. The other blocks of data can be handled similarly, with possibly different values of p.

Role of Least Squares. To find the coefficients a1,..., ap, we use Theorem 0.35. We start by putting Eq. (0.9), for n = p + 1,..., N, in matrix form:

c00e117

where

c00e118

and

c00e119

We want to choose V = (a1,..., ap)T so that ||E|| is as small as possible—or in other words, so that ZV is as close as possible to Y. From Theorem 0.35, V = (a1,..., ap)T is found by solving the following (real) matrix equation:

c00e120

Written out in detail, this equation is

(0.10)c00e121

where we have labeled the columns of the matrix Z by Zp,..., Z1 (reverse order). The horizontal dots on either side of the c00ie040 indicate that these entries are row vectors. Likewise, the vertical dots above and below the Zi indicate that these entries are column vectors.

Equation (0.10) is a p × p system of equations for the a1,... , ap which can be solved in terms of the Z-vectors (i.e., the original signal points, x) via Gaussian elimination.
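
A minimal sketch of this computation (an added illustration, assuming the prediction model xn ≈ a1xn–1 + … + apxn–p implicit in Eq. (0.9), with a made-up signal): build Y and Z from one block, solve the normal equations for a1,..., ap, and count how many residuals e(n) exceed a tolerance.

```python
import numpy as np

# Hypothetical signal: a period-6 pattern plus a small perturbation (cf. Exercise 29).
# The perturbation also keeps Z of maximal rank.
N, p = 60, 6
x = np.sin(np.pi * np.arange(1, N + 1) / 3) + 0.01 * np.random.default_rng(2).standard_normal(N)

Y = x[p:]                                                        # (x_{p+1}, ..., x_N)
Z = np.column_stack([x[p - i:N - i] for i in range(1, p + 1)])   # row for n holds (x_{n-1},...,x_{n-p})

a = np.linalg.solve(Z.T @ Z, Z.T @ Y)                # normal equations Z^T Z V = Z^T Y
e = Y - Z @ a                                        # residual errors e(p+1), ..., e(N)
print(a)
print(np.sum(np.abs(e) > 0.1), "of", len(e), "samples exceed the tolerance 0.1")
```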

Summary of Linear Predictive Coding. Linear predictive coding involves the following procedure.

1. The sender cuts the data into blocks

c00e122

where each block has some near repetitive pattern. The sender then chooses p close to the length of the repetitive pattern for the first block.

2. For 1 ≤ i ≤ p, form the vectors

c00e123

3. The sender solves the system of equations (0.10) for the coefficients a1,..., ap and transmits to the receiver both a1,..., ap and x1,..., xp.

4. The receiver then reconstructs xp+1,... , xN (in this order) via the equation

c00e124

for those xn where the corresponding least squares errors, e(n), are smaller than some specified tolerance. If e(n) is larger than the tolerance, then the sender must transmit xn.

Certainly, some work is required for the sender to solve the preceding equations for the a1,... , ap and for the receiver to reconstruct the xn. One may wonder whether this work is more than the energy required to transmit all the xi. However, it should be kept in mind that the work required to solve for the ai and reconstruct the xn is done by the sender and receiver with computers that can do millions or billions of operations per second, whereas the transmission lines may only handle thousands of data bits per second. So the goal is to shift, as much as possible, the burden from the relatively slow process of transmitting the data to the much faster process of performing computations by the computers located at either the sender or the receiver.

EXERCISES

1. Verify that the function

c00e125

for Z = (Z1,... , Zn), W = (W1 ,..., Wn) ∈ Cn defines an inner product on Cn (i.e., satisfies Definition 0.1).

2. Verify that the functions 〈 , 〉 defined in Examples 0.2 and 0.3 are inner products.

3. Define 〈V, W〉 for V = (v1, v2) and W = (w1, w2) ∈ C2 as

c00e126

Show that 〈V, V〉 = 0 for all vectors V = (v1, v2) with v1 + 2v2 = 0. Does 〈·, ·〉 define an inner product?

4. Show that the L2[a, b] inner product satisfies the following properties.

  • The L2 inner product is conjugate-symmetric (i.e., c00ie041), homogeneous, and bilinear (these properties are listed in Definition 0.1).
  • Show that the L2 inner product satisfies positivity on the space of continuous functions on [a, b] by using the following outline.

(a) We want to show that if c00ie042, then f(t) = 0 for all a ≤ t ≤ b.

(b) Suppose, by contradiction, that |f(t0)| > 0; then use the definition of continuity to show that |f(t)| > |f(t0)|/2 on an interval of the form [t0 – δ, t0 + δ].

(c) Then show

c00e127

which contradicts the assumption that c00ie043.

5. Show that c00ie044 defines an inner product on l2.

6. For n > 0, let

c00e128

Show that fn → 0 in L2[0, 1]. Show that fn does not converge to zero uniformly on [0, 1].

7. For n > 0, let

c00e129

Show that fn → 0 in L2[0, 1] but that fn(0) does not converge to zero.

8. Is Theorem 0.10 true on an infinite interval such as [0, ∞)?

9. Compute the orthogonal complement of the space in R3 spanned by the vector (1, –2, 1).

10. Let f(t) = 1 on 0 ≤ t ≤ 1. Show that the orthogonal complement of f in L2[0, 1] is the set of all functions whose average value is zero.

11. Show that if a differentiable function, f, is orthogonal to cos(t) on L2[0, π] then f′ is orthogonal to sin(t) in L2[0, π]. Hint: Integrate by parts.

12. By using the Gram-Schmidt Orthogonalization, find an orthonormal basis for the subspace of L2[0, 1] spanned by 1, x, x2, x3.

13. Find the L2[0, 1] projection of the function cos x onto the space spanned by 1, x, x2, x3.

14. Find the L2[–π, π] projection of the function f(x) = x2 onto the space Vn ⊂ L2[–π, π] spanned by

c00e130

for n = 1. Repeat this exercise for n = 2 and n = 3. Plot these projections along with f using a computer algebra system. Repeat for g(x) = x3.

15. Project the function f(x) = x onto the space spanned by ϕ(x), ψ(x), ψ(2x), ψ(2x – 1) ∈ L2[0, 1], where

c00e131

16. Let D = {(x, y) ∈ R2; x2 + y2 ≤ 1}. Let

c00e132

Define an inner product on L2(D) by

c00e133

Let ϕn(x, y) = (x + iy)n, n = 0, 1, 2,.... Show that this collection of functions is orthogonal in L2(D) and compute ||ϕn||. Hint: Use polar coordinates.

17. Suppose u0 and u1 are vectors in the inner product space V with 〈u0, v〉 = 〈u1, v〉 for all v ∈ V. Show that u0 = u1. Hint: Let v = u0 – u1.

18. Suppose A is an n × n matrix with complex entries. Show that the following are equivalent.

(a) The rows of A form an orthonormal basis in Cn.

(b) AA* = I (the identity matrix).

(c) ||Ax|| = ||x|| for all vectors x ∈ Cn.

19. Suppose K(x, y) is a continuous function which vanishes outside a bounded set in R × R. Define T : L2(R) →L2(R) by

c00e134

Show c00ie045. Note the parallel with the adjoint of a matrix c00ie046.

20. Suppose A : V → W is a linear map between two inner product spaces. Show that c00ie047. Note: Ker stands for Kernel; Ker(A*) is the set of all vectors in W which are sent to zero by A*.

21. Prove the following theorem (Fredholm’s Alternative). Suppose A : V → W is a linear map between two inner product spaces. Let b be any element in W. Then either

  • Ax = b has a solution for some x ∈ V or
  • There is a vector w ∈ W with A*w = 0 and 〈b, w〉W ≠ 0.

22. Suppose V0 is a finite-dimensional subspace of an inner product space, V. Show that c00ie048. Hint: The inclusion ⊂ is easy; for the reverse inclusion, take any element c00ie049 and then use Theorem 0.25 to decompose w into its components in V0 and c00ie029. Show that its c00ie029 component is zero.

23. Show that a set of orthonormal vectors is linearly independent.

24. Verify the formulas for m and b given in Theorem 0.34.

25. Prove the uniqueness part of Theorem 0.35. Hint: See the proof of the uniqueness part of Theorem 0.34.

26. Obtain an alternative proof (using calculus) of Theorem 0.34 by using the following outline.

(a) Show that the least squares problem is equivalent to finding m and b to minimize the error quantity

c00e135

(b) From calculus, show that this minimum occurs when

c00e136

(c) Solve these two equations for m and b.

27. Obtain the best-fit least squares line for these data: c00ie050

28. Repeat the previous problem with the best-fit least squares parabola.

29. This exercise is best done with Matlab (or something equivalent). The goal of this exercise is to use linear predictive coding to compress strings of numbers. Choose X = (x1,..., xN), where xj is a periodic sequence of period p and length N. For example, try xj = sin(jπ/3) for 1 ≤ j ≤ N = 60, which is a periodic sequence of period p = 6. Apply the linear predictive coding scheme to compute a1,..., ap. Compute the residual E = Y – ZV. If done correctly, this residual should be theoretically zero (although the use of a computer will introduce a small round-off error). Now perturb X by a small randomly generated sequence (in Matlab, add rand(1,60) to X). Then re-apply linear predictive coding and see how many terms in the residual E are small (say less than 0.1). Repeat with other sequences X on your own.
