Appendix A. Linear Modeling and Linear Algebra Basics

Overview of Linear Classification

When we have a labeled dataset, the feature space is strewn with data points from different classes. It is the job of the classifier to separate the data points from different classes. It can do so by producing an output that is very different for data points from one class versus another. For instance, when there are only two classes, then a good classifier should produce large outputs for one class, and small ones for another. The points right on the cusp of being in one class versus another form a decision surface (Figure A-1).

Figure A-1. Simple binary classification finds a surface that separates two classes of data points

Many functions can be made into classifiers. It’s a good idea to look for the simplest function that cleanly separates the classes, for a few reasons. First of all, it’s easier to find the best simple separator than the best complex separator. Also, simple functions often generalize better to new data, because it’s harder to tailor them too specifically to the training data (a concept known as overfitting). A simple model might make mistakes—like in Figure A-1, where some points are on the wrong side of the divide—but we’re willing to sacrifice some training accuracy in order to have a simpler decision surface that can achieve better test accuracy. The principle of minimizing complexity and maximizing usefulness is called “Occam’s razor,” and is widely applicable in science and engineering.

The simplest function is a line. A linear function of one input variable is a familiar sight (Figure A-2).

Figure A-2. A linear function of one input variable

A linear function with two input variables can be visualized as either a flat plane in 3D or a contour plot in 2D (shown in Figure A-3). Like a topographic map, each line of the contour plot represents points in the input space that have the same output.

It’s harder to visualize higher-dimensional linear functions, which are called hyperplanes. But it’s easy enough to write down the algebraic formula. A multidimensional linear function has a set of inputs x_1, x_2, ..., x_n and a set of weight parameters w_0, w_1, ..., w_n:

f_w(x_1, x_2, ..., x_n) = w_0 + w_1 x_1 + w_2 x_2 + ... + w_n x_n

It can be written more succinctly using vector notation:

f_w(x) = x^T w

Figure A-3. Contour plot of a linear function in 2D

We follow the usual mathematical convention of using boldface to indicate a vector and non-boldface to indicate a scalar. The vector x is padded with an extra 1 at the beginning, as a placeholder for the intercept term w_0. If all input features are 0, then the output of the function is w_0. So, w_0 is also known as the bias or intercept term.
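
To make this concrete, here is a minimal NumPy sketch (the weights and feature values are made up for illustration) that pads a feature vector with a leading 1 and evaluates f_w(x) = x^T w:

    import numpy as np

    # Evaluate f_w(x) = x^T w for a single data point. The feature vector is
    # padded with a leading 1 so that w[0] plays the role of the bias term w_0.
    w = np.array([0.5, 2.0, -1.0])        # [w_0, w_1, w_2]
    x_raw = np.array([3.0, 4.0])          # original features [x_1, x_2]
    x = np.concatenate(([1.0], x_raw))    # padded input [1, x_1, x_2]

    print(x @ w)                          # 0.5 + 2.0*3.0 - 1.0*4.0 = 2.5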

Training a linear classifier is equivalent to picking out the best separating hyperplane between the classes. This translates into finding the best vector w that is oriented exactly right in space. Since each data point has a target label y, we could find a w that tries to directly emulate the target label:1

x^T w = y

Since there is usually more than one data point, we want a w that simultaneously makes all of the predictions close to the target labels:

Aw = y

Here, A is known as the data matrix (also known as the design matrix in statistics). It contains the data in a particular form: each row is a data point and each column a feature. (Sometimes people also look at its transpose, where features are the rows and data points the columns.)
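
Building on the earlier sketch, a toy data matrix might look like the following; each row is one (padded) data point, and multiplying by w produces one prediction per row. The numbers are again made up for illustration:

    import numpy as np

    # A toy data matrix: each row is a (padded) data point, each column a feature.
    # The leading column of 1s corresponds to the intercept term.
    A = np.array([[1.0, 3.0, 4.0],
                  [1.0, 1.0, 2.0],
                  [1.0, 5.0, 0.0]])
    w = np.array([0.5, 2.0, -1.0])

    print(A @ w)    # one prediction per data point: [2.5, 0.5, 10.5]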

The Anatomy of a Matrix

In order to solve the preceding equation, we need some basic knowledge of linear algebra. For a systematic introduction to the subject, we highly recommend Strang (2006).

The equation states that when a certain matrix multiplies a certain vector, there is a certain outcome. A matrix is also called a linear operator, a name that makes it more apparent that a matrix is a little machine. This machine takes a vector as input and spits out another vector using a combination of several key operations: rotating a vector’s direction, adding or subtracting dimensions, and stretching or compressing its length. This combination can be quite powerful for manipulating shapes in the input space.

For example, as Figure A-4 shows, a 3 × 2 matrix can transform a square area in 2D into a diamond-shaped area in 3D. It does so by rotating and stretching each vector in the input space into a new vector in the output space.

Figure A-4. A 2D to 3D matrix transformation

From Vectors to Subspaces

In order to understand a linear operator, we have to look at how it morphs the input into output. Luckily, we don’t have to analyze one input vector at a time. Vectors can be organized into subspaces, and linear operators manipulate vector subspaces.

A subspace is a set of vectors that satisfies two criteria. First, if it contains a vector, then it contains the line that passes through the origin and that vector. Second, if it contains two vectors, then it contains all the linear combinations of those two vectors. A linear combination combines two types of operations: multiplying a vector by a scalar, and adding two vectors together.

One important property of a subspace is its rank or dimensionality, which is a measure of the degrees of freedom in this space. A line has rank 1, a 2D plane has rank 2, and so on. If you can imagine a multidimensional bird in our multidimensional space, then the rank of the subspace tells us in how many “independent” directions the bird could fly. “Independence” here means “linear independence”: two vectors are linearly independent if one isn’t a constant multiple of another (i.e., they are not pointing in exactly the same or opposite directions).

A subspace can be defined as the span of a set of basis vectors. (Span is a technical term that describes the set of all linear combinations of a set of vectors.) The span of a set of vectors is invariant under linear combinations (because it’s defined that way). So, if we have one set of basis vectors, then we can multiply the vectors by any nonzero constants or add the vectors to get another basis.

It would be nice to have a more unique and identifiable basis to describe a subspace. An orthonormal basis contains vectors that have unit length and are orthogonal to each other. Orthogonality is another technical term. (At least 50% of all math and science is made up of technical terms. If you don’t believe me, do a bag-of-words count on this book.) Two vectors are orthogonal to each other if their inner product is zero. For all intents and purposes, we can think of orthogonal vectors as being at 90 degrees to each other. (This is true in Euclidean space, which closely resembles our physical 3D reality.) Normalizing these vectors to have unit length turns them into a uniform set of measuring sticks.

All in all, a subspace is like a tent, and the orthogonal basis vectors are the poles at right angles that prop it up. The rank is equal to the total number of orthogonal basis vectors. Figure A-5 illustrates some of these concepts.

Figure A-5. Illustrations of four useful linear algebra concepts: inner product, linear combination, basis vectors, and orthogonal basis vectors
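
Here is a small NumPy sketch of these concepts, using made-up vectors: the inner product test for orthogonality, a linear combination, normalization to unit length, and the rank of the spanned subspace:

    import numpy as np

    # Two made-up vectors in 3D.
    a = np.array([1.0, 2.0, 2.0])
    b = np.array([2.0, -1.0, 0.0])

    # Inner product: a value of zero means the vectors are orthogonal.
    print(a @ b)                                               # 0.0

    # A linear combination of a and b stays inside the subspace span{a, b}.
    v = 3.0 * a - 2.0 * b

    # Normalizing orthogonal vectors to unit length gives an orthonormal basis.
    a_hat = a / np.linalg.norm(a)
    b_hat = b / np.linalg.norm(b)

    # The rank of the spanned subspace is the number of linearly independent vectors.
    print(np.linalg.matrix_rank(np.column_stack([a, b])))      # 2
    print(np.linalg.matrix_rank(np.column_stack([a, b, v])))   # still 2: v adds no new direction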

Singular Value Decomposition (SVD)

A matrix performs a linear transformation on the input vector. Linear transformations are very simple and constrained. It follows that a matrix can’t manipulate a subspace willy-nilly. One of the most fascinating theorems of linear algebra proves that every square matrix, no matter what numbers it contains, must map a certain set of vectors back to themselves with some scaling. In the general case of a rectangular matrix, it maps a set of input vectors into a corresponding set of output vectors, and its transpose maps those outputs back to the original inputs. The technical terminology is that square matrices have eigenvectors with eigenvalues, and rectangular matrices have left and right singular vectors with singular values.

Algebraically, the SVD of a matrix looks like this:

A = UΣV^T

where the columns of the matrices U and V form orthonormal bases of the output and input spaces, respectively. Σ is a diagonal matrix containing the singular values.
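
As a quick sanity check, here is a minimal NumPy sketch (with a made-up 3 × 2 matrix) that computes the SVD and verifies that the factors are orthonormal and reconstruct A:

    import numpy as np

    # A made-up 3 x 2 matrix.
    A = np.array([[1.0, 2.0],
                  [0.0, 1.0],
                  [1.0, 0.0]])

    # Thin SVD: U is 3 x 2, s holds the two singular values, Vt is 2 x 2.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    print(np.allclose(U.T @ U, np.eye(2)))       # True: columns of U are orthonormal
    print(np.allclose(Vt @ Vt.T, np.eye(2)))     # True: columns of V are orthonormal
    print(np.allclose(U @ np.diag(s) @ Vt, A))   # True: A = U Sigma V^T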

Geometrically, a matrix performs the following sequence of transformations:

  1. Project the input vector onto each of the right singular vectors.
  2. Scale each coordinate by the corresponding singular value.
  3. Multiply each resulting score by the corresponding left singular vector.
  4. Sum up the results.

Figure A-6 provides an illustration. The operations go from right to left for a matrix-vector multiplication. The rightmost machine rotates and potentially projects the input into a lower-dimensional space. In this illustration, the input cube becomes a flat square, and is also rotated. The next machine squeezes the square in one direction and stretches it in another; the square becomes a rectangle. The last, leftmost machine rotates the rectangle again, and projects it back out into a possibly higher-dimensional space—but it remains a flat rectangle instead of some higher-dimensional object.

Figure A-6. A matrix decomposed into three little machines: rotate, scale, rotate
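
The following sketch walks a made-up vector through the three machines one step at a time and checks that the result matches an ordinary matrix-vector product:

    import numpy as np

    # Made-up matrix and input vector.
    A = np.array([[1.0, 2.0],
                  [0.0, 1.0],
                  [1.0, 0.0]])
    x = np.array([3.0, -1.0])

    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    coords = Vt @ x          # step 1: project x onto the right singular vectors
    scaled = s * coords      # step 2: scale each coordinate by its singular value
    result = U @ scaled      # steps 3-4: weight the left singular vectors and sum

    print(np.allclose(result, A @ x))            # True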

When A is a real matrix (i.e., all of the elements are real-valued), all of the singular values and singular vectors are real-valued. A singular value can be positive or zero; by convention, singular values are never negative. The ordered set of singular values of a matrix is called its spectrum, and it reveals a lot about the matrix. The gap between the singular values affects how stable the solutions are, and the ratio between the maximum and minimum singular values (the condition number) affects how quickly an iterative solver can find the solution. Both of these properties have notable impacts on the quality of the solution one can find.
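
A short sketch of how one might inspect the spectrum and condition number of a (made-up) matrix with NumPy:

    import numpy as np

    # The spectrum of a made-up matrix: its singular values, largest first.
    A = np.array([[1.0, 2.0],
                  [0.0, 1.0],
                  [1.0, 0.0]])
    spectrum = np.linalg.svd(A, compute_uv=False)
    print(spectrum)

    # Condition number: ratio of the largest to the smallest singular value.
    print(spectrum[0] / spectrum[-1])
    print(np.linalg.cond(A))                     # the same quantity, via NumPy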

The Four Fundamental Subspaces of the Data Matrix

Another useful way to dissect a matrix is via the four fundamental subspaces: column space, row space, null space, and left null space. These four subspaces completely characterize the solutions to linear systems involving A or A^T (hence the moniker).

For the data matrix (where the rows are data points and columns are features), the four fundamental subspaces can be understood in relation to the data and features. Let’s look at them in more detail.

Column space

Mathematical definition:
The set of output vectors s where s = Aw as we vary the weight vector w.
Mathematical interpretation:
All possible linear combinations of columns.
Data interpretation:
All outcomes that are linearly predictable based on observed features. The vector w contains the weight of each feature.
Basis:
The left singular vectors corresponding to nonzero singular values (a subset of the columns of U).

Row space

Mathematical definition:
The set of output vectors r where r = u^T A as we vary the weight vector u.
Mathematical interpretation:
All possible linear combinations of rows.
Data interpretation:
A vector in the row space is something that can be represented as a linear combination of existing data points. Hence, this can be interpreted as the space of “non-novel” data. The vector u contains the weight of each data point in the linear combination.
Basis:
The right singular vectors corresponding to nonzero singular values (a subset of the columns of V).

Null space

Mathematical definition:
The set of input vectors w where Aw = 0.
Mathematical interpretation:
Vectors that are orthogonal to all rows of A. The null space gets squashed to zero by the matrix. This is the “fluff” that adds volume to the solution space of Aw = y.
Data interpretation:
“Novel” data points that cannot be represented as any linear combination of existing data points.
Basis:
The right singular vectors corresponding to the zero singular values (the rest of the columns of V).

Left null space

Mathematical definition:
The set of vectors u where u^T A = 0.
Mathematical interpretation:
Vectors that are orthogonal to all columns of A. The left null space is orthogonal to the column space.
Data interpretation:
“Novel feature vectors” that are not representable by linear combinations of existing features.
Basis:
The left singular vectors corresponding to the zero singular values (the rest of the columns of U).

Column space and row space contain what is already representable based on observed data and features. Those vectors that lie in the column space are non-novel features. Those vectors that lie in the row space are non-novel data points.

For the purposes of modeling and prediction, non-novelty is good. A full column space means that the feature set contains enough information to model any target vector we wish. A full row space means that the different data points contain enough variation to cover all possible corners of the feature space. It’s the novel data points and features—respectively contained in the null space and the left null space—that we have to worry about.

In the application of building linear models of data, the null space can also be viewed as the subspace of “novel” data points. Novelty is not a good thing in this context. Novel data points indicate phantom data that is not linearly representable by the training set. Similarly, the left null space contains novel features that are not representable as linear combinations of existing features.

The null space is orthogonal to the row space. It’s easy to see why. The definition of null space states that w has an inner product of 0 with every row vector in A. Therefore, w is orthogonal to the space spanned by these row vectors, i.e., the row space. Similarly, the left null space is orthogonal to the column space.
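
For readers who want to see this computationally, here is a sketch that reads off bases for all four subspaces from the full SVD of a small, deliberately rank-deficient data matrix (the matrix is made up for illustration):

    import numpy as np

    # A made-up, rank-deficient data matrix: rows are data points, columns are
    # features. Row 2 is twice row 1, and column 3 is column 1 plus column 2,
    # so the rank is 2.
    A = np.array([[1.0, 2.0, 3.0],
                  [2.0, 4.0, 6.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])

    U, s, Vt = np.linalg.svd(A)          # full SVD: U is 4 x 4, Vt is 3 x 3
    r = int(np.sum(s > 1e-10))           # numerical rank

    col_space  = U[:, :r]       # column space: left singular vectors, nonzero sigma
    left_null  = U[:, r:]       # left null space: remaining columns of U
    row_space  = Vt[:r, :].T    # row space: right singular vectors, nonzero sigma
    null_space = Vt[r:, :].T    # null space: remaining columns of V

    print(r)                                # 2
    print(np.allclose(A @ null_space, 0))   # True: A squashes the null space to zero
    print(np.allclose(left_null.T @ A, 0))  # True: left null space is orthogonal to the columns of A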

Solving a Linear System

Let’s tie all this math back to the problem at hand: training a linear classifier, which is intimately connected to the task of solving a linear system. We look closely at how a matrix operates because we have to reverse engineer it. In order to train a linear model, we have to find the input weight vector w that maps to the observed output targets y in the system Aw = y, where A is the data matrix.2

Let’s try to crank the machine of the linear operator in reverse. If we had the SVD of A, then we could map y onto the left singular vectors (columns of U), reverse the scaling factors (multiply by the inverse of the nonzero singular values), and finally map them back to the right singular vectors (columns of V). Ta-da! Simple, right?

This is in fact the process of computing the pseudo-inverse of A. It makes use of a key property of an orthonormal basis: the transpose is the inverse. This is why SVD is so powerful. (In practice, real linear system solvers do not use the SVD, because it’s rather expensive to compute. There are other, much cheaper ways to decompose a matrix, such as the QR, LU, or Cholesky decomposition.)
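
Here is a minimal sketch of this recipe on a made-up, full-column-rank data matrix, assuming for the moment that all singular values are nonzero; the hand-rolled answer matches what NumPy's library routines return:

    import numpy as np

    # A made-up data matrix with full column rank, and a made-up target vector.
    A = np.array([[1.0, 3.0, 4.0],
                  [1.0, 1.0, 2.0],
                  [1.0, 5.0, 0.0],
                  [1.0, 2.0, 1.0]])
    y = np.array([2.5, 0.5, 10.5, 3.5])

    # Pseudo-inverse by hand: map y through U^T, divide by the singular values
    # (all nonzero here), then map back through V.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    w = Vt.T @ ((U.T @ y) / s)

    # The same answer from library routines.
    print(np.allclose(w, np.linalg.pinv(A) @ y))                    # True
    print(np.allclose(w, np.linalg.lstsq(A, y, rcond=None)[0]))     # True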

However, we skipped one tiny little detail in our haste. What happens if a singular value is zero? We can’t take the inverse of 0 because 1/0 = ∞. This is why it’s called the pseudo-inverse. (The real inverse isn’t even defined for rectangular matrices. Only square matrices have one, and only if all of their eigenvalues are nonzero.) A singular value of zero squashes whatever input it was given; there’s no way to retrace its steps and come up with the original input.

Okay, going backward we get stuck on this one little detail. Let’s take what we’ve got and go forward again to see if we can unjam the machine. Suppose we came up with an answer to Aw = y. Let’s call it w_particular, because it’s particularly suited for y. Suppose that there are also a bunch of input vectors that A squashes to zero. Let’s take one of them and call it w_sad-trumpet, because wah wah. Then, what do you think happens when we add w_particular to w_sad-trumpet?

A(w_particular + w_sad-trumpet) = y

Amazing! So this is a solution too. In fact, any input that gets squashed to zero could be added to a particular solution and give us another solution. The general solution looks like this:

w_general = w_particular + w_homogeneous

w_particular is an exact solution to the equation Aw = y. There may or may not be such a solution. If there isn’t, then the system can only be approximately solved. If there is, then y belongs to what’s known as the column space of A. The column space is the set of vectors that A can map to, by taking linear combinations of its columns.

w_homogeneous is a solution to the equation Aw = 0. (The grown-up name for w_sad-trumpet is w_homogeneous.) This should now look familiar. The set of all w_homogeneous vectors forms the null space of A. This is the span of the right singular vectors with singular value 0.
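
A small NumPy sketch of this fact, using a made-up rank-deficient matrix whose third column is the sum of the first two, so that its null space is nontrivial:

    import numpy as np

    # A made-up rank-deficient matrix (column 3 = column 1 + column 2).
    A = np.array([[1.0, 2.0, 3.0],
                  [2.0, 4.0, 6.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])

    # Construct a target y that is exactly solvable, with a known particular solution.
    w_particular = np.array([1.0, 1.0, 0.0])
    y = A @ w_particular

    # The null space is spanned by the right singular vector with singular value 0.
    U, s, Vt = np.linalg.svd(A)
    w_homogeneous = Vt[-1, :]

    print(np.allclose(A @ w_homogeneous, 0))                           # True
    print(np.allclose(A @ (w_particular + 5 * w_homogeneous), y))      # True: another solution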

The name “null space” sounds like the destination of woe for an existential crisis. If the null space contains any vectors other than the all-zero vector, then there are infinitely many solutions to the equation Aw = y. Having too many solutions to choose from is not in itself a bad thing. Sometimes any solution will do. But if there are many possible answers, then there are many sets of features that are useful for the classification task. It becomes difficult to understand which ones are truly important.

One way to fix the problem of a large null space is to regularize the model by adding additional constraints:

Aw = y,

where w is such that w^T w = c.

This form of regularization constrains the weight vector to have a certain norm, c. The strength of this regularization is controlled by a regularization parameter, which must be tuned, as is done in our experiments.
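
In practice, the norm constraint is usually handled in its penalized (Lagrangian) form, better known as ridge regression or L2 regularization. Here is a minimal NumPy sketch with made-up data and a made-up regularization strength alpha:

    import numpy as np

    # Made-up data and regularization strength.
    rng = np.random.default_rng(0)
    A = rng.normal(size=(20, 5))
    y = rng.normal(size=20)
    alpha = 1.0

    # Closed-form ridge solution: w = (A^T A + alpha * I)^(-1) A^T y
    w_ridge = np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ y)

    # Compared to the unregularized least-squares solution, the ridge solution
    # has a smaller norm: the penalty shrinks the weight vector.
    w_ols = np.linalg.lstsq(A, y, rcond=None)[0]
    print(np.linalg.norm(w_ridge), np.linalg.norm(w_ols))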

In general, feature selection methods deal with selecting the most useful features to reduce the computational burden, decrease the amount of confusion for the model, and make the learned model more unique. This is the focus of “Feature Selection”.

Another problem is the “unevenness” of the spectrum of the data matrix. When we train a linear classifier, we care not only that there is a general solution to the linear system, but also that we can find it easily. Typically, the training process employs a solver that works by calculating a gradient of the loss function and walking downhill in small steps. When some singular values are very large and others very close to zero, the solver needs to step carefully along the directions associated with the large singular values and spend a lot of time inching along the directions associated with the small singular values to find the true answer. This “unevenness” in the spectrum is measured by the condition number of the matrix, which is basically the ratio between the largest and the smallest singular values.
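
To see the effect of conditioning on an iterative solver, here is a sketch that runs plain gradient descent on two made-up least-squares problems that share singular vectors and differ only in their spectrum; the ill-conditioned one needs far more steps to converge:

    import numpy as np

    def gradient_descent_steps(A, y, tol=1e-6, max_steps=100_000):
        # Count the gradient steps needed to get within tol of the least-squares solution.
        w_star = np.linalg.lstsq(A, y, rcond=None)[0]
        w = np.zeros(A.shape[1])
        step_size = 1.0 / np.linalg.norm(A, 2) ** 2     # safe step size: 1 / sigma_max^2
        for step in range(max_steps):
            if np.linalg.norm(w - w_star) < tol:
                return step
            w = w - step_size * (A.T @ (A @ w - y))     # gradient of 0.5 * ||A w - y||^2
        return max_steps

    # Two made-up problems that share singular vectors and differ only in spectrum.
    rng = np.random.default_rng(0)
    U, _ = np.linalg.qr(rng.normal(size=(50, 3)))       # orthonormal output directions
    Vt = np.linalg.qr(rng.normal(size=(3, 3)))[0].T     # orthonormal input directions
    y = rng.normal(size=50)

    A_well = U @ np.diag([3.0, 2.0, 1.0]) @ Vt          # condition number 3
    A_ill  = U @ np.diag([3.0, 2.0, 0.1]) @ Vt          # condition number 30

    print(gradient_descent_steps(A_well, y))            # converges in relatively few steps
    print(gradient_descent_steps(A_ill, y))             # needs many more steps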

To summarize, in order for there to be a good linear model that is relatively unique, and in order for it to be easy to find, we wish for the following:

  1. The label vector can be well approximated by a linear combination of a subset of features (column vectors). Better yet, the set of features should be linearly independent.
  2. In order for the null space to be small, the row space must be large. (This is because the two subspaces are orthogonal complements of each other.) The more linearly independent the set of data points (row vectors), the smaller the null space.
  3. In order for the solution to be easy to find, the condition number of the data matrix—the ratio between the maximum and minimum singular values—should be small.

Bibliography

Strang, Gilbert. Linear Algebra and Its Applications. 4th ed. Boston, MA: Cengage Learning, 2006.

1 Strictly speaking, the formula given here is for linear regression, not linear classification. The difference is that regression allows for real-valued target variables, whereas classification targets are usually integers that represent different classes. A regressor can be turned into a classifier via a nonlinear transform. For instance, the logistic regression classifier passes the linear transform of the input through a logistic function. Such models are called generalized linear models and have linear functions at their core. Even though this example is about classification, we use the formula for linear regression as a teaching tool, because it is much easier to analyze. The intuitions readily map to generalized linear classifiers.

2 Actually, it’s a little more complicated than that. y may not be in the column space of A, so there may not be a solution to this equation. Instead of giving up, statistical machine learning looks for an approximate solution. It defines a loss function that quantifies the quality of a solution. If the solution is exact, then the loss is 0. Small errors, small loss; big errors, big loss, and so on. The training process then looks for the best parameters that minimize this loss function. In ordinary linear regression, the loss function is called the squared residual loss, which essentially maps y to the closest point in the column space of A. Logistic regression minimizes the log loss. In both cases, and for linear models in general, the linear system Aw = y often lies at the core. Hence, our analysis here is very much relevant.
