You can’t do machine learning without math. In particular, linear algebra and calculus are essential. The goal of this appendix is to provide enough mathematical background to help you understand the code samples in the book. We don’t have nearly enough space to cover these massive topics thoroughly; if you want to understand these subjects better, we provide some suggestions for further reading.
If you’re already comfortable with linear algebra and calculus, you can safely skip this appendix altogether.
In this book, we have room to cover only a few mathematical basics. If you’re interested in learning more about the mathematical foundations of machine learning, any introductory textbook on linear algebra or calculus will cover these topics in far more depth.
Linear algebra provides tools for handling arrays of data known as vectors, matrices, and tensors. You can represent all of these objects in Python with NumPy’s array type.
Linear algebra is fundamental to machine learning. This section covers only the most basic operations, with a focus on how to implement them in NumPy.
A vector is a one-dimensional array of numbers. The size of the array is the dimension of the vector. You use NumPy arrays to represent vectors in Python code.
This isn’t the true mathematical definition of a vector, but for the purposes of our book, it’s close enough.
You can convert a list of numbers into a NumPy array with the np.array function. The shape attribute lets you check the dimension:
>>> import numpy as np
>>> x = np.array([1, 2])
>>> x
array([1, 2])
>>> x.shape
(2,)
>>> y = np.array([3, 3.1, 3.2, 3.3])
>>> y
array([3. , 3.1, 3.2, 3.3])
>>> y.shape
(4,)
Note that shape is always a tuple; this is because arrays can be multidimensional, as you’ll see in the next section.
You can access individual elements of a vector, just as if it were a Python list:
>>> x = np.array([5, 6, 7, 8])
>>> x[0]
5
>>> x[1]
6
Vectors support a few basic algebraic operations. You can add two vectors of the same dimension. The result is a third vector of the same dimension. Each element of the sum vector is the sum of the matching elements in the original vectors:
>>> x = np.array([1, 2, 3, 4])
>>> y = np.array([5, 6, 7, 8])
>>> x + y
array([ 6,  8, 10, 12])
Similarly, you can multiply two vectors element-wise with the * operator. (Element-wise means you multiply each pair of corresponding elements separately.)
>>> x = np.array([1, 2, 3, 4])
>>> y = np.array([5, 6, 7, 8])
>>> x * y
array([ 5, 12, 21, 32])
The element-wise product is also called the Hadamard product.
You can also multiply a vector with a single float (or scalar). In this case, you multiply each value in the vector by the scalar:
>>> x = np.array([1, 2, 3, 4])
>>> 0.5 * x
array([0.5, 1. , 1.5, 2. ])
Vectors support a third kind of multiplication, the dot product, or inner product. To compute the dot product, you multiply each pair of corresponding elements and sum the results. So the dot product of two vectors is a single float. The NumPy function np.dot calculates the dot product. In Python 3.5 and later, the @ operator does the same thing. (In this book, we use np.dot.)
>>> x = np.array([1, 2, 3, 4])
>>> y = np.array([4, 5, 6, 7])
>>> np.dot(x, y)
60
>>> x @ y
60
A two-dimensional array of numbers is called a matrix. You can also represent matrices with NumPy arrays. In this case, if you pass a list of lists into np.array, you get a two-dimensional matrix back:
>>> x = np.array([
...     [1, 2, 3],
...     [4, 5, 6],
... ])
>>> x
array([[1, 2, 3],
       [4, 5, 6]])
>>> x.shape
(2, 3)
Note that the shape of a matrix is a two-element tuple: first is the number of rows, and second is the number of columns. You can access single elements with a double-subscript notation: first row, then column. Alternatively, NumPy lets you pass in the indices in a [row, column] format. Both are equivalent:
>>> x = np.array([
...     [1, 2, 3],
...     [4, 5, 6],
... ])
>>> x[0][1]
2
>>> x[0, 1]
2
>>> x[1][0]
4
>>> x[1, 0]
4
You can also pull out a whole row from a matrix and get a vector:
>>> x = np.array([
...     [1, 2, 3],
...     [4, 5, 6],
... ])
>>> y = x[0]
>>> y
array([1, 2, 3])
>>> y.shape
(3,)
To pull out a column, you can use the funny-looking notation [:, n]. If it helps, think of : as Python’s list-slicing operator; so [:, n] means “get me all rows, but only column n.” Here’s an example:
>>> x = np.array([
...     [1, 2, 3],
...     [4, 5, 6],
... ])
>>> z = x[:, 1]
>>> z
array([2, 5])
Just like vectors, matrices support element-wise addition, element-wise multiplication, and scalar multiplication:
>>> x = np.array([
...     [1, 2, 3],
...     [4, 5, 6],
... ])
>>> y = np.array([
...     [3, 4, 5],
...     [6, 7, 8],
... ])
>>> x + y
array([[ 4,  6,  8],
       [10, 12, 14]])
>>> x * y
array([[ 3,  8, 15],
       [24, 35, 48]])
>>> 0.5 * x
array([[0.5, 1. , 1.5],
       [2. , 2.5, 3. ]])
Go is played on a grid; so are chess, checkers, and a variety of other classic games. Any point on the grid can contain one of a variety of different game pieces. How do you represent the contents of the board as a mathematical object? One solution is to represent the board as a stack of matrices, where each matrix is the size of the game board.
Each individual matrix in the stack is called a plane, or channel. Each channel can represent a single type of piece that can be on the game board. In Go, you might have one channel for black stones and a second channel for white stones; figure A.1 shows an example. In chess, maybe you have a channel for pawns, another channel for bishops, one for knights, and so forth. You can represent the whole stack of matrices as a single three-dimensional array; this is called a rank 3 tensor.
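To make the idea concrete, here’s a small sketch of this encoding. The board size, stone coordinates, and channel assignments below are made up for illustration; they aren’t the encoding used elsewhere in the book:

```python
import numpy as np

# A hypothetical 5x5 Go position as a rank 3 tensor:
# channel 0 holds black stones, channel 1 holds white stones.
board_size = 5
board = np.zeros((2, board_size, board_size))

black_stones = [(2, 2), (2, 3)]   # (row, col) coordinates, chosen arbitrarily
white_stones = [(3, 2)]

for row, col in black_stones:
    board[0, row, col] = 1
for row, col in white_stones:
    board[1, row, col] = 1

print(board.shape)       # (2, 5, 5)
print(board[0, 2, 2])    # 1.0: a black stone at (2, 2)
print(board[1, 2, 2])    # 0.0: no white stone there
```

Each channel is an ordinary matrix, and stacking them produces the three-dimensional array described above.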
Another common case is representing an image. Let’s say you want to represent a 128 × 64 pixel image with a NumPy array. In that case, you start with a grid corresponding to the pixels in the image. In computer graphics, you typically break a color into red, green, and blue components. So you can represent that image with a 3 × 128 × 64 tensor: you have a red channel, a green channel, and a blue channel.
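As a quick sketch, here’s how you might build such an image tensor in NumPy. Filling the red channel with ones is just an arbitrary choice to show the layout:

```python
import numpy as np

# A made-up, all-red 128x64 image in (channels, height, width) layout.
image = np.zeros((3, 128, 64))
image[0] = 1.0    # channel 0 (red) fully on; green and blue stay at 0

print(image.shape)    # (3, 128, 64)
```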
Once again, you can use np.array to construct a tensor. The shape will be a tuple with three components, and you can use subscripting to pull out individual channels:
>>> x = np.array([ [[1, 2, 3], [2, 3, 4]], [[3, 4, 5], [4, 5, 6]] ]) >>> x.shape (2, 2, 3) >>> x[0] array([[1, 2, 3], [2, 3, 4]]) >>> x[1] array([[3, 4, 5], [4, 5, 6]])
As with vectors and matrices, tensors support element-wise addition, element-wise multiplication, and scalar multiplication.
If you have an 8 × 8 grid with three channels, you could represent it with a 3 × 8 × 8 tensor or an 8 × 8 × 3 tensor. The only difference is in the way you index it. When you process the tensors with library functions, you must make sure the functions are aware of which indexing scheme you chose. The Keras library, which you use for designing neural networks, calls these two options channels_first and channels_last. For the most part, the choice doesn’t matter: you just need to pick one and stick to it consistently. In this book, we use the channels_first format.
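If you ever need to convert between the two layouts, np.transpose reorders the axes without touching the data. Here’s a minimal sketch with a zero-filled tensor:

```python
import numpy as np

# Converting channels_first to channels_last: only the axis order changes.
channels_first = np.zeros((3, 8, 8))                   # (channels, rows, cols)
channels_last = np.transpose(channels_first, (1, 2, 0))  # (rows, cols, channels)

print(channels_first.shape)   # (3, 8, 8)
print(channels_last.shape)    # (8, 8, 3)
```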
If you need motivation to pick a format, certain NVIDIA GPUs have special optimizations for the channels_first format.
In many places in the book, we use a rank 3 tensor to represent a game board. For efficiency, you may want to pass many game boards to a function at once. One solution is to pack the board tensors into a four-dimensional NumPy array: this is a tensor of rank 4. You can think of this four-dimensional array as a list of rank 3 tensors, each of which represents a single board.
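A quick sketch of the packing step, using placeholder zero-filled boards and an arbitrary batch size of 16:

```python
import numpy as np

# Pack several hypothetical 2x8x8 board tensors into one rank 4 tensor.
boards = [np.zeros((2, 8, 8)) for _ in range(16)]
batch = np.stack(boards)    # equivalently: np.array(boards)

print(batch.shape)      # (16, 2, 8, 8)
print(batch[0].shape)   # (2, 8, 8): one board back out
```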
Matrices and vectors are just special cases of tensors: a matrix is a rank 2 tensor, and a vector is a rank 1 tensor. And a rank 0 tensor is a plain old number.
Rank 4 tensors are the highest-order tensors you’ll see in this book, but NumPy can handle tensors of any rank. Visualizing high-dimensional tensors is hard, but the algebra is the same.
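In NumPy, the rank of a tensor is available as the ndim attribute, which ties the terminology above back to code:

```python
import numpy as np

print(np.array(7).ndim)                   # 0: a plain number is a rank 0 tensor
print(np.array([1, 2]).ndim)              # 1: a vector
print(np.array([[1, 2], [3, 4]]).ndim)    # 2: a matrix
print(np.zeros((4, 2, 8, 8)).ndim)        # 4: a batch of rank 3 tensors
```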
In calculus, the rate of change of a function is called its derivative. Table A.1 lists a few real-world examples.
| Quantity | Derivative |
|---|---|
| How far you’ve traveled | How fast you moved |
| How much water is in a tub | How fast the water drained out |
| How many customers you have | How many customers you gain (or lose) |
A derivative is not a fixed quantity: it’s another function that varies over time or space. On a trip in a car, you drive faster or slower at various times. But your speed is always connected to the distance you cover. If you had a precise record of where you were over time, you could go back and work out how fast you were traveling at any point in the trip. That is the derivative.
When a function is increasing, its derivative is positive. When a function is decreasing, its derivative is negative. Figure A.2 illustrates this concept. With this knowledge, you can use the derivative to find a local maximum or a local minimum. Any place the derivative is positive, you can move to the right a little bit and find a larger value. If you go past the maximum, the function must now be decreasing, and its derivative is negative. In that case, you want to move a little bit to the left. At the local maximum, the derivative will be exactly zero. The logic for finding a local minimum is identical, except you move in the opposite direction.
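You can watch this procedure converge numerically. The function, step size, and iteration count below are arbitrary choices for illustration: f(x) = -(x - 3)² has its maximum at x = 3, and its derivative is -2(x - 3).

```python
# Follow the sign of the derivative uphill to find a local maximum.
def df(x):
    return -2 * (x - 3)   # derivative of f(x) = -(x - 3)**2

x = 0.0
for _ in range(100):
    x += 0.1 * df(x)      # positive derivative: step right; negative: step left

print(round(x, 3))        # 3.0
```

Each step moves x toward the point where the derivative is zero, which is exactly the local maximum.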
Many functions that show up in machine learning take a high-dimensional vector as input and compute a single number as output. You can extend the same idea to maximize or minimize such a function. The derivative of such a function is a vector of the same dimension as its input, called a gradient. For every element of the gradient, the sign tells you which direction to move that coordinate. Following the gradient to maximize a function is called gradient ascent; if you’re minimizing, the technique is called gradient descent.
In this case, it may help to imagine the function as a contoured surface. At any point, the gradient points in the direction of the steepest upward slope of the surface.
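Here’s a toy sketch of gradient descent on a vector input. The target vector, learning rate, and iteration count are made up; the function f(v) = Σ(v - target)² has gradient 2(v - target), a vector of the same dimension as v:

```python
import numpy as np

target = np.array([1.0, -2.0, 0.5])   # arbitrary minimum location

def gradient(v):
    return 2 * (v - target)   # gradient of f(v) = sum((v - target)**2)

v = np.zeros(3)
for _ in range(200):
    v -= 0.1 * gradient(v)    # step against the gradient to minimize f

print(np.round(v, 3))         # approaches the target vector
```

Each component of the gradient tells that coordinate which way to move, so all three coordinates converge toward the target at once.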
To use gradient ascent, you must have a formula for the derivative of the function you’re trying to maximize. Most simple algebraic functions have a known derivative; you can look them up in any calculus textbook. If you define a complicated function by chaining many simple functions together, a formula known as the chain rule describes how to calculate the derivative of the complicated function. Libraries like TensorFlow and Theano take advantage of the chain rule to automatically calculate the derivative of complicated functions. If you define a complicated function in Keras, you don’t need to figure out the formula for the gradient yourself: Keras will hand off the work to TensorFlow or Theano.
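As a small sanity check of the chain rule, here’s a hand-derived derivative of a composite function compared against a finite-difference approximation. The function and evaluation point are arbitrary: h(x) = (sin x)², so the chain rule gives h′(x) = 2 · sin(x) · cos(x) (derivative of the outer function times derivative of the inner one).

```python
import math

def h(x):
    return math.sin(x) ** 2

def h_prime(x):
    return 2 * math.sin(x) * math.cos(x)   # chain rule result

# Central finite-difference approximation of the derivative.
x, eps = 0.7, 1e-6
numeric = (h(x + eps) - h(x - eps)) / (2 * eps)
print(abs(numeric - h_prime(x)) < 1e-6)    # True
```

Automatic differentiation in TensorFlow or Theano applies this same rule mechanically, node by node, through arbitrarily deep compositions.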