Appendix B. Mahout math

Many algorithms implemented by Mahout rely heavily on vector and matrix math. Mahout has its own self-contained vector and matrix math library in the math module that can be reused separately from Mahout itself. It is actually an adapted form of CERN’s Colt library (http://dsd.lbl.gov/~hoschek/colt/).

This appendix provides a practical overview of key parts of this math module that users encounter frequently while using Mahout: Vector and Matrix instances, and their associated operations. It is not exhaustive documentation; the Mahout project Javadoc and source code can provide more detail if you’re interested.

B.1. Vectors

Vector means slightly different things in different contexts, though all of them are similar. Most people encounter a vector first in physics, where it denotes a direction and is typically illustrated by an arrow. For our purposes, we sometimes use vectors to represent points instead. This isn’t such a different idea; a point merely corresponds to a vector from the origin to that point.

Most often in machine learning and Mahout, we use vectors in a more abstract sense, one that doesn’t necessarily map to a geometric interpretation. A vector can be merely a tuple, an ordered list of numbers. It has a size (or cardinality), and at each index (position) from 0 to size – 1 the vector has some numeric value. Vectors are usually written as a list of numbers like so: (2.3, 1.55, 0.0). This is a vector of size 3, whose value at index 1 is 1.55, for example. In machine learning, we sometimes deal with vectors with sizes in the millions.

B.1.1. Vector implementation

Vector, in org.apache.mahout.math, is the interface that all vector implementations share, and there are several of them. Representing vectors in a memory-efficient way, and accessing their values efficiently, are quite important to Mahout, and the best way to do these things varies according to the nature of the vector and how it’ll be used.

The sparseness or denseness of a vector’s data is the most important consideration. In many cases, a vector has a (nonzero) value at only a few indices relative to its size. The value at all other indices is implicitly zero. If we had a vector of size 1,000 and only 2 indices had a nonzero value, it wouldn’t make much sense to store 998 zeroes in memory along with those 2 nonzero values. But it wouldn’t seem so wasteful if only 100 values were zero.

A vector with relatively many nonzero values in relation to its size is called dense. Such a vector can be represented with an implementation that stores values in an array of doubles. Vector indices correspond directly to array indices. A dense vector’s advantage is speed: being array-backed, it’s quick to access and update any of its values. DenseVector provides such an implementation.

Sparse vectors call for a different implementation. To avoid consuming memory on zero entries, they store only the indices that have nonzero values, along with those values. A hash table provides a simple index-to-value mapping of this kind. Accessing a value by index is slower than direct array access, but not by much. RandomAccessSparseVector implements this idea.

The problem with a hash-backed implementation is that it becomes relatively slow to iterate through all values in order by index, and this is frequently required. An ordered mapping based on a tree structure or something similar can address this problem, because it maintains keys in order. The price of this feature is longer access time. SequentialAccessSparseVector is the third and final implementation, providing a Vector with these characteristics. Figure B.1 shows an example of dense and sparse vectors.
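To make the distinction concrete, here’s a small sketch showing how each implementation might be constructed. It assumes only the constructors that take a size (cardinality), plus DenseVector’s double[] constructor; all three classes live in org.apache.mahout.math.

Vector dense = new DenseVector(new double[] {2.3, 1.55, 0.0});  // all values stored in an array

Vector randomSparse = new RandomAccessSparseVector(1000);       // hash-backed; fast random get/set
randomSparse.set(3, 1.5);
randomSparse.set(400, 2.0);

Vector sequentialSparse = new SequentialAccessSparseVector(1000); // ordered; fast in-order iteration
sequentialSparse.set(3, 1.5);
sequentialSparse.set(400, 2.0);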

Figure B.1. A dense vector is represented as a full array, so it needs to store only the double values. A sparse vector doesn’t allocate space for zero-valued cells, so it keeps an integer to indicate the index of every nonzero double value. Stored values are boxed; others are implicit and aren’t stored explicitly.

B.1.2. Vector operations

Now we’ll examine the frequently used methods of Vector.

The get(int) and set(int, double) methods provide the basic operations to access and change vector values at an index. The getQuick(int) and setQuick(int, double) methods do the same, but without bounds checking, which becomes a useful performance optimization internally. Other callers should likely stick to the first two methods. The size() method, as expected, returns the vector’s size.
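As a quick illustration, here’s a sketch using the DenseVector implementation from section B.1.1:

Vector v = new DenseVector(3);
v.set(0, 2.3);
v.set(1, 1.55);          // index 1 now holds 1.55
double value = v.get(1); // 1.55
int size = v.size();     // 3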

Iterating over all elements is accomplished through iterator() and iterateNonZero(), which iterate over all indices and values, or only those with a nonzero value. The Iterator returned from these methods provides Element objects, encapsulating an index and value.
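Continuing with the vector v from the previous snippet, a sketch that walks only the nonzero elements, printing each index and value (Iterator here is java.util.Iterator, and Element is the nested Vector.Element interface):

Iterator<Vector.Element> it = v.iterateNonZero();
while (it.hasNext()) {
  Vector.Element e = it.next();
  System.out.println(e.index() + " -> " + e.get());
}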

The standard clone() method is implemented here to copy a Vector; implementations provide copy constructors as well. The like() method returns an empty Vector of the same class. This is a form of factory method.
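For instance, a brief sketch:

Vector original = new DenseVector(new double[] {2.3, 1.55, 0.0});
Vector copy = original.clone();   // same class, same values
Vector empty = original.like();   // same class, all values zero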

Finally, we can talk about where the mathematics comes into this math module. The plus(Vector) and minus(Vector) methods provide standard vector addition and subtraction. You can multiply or divide all values by a scalar with times(double) or divide(double). Also note methods like dot(Vector), which computes a dot product (inner product), and cross(Vector), which creates a matrix from the pair-wise value products of two vectors.
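A brief sketch of these operations, assuming two hypothetical vectors a and b of equal size; none of these methods modify their operands, and each returns a new Vector or Matrix:

Vector sum = a.plus(b);          // element-wise addition
Vector difference = a.minus(b);  // element-wise subtraction
Vector scaled = a.times(2.0);    // every value multiplied by 2.0
Vector halved = a.divide(2.0);   // every value divided by 2.0
double dotProduct = a.dot(b);    // sum of pair-wise products
Matrix outer = a.cross(b);       // matrix of pair-wise products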

B.1.3. Advanced Vector methods

So far, all of this should be straightforward and as expected. The first non-trivial method in Vector is assign(Vector, DoubleDoubleFunction). It mutates the Vector, setting each of its values to the result of applying a function to the corresponding values of this Vector and another Vector. DoubleDoubleFunction encapsulates a function of two numeric values that results in another numeric value. This method doesn’t let you do things you couldn’t otherwise accomplish with the previous basic operations, but it may let you do them more efficiently.

For example, given vectors A and B, consider computing C = mA – B, where m is a scalar value. This could be implemented as in myOperation.

Listing B.1. Computing Vector operations efficiently
Vector myOperation(Vector A, Vector B, double m) {
  // Allocates an intermediate Vector for times() and another for minus()
  return A.times(m).minus(B);
}

Vector myFasterOperation(Vector A, Vector B, final double m) {
  // Updates A in place in a single pass; assign() returns the mutated Vector
  return A.assign(B, new DoubleDoubleFunction() {
    public double apply(double a, double b) {
      return a * m - b;
    }
  });
}

Both myOperation and myFasterOperation accomplish the task. The first is simple, but slower than it appears. Each of the two mathematical operations allocates an entirely new Vector, and the Vectors are traversed twice. The second method, however, doesn’t allocate a new Vector—it destructively changes A, which may be entirely fine and desirable. It also needs only one pass over the vectors to compute the result.

The aggregate(DoubleDoubleFunction, DoubleFunction) method can simplify computing a function of all values in a vector, especially when used in conjunction with ready-made functions in Functions. For instance, to compute the sum of squares of values in a Vector, you could make use of either method in the following listing.

Listing B.2. Computing aggregations efficiently
double mySumOfSquares(Vector A) {
  double sum = 0.0;
  Iterator<Vector.Element> it = A.iterateNonZero(); // returns an Iterator, not an Iterable
  while (it.hasNext()) {
    double value = it.next().get();
    sum += value * value;
  }
  return sum;
}

double myOtherSumOfSquares(Vector A) {
  return A.aggregate(Functions.PLUS, Functions.SQUARE);
}

The second method is not, in this case, more efficient, but it is more compact.

B.2. Matrices

Matrices, unlike vectors, are unambiguous and probably familiar—you’ll likely have at least glimpsed matrix operations in a mathematics course. A matrix is a table of numbers. This simple construct has proved tremendously useful for representing concepts in physics and linear algebra—and by extension machine learning. Operations on matrices are a fundamental part of algorithms in these fields, and Mahout uses matrices in many places. Often, in Mahout, it’ll be useful to think of the rows or columns of a matrix as vectors in their own right.

A matrix is represented by an implementation of the Matrix interface in Mahout. Its implementation follows the same pattern as Vector: several different classes implement Matrix in order to provide performance characteristics optimized for different situations and usage patterns. The same sparseness consideration similarly applies to matrices. SparseMatrix and DenseMatrix represent matrices with few (nonzero) values, and with many values relative to their size, respectively.

SparseRowMatrix and SparseColumnMatrix are interesting variants on SparseMatrix that are intended for use where rows (or columns) of a matrix will frequently be accessed as a unit, as a vector. SparseRowMatrix supports use cases where a matrix has relatively few nonzero rows, but rows must be accessed as row vectors. SparseColumnMatrix behaves similarly for columns.
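As a sketch, assuming the int[]-cardinality constructors that listing B.3 below also uses, and where someRowVector stands for any existing Vector of size 1,000:

// A 1,000 x 1,000 matrix whose rows are cheap to fetch as vectors
Matrix rows = new SparseRowMatrix(new int[] {1000, 1000});
rows.assignRow(3, someRowVector);   // install a vector as row 3
Vector rowThree = rows.getRow(3);

// The column-oriented analogue
Matrix columns = new SparseColumnMatrix(new int[] {1000, 1000});
Vector firstColumn = columns.getColumn(0);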

B.2.1. Matrix operations

Many of the methods found in Vector also exist in Matrix, in a slightly different form, to accommodate the matrix’s two-dimensional nature. The get(int, int) method returns the value at a given row and column, for example. The size() method returns both the row and column size of the matrix. The like() and clone() methods are implemented as well. Label bindings for rows and columns are supported, as are convenience operations like assign(double).
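For example, a sketch using DenseMatrix; get, set, and assign behave analogously on the sparse implementations:

Matrix m = new DenseMatrix(2, 3);   // 2 rows, 3 columns
m.assign(0.0);                      // set every cell to 0.0
m.set(0, 2, 4.5);                   // row 0, column 2
double cell = m.get(0, 2);          // 4.5
Matrix emptyCopy = m.like();        // same class and shape, all zeroes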

Matrix implementations provide mathematical operations like plus(Matrix) and times(Matrix) to implement matrix addition and multiplication. Other common matrix-specific operations, like transpose() to generate the transposed matrix, and determinant() to compute the determinant, exist as well.
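A brief sketch with two hypothetical square 2 x 2 matrices a and b, so that both multiplication and the determinant are defined:

Matrix a = new DenseMatrix(2, 2);
Matrix b = new DenseMatrix(2, 2);
// ... fill a and b with values ...
Matrix sum = a.plus(b);            // element-wise addition
Matrix product = a.times(b);       // matrix multiplication (2 x 2 result)
Matrix transposed = a.transpose(); // rows become columns
double det = a.determinant();      // defined for square matrices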

Matrix operations are largely unsurprising, but they make it easy to implement otherwise complex operations in a compact way. For example, the following listing illustrates multiplying an m x n matrix against a vector of size n (which can be thought of as the lone column vector of an n x 1 matrix).

Listing B.3. Multiplying a matrix and a vector
Vector vector = new SequentialAccessSparseVector(n);
Matrix matrix = new SparseMatrix(new int[] {m, n});
// Wrap the vector as the single column of an n x 1 matrix
Matrix vectorAsMatrix = new SparseColumnMatrix(new int[] {n, 1});
vectorAsMatrix.assignColumn(0, vector);
// The result is an m x 1 matrix whose lone column is the product vector
Matrix productMatrix = matrix.times(vectorAsMatrix);
Vector product = productMatrix.getColumn(0);

This sort of operation occurs frequently in linear algebra and machine learning, and you can see how relatively simple it can be to implement it with Mahout’s math module.

B.3. Mahout math and Hadoop

Matrix and Vector implementations are frequently used in Hadoop-related mapper and reducer jobs. This means they must be serializable—convertible into a sequence of bytes that can be stored, transmitted, and reconstituted. Hadoop doesn’t employ Java’s standard serialization mechanism, java.io.Serializable, for this. Instead, it defines a very similar interface called Writable. Classes implement Writable in order to make themselves usable as keys and values in Hadoop.

Vector and Matrix do not themselves extend Writable, because this would make the math module depend on Hadoop, and conceptually it functions independently of Hadoop. Instead, within the core Mahout module, you will find VectorWritable and MatrixWritable. These implement Writable and wrap a Vector or Matrix, respectively, encapsulating the knowledge of how to serialize a vector or matrix within Hadoop. In Mapper and Reducer jobs that read or write vectors or matrices, the raw implementations are wrapped in these classes before being passed to Hadoop, and they’re returned in this form.
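For example, here’s a minimal sketch of a mapper that emits vectors, assuming the org.apache.hadoop.mapreduce API; the class name, vector size, and output key are hypothetical. The wrapping is simply new VectorWritable(vector), and VectorWritable.get() recovers the Vector on the receiving side.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class VectorEmittingMapper
    extends Mapper<LongWritable, Text, Text, VectorWritable> {

  @Override
  protected void map(LongWritable key, Text line, Context context)
      throws IOException, InterruptedException {
    Vector vector = new RandomAccessSparseVector(10000);
    // ... parse the input line and set values on the vector ...
    context.write(new Text("someKey"), new VectorWritable(vector)); // "someKey" is a placeholder
  }
}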
