List of Tables
Chapter 2. Introducing recommenders
Table 2.1. An illustration of the average difference and root-mean-square calculation
Chapter 3. Representing recommender data
Table 3.1. Illustration of default table schema for taste_preferences in MySQL
Chapter 4. Making recommendations
Table 4.1. The Pearson correlation between user 1 and other users based on the three items that user 1 has in common with the others
Table 4.2. The Euclidean distance between user 1 and other users, and the resulting similarity scores
Table 4.3. The preference values transformed into ranks, and the resulting Spearman correlation between user 1 and each of the other users
Table 4.4. The similarity values between user 1 and other users, computed using the Tanimoto coefficient. Note that preference values themselves are omitted, because they aren’t used in the computation.
Table 4.5. The similarity values between user 1 and other users, computed using the log-likelihood similarity metric
Table 4.6. Evaluation results under various ItemSimilarity metrics
Table 4.7. Average differences in preference values between all pairs of items. Cells along the diagonal are 0.0. Cells in the bottom left are simply the negative of their counterparts across the diagonal, so these aren’t represented explicitly. Some diffs don’t exist, such as 102-107, because no user expressed a preference for both 102 and 107.
Table 4.8. Summary of available recommender implementations in Mahout, their key input parameters, and key features to consider when choosing an implementation
Chapter 5. Taking recommenders to production
Table 5.1. Average absolute difference in estimated and actual preferences when evaluating a user-based recommender using one of several similarity metrics, and using a nearest-n user neighborhood
Table 5.2. Average absolute difference in estimated and actual preferences when evaluating a user-based recommender using one of several similarity metrics, and using a threshold-based user neighborhood. Some values are “not a number,” or undefined, and are denoted by Java’s NaN symbol.
Table 5.3. Average absolute differences in estimated and actual preferences, when evaluating an item-based recommender using several different similarity metrics
Chapter 6. Distributing recommendation computations
Table 6.1. The co-occurrence matrix for items in a simple example data set. The first row and column are labels and not part of the matrix.
Table 6.2. Multiplying the co-occurrence matrix with user 3’s preference vector (U3) to produce a vector that leads to recommendations, R
Chapter 7. Introduction to clustering
Table 7.1. Result of clustering using various distance measures
Chapter 8. Representing data
Table 8.1. Set of apples of different weight, sizes, and colors converted to vectors
Table 8.2. Important flags for the Mahout dictionary-based vectorizer and their default values
Chapter 9. Clustering algorithms in Mahout
Table 9.1. Top five words in selected topics from LDA topic modeling of Reuters news data
Table 9.2. Top five words in selected topics from LDA topic modeling after increased smoothing is applied
Table 9.3. The different clustering algorithms in Mahout, their entry-point classes, and their properties
Chapter 10. Evaluating and improving clustering quality
Table 10.1. Flags of the Mahout ClusterDumper tool and their default values
Chapter 13. Introduction to classification
Table 13.1. Mahout is most useful with extremely large or rapidly growing data sets where other solutions are least feasible.
Table 13.2. Terminology for the key ideas in classification
Table 13.3. Four common types of values used to represent features
Table 13.4. Sample data that illustrates all four value types. These examples are typical of features of email data.
Table 13.5. Workflow in a typical classification project
Table 13.6. Fields used in the donut.csv data file
Table 13.7. Command-line options for the trainlogistic program
Table 13.8. Command-line options for the runlogistic program
Chapter 14. Training a classifier
Table 14.1. Approaches to encoding classifiable data as a vector
Table 14.2. The most common headers found in the 20 newsgroups articles. A few of the less common headers are included as well.
Table 14.3. Characteristics of the Mahout learning algorithms used for classification
Chapter 15. Evaluating and tuning a classifier
Table 15.1. Data from two hypothetical classifiers to show some of the limitations of just looking at percent correct. The columns show the frequency of each possible model output. Each row contains data for a particular correct value, and the answer with the highest score is in bold. Model 1 is never quite right, but may still be useful, whereas model 2 is like a stopped clock.
Table 15.2. Mahout supports a variety of classifier performance metrics through multiple APIs.
Table 15.3. The Mahout classes that support performance evaluation for classifiers
Table 15.4. How bad tokens can defeat feature extraction
Table 15.5. Configuration methods for SGD learning classes
Appendix A. JVM tuning
Table A.1. Key JVM tuning parameters for recommender engines