1.2. Mahout’s machine learning themes
1.3. Tackling large scale with Mahout and Hadoop
Chapter 2. Introducing recommenders
2.2. Running a first recommender engine
2.3.1. Training data and scoring
2.4. Evaluating precision and recall
2.5. Evaluating the GroupLens data set
Chapter 3. Representing recommender data
3.1. Representing preference data
3.1.2. PreferenceArray and implementations
3.3. Coping without preference values
Chapter 4. Making recommendations
4.1. Understanding user-based recommendation
4.2. Exploring the user-based recommender
4.2.2. Implementing the algorithm with GenericUserBasedRecommender
4.2.3. Exploring with GroupLens
4.2.4. Exploring user neighborhoods
4.3. Exploring similarity metrics
4.3.1. Pearson correlation–based similarity
4.3.2. Pearson correlation problems
4.3.4. Defining similarity by Euclidean distance
4.3.5. Adapting the cosine measure similarity
4.3.6. Defining similarity by relative rank with the Spearman correlation
4.3.7. Ignoring preference values in similarity with the Tanimoto coefficient
4.3.8. Computing smarter similarity with a log-likelihood test
4.4. Item-based recommendation
4.6. New and experimental recommenders
4.6.1. Singular value decomposition–based recommenders
4.7. Comparison to other recommenders
Chapter 5. Taking recommenders to production
5.1. Analyzing example data from a dating site
5.2. Finding an effective recommender
5.2.1. User-based recommenders
5.2.2. Item-based recommenders
5.3. Injecting domain-specific information
5.3.1. Employing a custom item similarity metric
5.3.2. Recommending based on content
5.3.3. Modifying recommendations with IDRescorer
5.4. Recommending to anonymous users
5.5. Creating a web-enabled recommender
Chapter 6. Distributing recommendation computations
6.1. Analyzing the Wikipedia data set
6.1.2. Evaluating benefits and drawbacks of distributing computations
6.2. Designing a distributed item-based algorithm
6.2.1. Constructing a co-occurrence matrix
6.2.3. Producing the recommendations
6.3. Implementing a distributed algorithm with MapReduce
6.3.2. Translating to MapReduce: generating user vectors
6.3.3. Translating to MapReduce: calculating co-occurrence
6.3.4. Translating to MapReduce: rethinking matrix multiplication
6.3.5. Translating to MapReduce: matrix multiplication by partial products
6.4. Running MapReduces with Hadoop
6.5. Pseudo-distributing a recommender
6.6. Looking beyond first steps with recommendations
Chapter 7. Introduction to clustering
7.2. Measuring the similarity of items
7.3. Hello World: running a simple clustering example
7.4. Exploring distance measures
7.4.1. Euclidean distance measure
7.4.2. Squared Euclidean distance measure
7.4.3. Manhattan distance measure
7.4.4. Cosine distance measure
7.5. Hello World again! Trying out various distance measures
8.2. Representing text documents as vectors
8.2.1. Improving weighting with TF-IDF
8.2.2. Accounting for word dependencies with n-gram collocations
8.3. Generating vectors from documents
Chapter 9. Clustering algorithms in Mahout
9.1.1. All you need to know about k-means
9.1.2. Running k-means clustering
9.2. Beyond k-means: an overview of clustering techniques
9.3.1. Running fuzzy k-means clustering
9.3.2. How fuzzy is too fuzzy?
9.3.3. Case study: clustering news articles using fuzzy k-means
9.5. Topic modeling using latent Dirichlet allocation (LDA)
9.5.1. Understanding latent Dirichlet allocation
9.5.3. Tuning the parameters of LDA
Chapter 10. Evaluating and improving clustering quality
10.1. Inspecting clustering output
10.2. Analyzing clustering output
10.2.1. Distance measure and feature selection
10.3. Improving clustering quality
Chapter 11. Taking clustering to production
11.1. Quick-start tutorial for running clustering on Hadoop
11.2. Tuning clustering performance
11.2.1. Avoiding performance pitfalls in CPU-bound operations
11.2.2. Avoiding performance pitfalls in I/O-bound operations
11.3. Batch and online clustering
Chapter 12. Real-world applications of clustering
12.1. Finding similar users on Twitter
12.2. Suggesting tags for artists on Last.fm
12.2.1. Tag suggestion using co-occurrence
12.2.2. Creating a dictionary of Last.fm artists
12.2.3. Converting Last.fm tags into Vectors with musicians as features
12.3. Analyzing the Stack Overflow data set
Chapter 13. Introduction to classification
13.1. Why use Mahout for classification?
13.2. The fundamentals of classification systems
13.2.1. Differences between classification, recommendation, and clustering
13.3. How classification works
13.3.2. Training versus test versus production
13.3.3. Predictor variables versus target variable
13.3.4. Records, fields, and values
13.4. Workflow in a typical classification project
13.4.1. Workflow for stage 1: training the classification model
13.4.2. Workflow for stage 2: evaluating the classification model
13.5. Step-by-step simple classification example
13.5.1. The data and the challenge
13.5.2. Training a model to find color-fill: preliminary thinking
Chapter 14. Training a classifier
14.1. Extracting features to build a Mahout classifier
14.2. Preprocessing raw data into classifiable data
14.3. Converting classifiable data into vectors
14.4. Classifying the 20 newsgroups data set with SGD
14.4.1. Getting started: previewing the data set
14.4.2. Parsing and tokenizing features for the 20 newsgroups data
14.5. Choosing an algorithm to train the classifier
14.5.1. Nonparallel but powerful: using SGD and SVM
14.5.2. The power of the naive classifier: using naive Bayes and complementary naive Bayes
14.5.3. Strength in elaborate structure: using random forests
14.6. Classifying the 20 newsgroups data with naive Bayes
14.6.1. Getting started: data extraction for naive Bayes
Chapter 15. Evaluating and tuning a classifier
15.1. Classifier evaluation in Mahout
15.1.1. Getting rapid feedback
15.2. The classifier evaluation API
15.2.2. Confusion matrices and entropy matrices
15.2.3. Computing average log likelihood
15.2.5. Performance of the SGD classifier with 20 newsgroups
15.4. Tuning for better performance
Chapter 16. Deploying a classifier
16.1. Process for deployment in huge systems
16.1.2. Optimize feature extraction as needed
16.2. Determining scale and speed requirements
16.3. Building a training pipeline for large systems
16.3.1. Acquiring and retaining large-scale data
16.4. Integrating a Mahout classifier
16.5. Example: a Thrift-based classification server
Chapter 17. Case study: Shop It To Me
17.1. Why Shop It To Me chose Mahout
17.1.1. What Shop It To Me does
17.2. General structure of the email marketing system
17.3.1. Defining the goal of the classification project
17.4. Speeding up classification