Table of Contents

Copyright

Brief Table of Contents

Table of Contents

Preface

Acknowledgments

About this Book

About Multimedia Extras

About the Cover Illustration

Chapter 1. Meet Apache Mahout

1.1. Mahout’s story

1.2. Mahout’s machine learning themes

1.2.1. Recommender engines

1.2.2. Clustering

1.2.3. Classification

1.3. Tackling large scale with Mahout and Hadoop

1.4. Setting up Mahout

1.4.1. Java and IDEs

1.4.2. Installing Maven

1.4.3. Installing Mahout

1.4.4. Installing Hadoop

1.5. Summary

Part 1. Recommendations

Chapter 2. Introducing recommenders

2.1. Defining recommendation

2.2. Running a first recommender engine

2.2.1. Creating the input

2.2.2. Creating a recommender

2.2.3. Analyzing the output

2.3. Evaluating a recommender

2.3.1. Training data and scoring

2.3.2. Running RecommenderEvaluator

2.3.3. Assessing the result

2.4. Evaluating precision and recall

2.4.1. Running RecommenderIRStatsEvaluator

2.4.2. Problems with precision and recall

2.5. Evaluating the GroupLens data set

2.5.1. Extracting the recommender input

2.5.2. Experimenting with other recommenders

2.6. Summary

Chapter 3. Representing recommender data

3.1. Representing preference data

3.1.1. The Preference object

3.1.2. PreferenceArray and implementations

3.1.3. Speeding up collections

3.1.4. FastByIDMap and FastIDSet

3.2. In-memory DataModels

3.2.1. GenericDataModel

3.2.2. File-based data

3.2.3. Refreshable components

3.2.4. Update files

3.2.5. Database-based data

3.2.6. JDBC and MySQL

3.2.7. Configuring via JNDI

3.2.8. Configuring programmatically

3.3. Coping without preference values

3.3.1. When to ignore values

3.3.2. In-memory representations without preference values

3.3.3. Selecting compatible implementations

3.4. Summary

Chapter 4. Making recommendations

4.1. Understanding user-based recommendation

4.1.1. When recommendation goes wrong

4.1.2. When recommendation goes right

4.2. Exploring the user-based recommender

4.2.1. The algorithm

4.2.2. Implementing the algorithm with GenericUserBasedRecommender

4.2.3. Exploring with GroupLens

4.2.4. Exploring user neighborhoods

4.2.5. Fixed-size neighborhoods

4.2.6. Threshold-based neighborhoods

4.3. Exploring similarity metrics

4.3.1. Pearson correlation–based similarity

4.3.2. Pearson correlation problems

4.3.3. Employing weighting

4.3.4. Defining similarity by Euclidean distance

4.3.5. Adapting the cosine measure similarity

4.3.6. Defining similarity by relative rank with the Spearman correlation

4.3.7. Ignoring preference values in similarity with the Tanimoto coefficient

4.3.8. Computing smarter similarity with a log-likelihood test

4.3.9. Inferring preferences

4.4. Item-based recommendation

4.4.1. The algorithm

4.4.2. Exploring the item-based recommender

4.5. Slope-one recommender

4.5.1. The algorithm

4.5.2. Slope-one in practice

4.5.3. DiffStorage and memory considerations

4.5.4. Distributing the precomputation

4.6. New and experimental recommenders

4.6.1. Singular value decomposition–based recommenders

4.6.2. Linear interpolation item–based recommendation

4.6.3. Cluster-based recommendation

4.7. Comparison to other recommenders

4.7.1. Injecting content-based techniques into Mahout

4.7.2. Looking deeper into content-based recommendation

4.8. Comparison to model-based recommenders

4.9. Summary

Chapter 5. Taking recommenders to production

5.1. Analyzing example data from a dating site

5.2. Finding an effective recommender

5.2.1. User-based recommenders

5.2.2. Item-based recommenders

5.2.3. Slope-one recommender

5.2.4. Evaluating precision and recall

5.2.5. Evaluating performance

5.3. Injecting domain-specific information

5.3.1. Employing a custom item similarity metric

5.3.2. Recommending based on content

5.3.3. Modifying recommendations with IDRescorer

5.3.4. Incorporating gender in an IDRescorer

5.3.5. Packaging a custom recommender

5.4. Recommending to anonymous users

5.4.1. Temporary users with PlusAnonymousUserDataModel

5.4.2. Aggregating anonymous users

5.5. Creating a web-enabled recommender

5.5.1. Packaging a WAR file

5.5.2. Testing deployment

5.6. Updating and monitoring the recommender

5.7. Summary

Chapter 6. Distributing recommendation computations

6.1. Analyzing the Wikipedia data set

6.1.1. Struggling with scale

6.1.2. Evaluating benefits and drawbacks of distributing computations

6.2. Designing a distributed item-based algorithm

6.2.1. Constructing a co-occurrence matrix

6.2.2. Computing user vectors

6.2.3. Producing the recommendations

6.2.4. Understanding the results

6.2.5. Towards a distributed implementation

6.3. Implementing a distributed algorithm with MapReduce

6.3.1. Introducing MapReduce

6.3.2. Translating to MapReduce: generating user vectors

6.3.3. Translating to MapReduce: calculating co-occurrence

6.3.4. Translating to MapReduce: rethinking matrix multiplication

6.3.5. Translating to MapReduce: matrix multiplication by partial products

6.3.6. Translating to MapReduce: making recommendations

6.4. Running MapReduces with Hadoop

6.4.1. Setting up Hadoop

6.4.2. Running recommendations with Hadoop

6.4.3. Configuring mappers and reducers

6.5. Pseudo-distributing a recommender

6.6. Looking beyond first steps with recommendations

6.6.1. Running in the cloud

6.6.2. Imagining unconventional uses of recommendations

6.7. Summary

Part 2. Clustering

Chapter 7. Introduction to clustering

7.1. Clustering basics

7.2. Measuring the similarity of items

7.3. Hello World: running a simple clustering example

7.3.1. Creating the input

7.3.2. Using Mahout clustering

7.3.3. Analyzing the output

7.4. Exploring distance measures

7.4.1. Euclidean distance measure

7.4.2. Squared Euclidean distance measure

7.4.3. Manhattan distance measure

7.4.4. Cosine distance measure

7.4.5. Tanimoto distance measure

7.4.6. Weighted distance measure

7.5. Hello World again! Trying out various distance measures

7.6. Summary

Chapter 8. Representing data

8.1. Visualizing vectors

8.1.1. Transforming data into vectors

8.1.2. Preparing vectors for use by Mahout

8.2. Representing text documents as vectors

8.2.1. Improving weighting with TF-IDF

8.2.2. Accounting for word dependencies with n-gram collocations

8.3. Generating vectors from documents

8.4. Improving quality of vectors using normalization

8.5. Summary

Chapter 9. Clustering algorithms in Mahout

9.1. K-means clustering

9.1.1. All you need to know about k-means

9.1.2. Running k-means clustering

9.1.3. Finding the perfect k using canopy clustering

9.1.4. Case study: clustering news articles using k-means

9.2. Beyond k-means: an overview of clustering techniques

9.2.1. Different kinds of clustering problems

9.2.2. Different clustering approaches

9.3. Fuzzy k-means clustering

9.3.1. Running fuzzy k-means clustering

9.3.2. How fuzzy is too fuzzy?

9.3.3. Case study: clustering news articles using fuzzy k-means

9.4. Model-based clustering

9.4.1. Deficiencies of k-means

9.4.2. Dirichlet clustering

9.4.3. Running a model-based clustering example

9.5. Topic modeling using latent Dirichlet allocation (LDA)

9.5.1. Understanding latent Dirichlet allocation

9.5.2. TF-IDF vs. LDA

9.5.3. Tuning the parameters of LDA

9.5.4. Case study: finding topics in news documents

9.5.5. Applications of topic modeling

9.6. Summary

Chapter 10. Evaluating and improving clustering quality

10.1. Inspecting clustering output

10.2. Analyzing clustering output

10.2.1. Distance measure and feature selection

10.2.2. Inter-cluster and intra-cluster distances

10.2.3. Mixed and overlapping clusters

10.3. Improving clustering quality

10.3.1. Improving document vector generation

10.3.2. Writing a custom distance measure

10.4. Summary

Chapter 11. Taking clustering to production

11.1. Quick-start tutorial for running clustering on Hadoop

11.1.1. Running clustering on a local Hadoop cluster

11.1.2. Customizing Hadoop configurations

11.2. Tuning clustering performance

11.2.1. Avoiding performance pitfalls in CPU-bound operations

11.2.2. Avoiding performance pitfalls in I/O-bound operations

11.3. Batch and online clustering

11.3.1. Case study: online news clustering

11.3.2. Case study: clustering Wikipedia articles

11.4. Summary

Chapter 12. Real-world applications of clustering

12.1. Finding similar users on Twitter

12.1.1. Data preprocessing and feature weighting

12.1.2. Avoiding common pitfalls in feature selection

12.2. Suggesting tags for artists on Last.fm

12.2.1. Tag suggestion using co-occurrence

12.2.2. Creating a dictionary of Last.fm artists

12.2.3. Converting Last.fm tags into Vectors with musicians as features

12.2.4. Running k-means over the Last.fm data

12.3. Analyzing the Stack Overflow data set

12.3.1. Parsing the Stack Overflow data set

12.3.2. Finding clustering problems in Stack Overflow

12.4. Summary

Part 3. Classification

Chapter 13. Introduction to classification

13.1. Why use Mahout for classification?

13.2. The fundamentals of classification systems

13.2.1. Differences between classification, recommendation, and clustering

13.2.2. Applications of classification

13.3. How classification works

13.3.1. Models

13.3.2. Training versus test versus production

13.3.3. Predictor variables versus target variable

13.3.4. Records, fields, and values

13.3.5. The four types of values for predictor variables

13.3.6. Supervised versus unsupervised learning

13.4. Work flow in a typical classification project

13.4.1. Workflow for stage 1: training the classification model

13.4.2. Workflow for stage 2: evaluating the classification model

13.4.3. Workflow for stage 3: using the model in production

13.5. Step-by-step simple classification example

13.5.1. The data and the challenge

13.5.2. Training a model to find color-fill: preliminary thinking

13.5.3. Choosing a learning algorithm to train the model

13.5.4. Improving performance of the color-fill classifier

13.6. Summary

Chapter 14. Training a classifier

14.1. Extracting features to build a Mahout classifier

14.2. Preprocessing raw data into classifiable data

14.2.1. Transforming raw data

14.2.2. Computational marketing example

14.3. Converting classifiable data into vectors

14.3.1. Representing data as a vector

14.3.2. Feature hashing with Mahout APIs

14.4. Classifying the 20 newsgroups data set with SGD

14.4.1. Getting started: previewing the data set

14.4.2. Parsing and tokenizing features for the 20 newsgroups data

14.4.3. Training code for the 20 newsgroups data

14.5. Choosing an algorithm to train the classifier

14.5.1. Nonparallel but powerful: using SGD and SVM

14.5.2. The power of the naive classifier: using naive Bayes and complementary naive Bayes

14.5.3. Strength in elaborate structure: using random forests

14.6. Classifying the 20 newsgroups data with naive Bayes

14.6.1. Getting started: data extraction for naive Bayes

14.6.2. Training the naive Bayes classifier

14.6.3. Testing a naive Bayes model

14.7. Summary

Chapter 15. Evaluating and tuning a classifier

15.1. Classifier evaluation in Mahout

15.1.1. Getting rapid feedback

15.1.2. Deciding what “good” means

15.1.3. Recognizing the difference in cost of errors

15.2. The classifier evaluation API

15.2.1. Computation of AUC

15.2.2. Confusion matrices and entropy matrices

15.2.3. Computing average log likelihood

15.2.4. Dissecting a model

15.2.5. Performance of the SGD classifier with 20 newsgroups

15.3. When classifiers go bad

15.3.1. Target leaks

15.3.2. Broken feature extraction

15.4. Tuning for better performance

15.4.1. Tuning the problem

15.4.2. Tuning the classifier

15.5. Summary

Chapter 16. Deploying a classifier

16.1. Process for deployment in huge systems

16.1.1. Scope out the problem

16.1.2. Optimize feature extraction as needed

16.1.3. Optimize vector encoding as needed

16.1.4. Deploy a scalable classifier service

16.2. Determining scale and speed requirements

16.2.1. How big is big?

16.2.2. Balancing big versus fast

16.3. Building a training pipeline for large systems

16.3.1. Acquiring and retaining large-scale data

16.3.2. Denormalizing and downsampling

16.3.3. Training pitfalls

16.3.4. Reading and encoding data at speed

16.4. Integrating a Mahout classifier

16.4.1. Plan ahead: key issues for integration

16.4.2. Model serialization

16.5. Example: a Thrift-based classification server

16.5.1. Running the classification server

16.5.2. Accessing the classifier service

16.6. Summary

Chapter 17. Case study: Shop It To Me

17.1. Why Shop It To Me chose Mahout

17.1.1. What Shop It To Me does

17.1.2. Why Shop It To Me needed a classification system

17.1.3. Mahout outscales the rest

17.2. General structure of the email marketing system

17.3. Training the model

17.3.1. Defining the goal of the classification project

17.3.2. Partitioning by time

17.3.3. Avoiding target leaks

17.3.4. Learning algorithm tweaks

17.3.5. Feature vector encoding

17.4. Speeding up classification

17.4.1. Linear combination of feature vectors

17.4.2. Linear expansion of model score

17.5. Summary

Appendix A. JVM tuning

Appendix B. Mahout math

B.1. Vectors

B.1.1. Vector implementation

B.1.2. Vector operations

B.1.3. Advanced Vector methods

B.2. Matrices

B.2.1. Matrix operations

B.3. Mahout math and Hadoop

Appendix C. Resources

Sources

Index

List of Figures

List of Tables

List of Listings
