A classifier assigns each input to one of N classes based on some properties (features) of the input. Classifiers have widespread applications, such as e-mail spam filtering, finding promising products, selecting customers for closer interaction, and making decisions in machine learning settings. Let us explore how to implement a classifier using a large dataset. For instance, a spam filter assigns each e-mail to one of two classes: spam mail or not a spam mail.
There are many classification algorithms. One of the simplest, but effective, algorithms is the Naive Bayesian classifier, which uses Bayes' theorem. You can find more information about the Bayesian classifier at http://en.wikipedia.org/wiki/Naive_Bayes_classifier and about Bayes' theorem at http://betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/.
For this recipe, we will again focus on the Amazon purchase dataset used in the earlier recipes. We will look at several features of a product, such as the number of reviews received, the number of positive ratings, and the number of known similar items, to predict whether the product will fall within the first 10,000 sales ranks. We will use the Naive Bayesian classifier for the classification.
The following step describes how to prepare to run the Naive Bayesian example:

1. Set the HADOOP_HOME environment variable to refer to your Hadoop installation directory.

The following steps describe how to run the Naive Bayesian example:

1. Download the Amazon dataset as described in the earlier recipes of this chapter. We will call the directory that holds it DATA_DIR.
2. Upload the data to HDFS by running the following commands from HADOOP_HOME. If the /data directory is already there, clean it up:
   $ bin/hadoop dfs -mkdir /data
   $ bin/hadoop dfs -mkdir /data/input1
   $ bin/hadoop dfs -put <DATA_DIR>/amazon-meta.txt /data/input1
3. Unzip the source code for this chapter (chapter8.zip). We will call that folder CHAPTER_8_SRC.
4. Change the hadoop.home property in the CHAPTER_8_SRC/build.xml file to point to your Hadoop installation directory.
5. Compile the source by running the ant build command from the CHAPTER_8_SRC directory.
6. Copy the resulting build/lib/hadoop-cookbook-chapter8.jar to your HADOOP_HOME.
7. Run the MapReduce job by running the following command from HADOOP_HOME:
   $ bin/hadoop jar hadoop-cookbook-chapter8.jar chapter8.NavieBayesProductClassifer /data/input1 /data/output5
8. Read the results by running the following command:
   $ bin/hadoop dfs -cat /data/output5/*
9. You will see that it prints results similar to the following:
   postiveReviews>30 0.635593220338983
   reviewCount>60 0.62890625
   similarItemCount>150 0.5720620842572062
10. Test the classifier by running the following command:
   $ bin/hadoop -cp hadoop-cookbook-chapter8.jar chapter8.NavieBayesProductClassifer
The goal of this recipe is to look at some properties of a product and predict whether it will fall within the first 10,000 products at Amazon by sales rank. We call these properties features, and for this sample we will focus on the following three:
1. Number of reviews received (p.reviewCount)
2. Number of positive reviews (p.positiveReviews)
3. Number of known similar items (p.similarItemCount)

In the following discussion, we will write P(p.salesrank < 10000) to mean the probability that the given item p is within the first 10,000 products, and similarly for the other properties.

In this recipe, given a new product p, we want to find P(p.salesrank < 10000) based on the statistical distribution of these features among known products. Furthermore, we need to use MapReduce for the calculations.
The first step is to understand the equation for the Naive Bayes classifier. If Ap, Bp, and Cp are any three independent events (for example, Ap could be the event p.reviewCount > 60) about a product p, and a, b, and c are defined as follows, then we can write the equation for the Naive Bayes classifier:

a = P(Ap | p.salesrank < 10000)
b = P(Bp | p.salesrank < 10000)
c = P(Cp | p.salesrank < 10000)

Here the vertical bar denotes conditional probability. For example, we read the first line as: a is the probability of the event Ap occurring given that p.salesrank < 10000 has already occurred.

Then, using Bayes' theorem, we can write the following equation. It gives the probability that the product has a sales rank within the first 10,000, given the three independent events Ap, Bp, and Cp:

P(p.salesrank < 10000 | Ap and Bp and Cp) = abc / (abc + (1-a)(1-b)(1-c))
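As a quick sanity check on this formula, the following standalone sketch (the class name and the values for a, b, and c are made up for illustration, not figures from the dataset) combines three per-feature conditional probabilities. Note that this form of the formula implicitly assumes the two classes are equally likely a priori:

```java
public class BayesCombineExample {

    // Combines three conditional probabilities using
    // p = abc / (abc + (1-a)(1-b)(1-c)), the same expression
    // used by the classifier in this recipe.
    static double combine(double a, double b, double c) {
        double agree = a * b * c;
        double disagree = (1 - a) * (1 - b) * (1 - c);
        return agree / (agree + disagree);
    }

    public static void main(String[] args) {
        // Three features that each weakly favor the top-10,000 class
        System.out.println(combine(0.6, 0.6, 0.6));  // roughly 0.771
        // Uninformative features leave the answer at the prior
        System.out.println(combine(0.5, 0.5, 0.5));  // prints 0.5
    }
}
```

Notice how three individually weak signals (0.6 each) combine into a much stronger one; this multiplicative reinforcement is what makes the Naive Bayes combination useful even with crude features.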
Now let us focus on the real calculation. As Ap, Bp, and Cp, we will use the following three events:

1. Ap: p.reviewCount > 60
2. Bp: p.positiveReviews > 30
3. Cp: p.similarItemCount > 150
Then, we first run the MapReduce task to calculate the probabilities a = P(Ap | p.salesrank < 10000), b = P(Bp | p.salesrank < 10000), and c = P(Cp | p.salesrank < 10000). Then we use those values with the above formula to classify a given product. You can find the source for the classifier in src/chapter8/NavieBayesProductClassifer.java. The mapper function looks like the following:
private static final Text TOTAL = new Text("total");
private static final Text RCOUNT_GT_60 = new Text("reviewCount>60");
private static final Text PREIVEWS_GT_30 = new Text("postiveReviews>30");
private static final Text SIMILAR_ITEMS_GT_150 = new Text("similarItemCount>150");

public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    List<AmazonCustomer> customerList =
            AmazonCustomer.parseAItemLine(value.toString());
    int salesRank = -1;
    int reviewCount = 0;
    int postiveReviews = 0;
    int similarItemCount = 0;
    for (AmazonCustomer customer : customerList) {
        ItemData itemData = customer.itemsBrought.iterator().next();
        reviewCount++;
        if (itemData.rating > 3) {
            postiveReviews++;
        }
        similarItemCount = similarItemCount + itemData.similarItems.size();
        if (salesRank == -1) {
            salesRank = itemData.salesrank;
        }
    }
    boolean isInFirst10k = (salesRank <= 10000);
    context.write(TOTAL, new BooleanWritable(isInFirst10k));
    if (reviewCount > 60) {
        context.write(RCOUNT_GT_60, new BooleanWritable(isInFirst10k));
    }
    if (postiveReviews > 30) {
        context.write(PREIVEWS_GT_30, new BooleanWritable(isInFirst10k));
    }
    if (similarItemCount > 150) {
        context.write(SIMILAR_ITEMS_GT_150, new BooleanWritable(isInFirst10k));
    }
}
The mapper function walks through each product, and for each product it evaluates the features. If a feature evaluates to true, it emits the feature name as the key and whether the product is within the first 10,000 products as the value.
MapReduce invokes the reducer once for each feature. Each reduce call receives all the values emitted for that feature, and it calculates the probability that the product is within the first 10,000 products of the sales rank, given that the feature is true.
public void reduce(Text key, Iterable<BooleanWritable> values, Context context)
        throws IOException, InterruptedException {
    int total = 0;
    int matches = 0;
    for (BooleanWritable value : values) {
        total++;
        if (value.get()) {
            matches++;
        }
    }
    context.write(new Text(key), new DoubleWritable((double) matches / total));
}
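To see the fraction the reducer computes in isolation, here is a minimal standalone sketch; the class name is invented, and a plain boolean[] stands in for the stream of BooleanWritable values so there are no Hadoop dependencies:

```java
public class ReducerMathExample {

    // Same arithmetic as the reduce function above: the fraction of
    // occurrences of a feature that belong to products in the first 10,000.
    static double probability(boolean[] isInFirst10k) {
        int total = 0;
        int matches = 0;
        for (boolean v : isInFirst10k) {
            total++;
            if (v) {
                matches++;
            }
        }
        return (double) matches / total;
    }

    public static void main(String[] args) {
        // Hypothetical values collected under one feature key,
        // for example "reviewCount>60"
        boolean[] values = {true, true, false, true};
        System.out.println(probability(values));  // prints 0.75
    }
}
```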
Given a product, we examine it and decide whether each of the three events Ap, Bp, and Cp has occurred; that is, whether reviewCount > 60, positiveReviews > 30, and similarItemCount > 150. Using those decisions, we calculate a, b, and c accordingly from the probabilities P1, P2, and P3 calculated using the MapReduce task. The following code implements this logic:
public static boolean classifyItem(int similarItemCount, int reviewCount, int postiveReviews) {
    double reviewCountGT60 = 0.8;
    double postiveReviewsGT30 = 0.9;
    double similarItemCountGT150 = 0.7;
    double a, b, c;
    if (reviewCount > 60) {
        a = reviewCountGT60;
    } else {
        a = 1 - reviewCountGT60;
    }
    if (postiveReviews > 30) {
        b = postiveReviewsGT30;
    } else {
        b = 1 - postiveReviewsGT30;
    }
    if (similarItemCount > 150) {
        c = similarItemCountGT150;
    } else {
        c = 1 - similarItemCountGT150;
    }
    double p = a * b * c / (a * b * c + (1 - a) * (1 - b) * (1 - c));
    return p > 0.5;
}
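As a usage sketch, the following standalone copy of the classifyItem logic (same constants as above; the class name and the feature counts are invented for illustration) classifies two hypothetical products:

```java
public class ClassifyItemExample {

    // Standalone copy of the classifyItem logic shown above,
    // with the same hardcoded feature probabilities.
    public static boolean classifyItem(int similarItemCount, int reviewCount, int postiveReviews) {
        double reviewCountGT60 = 0.8;
        double postiveReviewsGT30 = 0.9;
        double similarItemCountGT150 = 0.7;
        double a = (reviewCount > 60) ? reviewCountGT60 : 1 - reviewCountGT60;
        double b = (postiveReviews > 30) ? postiveReviewsGT30 : 1 - postiveReviewsGT30;
        double c = (similarItemCount > 150) ? similarItemCountGT150 : 1 - similarItemCountGT150;
        double p = a * b * c / (a * b * c + (1 - a) * (1 - b) * (1 - c));
        return p > 0.5;
    }

    public static void main(String[] args) {
        // A heavily reviewed, well-liked product with many similar items
        System.out.println(classifyItem(200, 100, 50));  // prints true
        // A product with little activity on all three features
        System.out.println(classifyItem(10, 5, 2));      // prints false
    }
}
```

Because all three features agree in each case, the combined probability lands far from the 0.5 threshold: roughly 0.99 for the first product and roughly 0.01 for the second.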
When you run the classifier to test the logic, it will load the data generated by the MapReduce job and classify 1,000 randomly selected products.