Classification using Naive Bayes Classifier

A classifier assigns an input to one of N classes based on some properties (features) of that input. Classifiers have widespread applications, such as e-mail spam filtering, finding the most promising products, selecting customers for closer interactions, and making decisions in machine learning systems. Let us explore how to implement a classifier using a large dataset. For instance, a spam filter will assign each e-mail to one of two classes: spam mail or not spam mail.

There are many classification algorithms. One of the simplest yet effective algorithms is the Naive Bayes classifier, which uses Bayes' theorem. You can find more information about the Naive Bayes classifier at http://en.wikipedia.org/wiki/Naive_Bayes_classifier and about Bayes' theorem at http://betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/.

For this recipe, we will again focus on the Amazon purchase dataset. We will look at several features of a product, such as the number of reviews received, the number of positive ratings, and the number of known similar items, to predict whether the product will fall within the first 10,000 products by sales rank. We will use the Naive Bayes classifier for the classification.

Getting ready

The following steps describe how to prepare to run the Naive Bayes example:

  1. This recipe assumes that you have followed Chapter 1, Getting Hadoop up and running in a Cluster, and have installed Hadoop. We will use HADOOP_HOME to refer to the Hadoop installation directory.
  2. Start Hadoop by following the instructions in the first chapter.
  3. This recipe assumes you are aware of how Hadoop processing works. If you have not already done so, you should follow the Writing the WordCount MapReduce sample, bundling it and running it using standalone Hadoop recipe from Chapter 1, Getting Hadoop up and running in a Cluster.

How to do it...

The following steps describe how to run the Naive Bayes example:

  1. Download the dataset from Amazon product co-purchasing network metadata, http://snap.stanford.edu/data/amazon-meta.html and unzip it. We will call this DATA_DIR.
  2. Upload the data to HDFS by running the following commands from HADOOP_HOME. If the /data directory is already there, clean it up.
    $ bin/hadoop dfs -mkdir /data
    $ bin/hadoop dfs -mkdir /data/input1
    $ bin/hadoop dfs -put <DATA_DIR>/amazon-meta.txt /data/input1
    
  3. Unzip the source code for chapter 8 (chapter8.zip). We will call that folder CHAPTER_8_SRC.
  4. Change the hadoop.home property in the CHAPTER_8_SRC/build.xml file to point to your Hadoop installation directory.
  5. Compile the source by running the ant build command from the CHAPTER_8_SRC directory.
  6. Copy build/lib/hadoop-cookbook-chapter8.jar to your HADOOP_HOME.
  7. Run the MapReduce job through the following command from HADOOP_HOME.
    $ bin/hadoop jar hadoop-cookbook-chapter8.jar chapter8.NavieBayesProductClassifer /data/input1 /data/output5
    
  8. Read the results by running the following command.
    $ bin/hadoop dfs -cat /data/output5/*
    
  9. You will see that it prints results similar to the following. You can use these values with the Naive Bayes classifier to classify the inputs.
    postiveReviews>30       0.635593220338983
    reviewCount>60  0.62890625
    similarItemCount>150    0.5720620842572062
  10. Verify the classifier using the following command.
    $ bin/hadoop -cp hadoop-cookbook-chapter8.jar chapter8.NavieBayesProductClassifer
    

How it works...

The goal of this recipe is to look at some properties of a product and predict whether it will fall under the first 10,000 products at Amazon by the sales rank. We call these properties features, and for this sample we will focus on the following three properties:

  • The number of reviews for a given product (p.reviewCount)
  • The number of positive reviews for a given product (p.positiveReviews)
  • The number of similar items for a given product (p.similarItemCount)

In the following discussion, we will write P(p.salesrank < 10000) to mean the probability that the given item p is within the first 10,000 products, and similarly for the other properties.

In this recipe, given a new product p, we want to find P(p.salesrank < 10000) based on the statistical distribution of the features across products. Furthermore, we need to use MapReduce for the calculations.

The first step is to understand the equation for the Naive Bayes classifier. If Ap, Bp, and Cp are any three independent events (for example, Ap means p.reviewCount > 60) about a product p, and a, b, and c are defined as follows, then we can write the equation for the Naive Bayes classifier.

a = P(Ap / p.salesrank < 10000)

b = P(Bp / p.salesrank < 10000)

c = P(Cp / p.salesrank < 10000)

Here the slash denotes conditional probability. For example, we read the first line as: a is the probability of event Ap occurring given that p.salesrank < 10000 has already occurred.

Then, using Bayes' theorem, we can write the following equation, which provides the probability that the product will have a sales rank within the first 10,000 given the three independent events Ap, Bp, and Cp:

P(p.salesrank < 10000 / Ap and Bp and Cp) = abc / (abc + (1-a).(1-b).(1-c))
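To make the combination rule concrete, here is a minimal standalone sketch (plain Java, outside Hadoop; the class and method names are illustrative only) that plugs the probabilities printed by the MapReduce job into this formula:

```java
public class BayesCombination {
    // p = abc / (abc + (1-a)(1-b)(1-c)): the Naive Bayes combination
    // of the three conditional probabilities a, b, and c.
    static double combine(double a, double b, double c) {
        double abc = a * b * c;
        return abc / (abc + (1 - a) * (1 - b) * (1 - c));
    }

    public static void main(String[] args) {
        // The probabilities printed by the MapReduce job in step 9
        double a = 0.62890625;          // reviewCount>60
        double b = 0.635593220338983;   // postiveReviews>30
        double c = 0.5720620842572062;  // similarItemCount>150
        // A product for which all three events hold scores about 0.8,
        // well above the 0.5 threshold used by the classifier.
        System.out.println(BayesCombination.combine(a, b, c));
    }
}
```

Note that when a = b = c = 0.5 the formula returns exactly 0.5, that is, features that are equally likely in both classes carry no information.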

Now let us focus on the real calculation. As Ap, Bp, and Cp, we will use the following events:

  • Ap: The given item has more than 60 reviews
  • Bp: The given item has more than 30 positive reviews
  • Cp: The given item has more than 150 similar items

Then, we first run the MapReduce task to calculate the probabilities a = P(Ap / p.salesrank < 10000), b = P(Bp / p.salesrank < 10000), and c = P(Cp / p.salesrank < 10000). Then, we will use those with the above formula to classify a given product. You can find the source for the classifier in src/chapter8/NavieBayesProductClassifer.java. The mapper function looks like the following:

private static final Text TOTAL = new Text("total");
private static final Text RCOUNT_GT_60 = 
  new Text("reviewCount>60");
private static final Text PREIVEWS_GT_30 = 
  new Text("postiveReviews>30");
private static final Text SIMILAR_ITEMS_GT_150 = 
  new Text("similarItemCount>150");

public void map(Object key, Text value, Context context) throws IOException, InterruptedException
{
  List<AmazonCustomer> customerList = AmazonCustomer.parseAItemLine(value.toString());
  int salesRank = -1;
  int reviewCount = 0;
  int postiveReviews = 0;
  int similarItemCount = 0;

  for (AmazonCustomer customer : customerList)
  {
    ItemData itemData = customer.itemsBrought.iterator().next();
    reviewCount++;
    if (itemData.rating > 3)
    {
      postiveReviews++;
    }
    similarItemCount = similarItemCount +
      itemData.similarItems.size();
    if (salesRank == -1)
    {
      salesRank = itemData.salesrank;
    }
  }

  boolean isInFirst10k = (salesRank <= 10000);
  context.write(TOTAL, new BooleanWritable(isInFirst10k));
  if (reviewCount > 60)
  {
    context.write(RCOUNT_GT_60,
      new BooleanWritable(isInFirst10k));
  }
  if (postiveReviews > 30)
  {
    context.write(PREIVEWS_GT_30,
      new BooleanWritable(isInFirst10k));
  }
  if (similarItemCount > 150)
  {
    context.write(SIMILAR_ITEMS_GT_150,
      new BooleanWritable(isInFirst10k));
  }
}

The mapper function walks through each product, and for each product, it evaluates the features. If a feature evaluates to true, it emits the feature name as the key and whether the product is within the first 10,000 products as the value.

MapReduce invokes the reducer once for each feature. Each reducer invocation then receives all the values for which the feature is true, and it calculates the probability that, given the feature is true, the product is within the first 10,000 products by sales rank.

public void reduce(Text key, Iterable<BooleanWritable> values, Context context) throws IOException, InterruptedException 
{
  int total = 0;
  int matches = 0;
  for (BooleanWritable value : values) 
  {
    total++;
    if (value.get()) {
      matches++;
    }
  }
  context.write(new Text(key),
    new DoubleWritable((double) matches / total));
}
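Outside Hadoop, the reducer's computation is simply the fraction of matching records. The following small standalone sketch (the class name is hypothetical) reproduces that calculation on a plain list of booleans:

```java
import java.util.List;

public class FeatureProbability {
    // Mirrors the reducer: given the flags emitted for one feature,
    // returns the fraction of products that are also within the
    // first 10,000 sales ranks.
    static double probability(List<Boolean> inFirst10k) {
        int total = 0;
        int matches = 0;
        for (boolean value : inFirst10k) {
            total++;
            if (value) {
                matches++;
            }
        }
        return (double) matches / total;
    }

    public static void main(String[] args) {
        // Three of the four products with this feature are in the
        // first 10,000, so the estimated probability is 0.75
        System.out.println(probability(List.of(true, true, false, true)));
    }
}
```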

Given a product, we will examine it and decide the following:

  • Does it have more than 60 reviews?
  • Does it have more than 30 positive reviews?
  • Does it have more than 150 similar items?

We use the above to decide the events Ap, Bp, and Cp, and we can calculate a, b, and c accordingly using the probabilities calculated by the MapReduce task. The following code implements this logic:

public static boolean classifyItem(int similarItemCount, int reviewCount, int postiveReviews)
{
  double reviewCountGT60 = 0.8;
  double postiveReviewsGT30 = 0.9;
  double similarItemCountGT150 = 0.7;
  double a, b, c;

  if (reviewCount > 60)
  {
    a = reviewCountGT60;
  }
  else
  {
    a = 1 - reviewCountGT60;
  }
  if (postiveReviews > 30)
  {
    b = postiveReviewsGT30;
  }
  else
  {
    b = 1 - postiveReviewsGT30;
  }
  if (similarItemCount > 150)
  {
    c = similarItemCountGT150;
  }
  else
  {
    c = 1 - similarItemCountGT150;
  }
  double p = a*b*c / (a*b*c + (1-a)*(1-b)*(1-c));
  return p > 0.5;
}

When you run the classifier to test the logic, it will load the data generated by the MapReduce job and classify 1,000 randomly selected products.
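As a rough illustration of what that verification step does, the following standalone sketch copies the classifyItem logic, substituting the probabilities printed in step 9 for the placeholder values, and classifies randomly generated feature triples. The feature ranges and class name are assumptions for illustration, not the book's actual test harness:

```java
import java.util.Random;

public class ClassifierCheck {
    // Same logic as the recipe's classifyItem, but using the
    // probabilities the MapReduce job printed in step 9.
    static boolean classifyItem(int similarItemCount, int reviewCount, int postiveReviews) {
        double a = (reviewCount > 60) ? 0.62890625 : 1 - 0.62890625;
        double b = (postiveReviews > 30) ? 0.635593220338983 : 1 - 0.635593220338983;
        double c = (similarItemCount > 150) ? 0.5720620842572062 : 1 - 0.5720620842572062;
        double p = a * b * c / (a * b * c + (1 - a) * (1 - b) * (1 - c));
        return p > 0.5;
    }

    public static void main(String[] args) {
        Random random = new Random(0);
        int positive = 0;
        int samples = 1000;
        for (int i = 0; i < samples; i++) {
            // Hypothetical feature ranges, chosen only to exercise
            // both sides of each threshold
            if (classifyItem(random.nextInt(300), random.nextInt(120), random.nextInt(60))) {
                positive++;
            }
        }
        System.out.println(positive + " of " + samples + " classified within the first 10,000");
    }
}
```

A product with all three features above their thresholds is classified positive, and one with all three below is classified negative, which matches the combined probabilities of roughly 0.8 and 0.2 respectively.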
