Document classification using Mahout Naive Bayes classifier

Classification assigns documents or data items to a set of classes that is known in advance, together with the properties of those classes. Document classification or categorization is used when we need to assign documents to one or more categories. This is a frequent use case in information retrieval as well as in library science.

The Classification using Naive Bayes classifier recipe in Chapter 8, Classifications, Recommendations, and Finding Relationships, provides a more detailed description of classification use cases and gives you an overview of the Naive Bayes classifier algorithm. This recipe focuses on the classification support in Apache Mahout for text documents.

Getting ready

Install and deploy Hadoop MapReduce and HDFS. Export the HADOOP_HOME environment variable to point to your Hadoop installation root folder.

Download and install Apache Mahout. Export the MAHOUT_HOME environment variable to point to your Mahout installation root folder. Refer to the Installing Mahout recipe of Chapter 5, Hadoop Ecosystem, for more information on installing Mahout.
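
For example, if Hadoop is installed in /opt/hadoop-1.0.4 and Mahout in /opt/mahout-distribution-0.7 (these paths are only placeholders; adjust them to match your installation), the environment variables can be set as follows:

># Placeholder paths; adjust them to match your installation
>export HADOOP_HOME=/opt/hadoop-1.0.4
>export MAHOUT_HOME=/opt/mahout-distribution-0.7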

How to do it...

The following steps use the Apache Mahout Naive Bayes algorithm to classify the 20news dataset:

  1. Follow the Creating TF and TF-IDF vectors for the text data recipe in this chapter and generate TF-IDF vectors for the 20news dataset. We assume that the TF-IDF vectors are in the 20news-vectors/tfidf-vectors folder in HDFS.
  2. Change the directory to MAHOUT_HOME.
  3. Split the data into training and test datasets:
    >bin/mahout split \
        -i 20news-vectors/tfidf-vectors \
        --trainingOutput 20news-train-vectors \
        --testOutput 20news-test-vectors \
        --randomSelectionPct 40 --overwrite --sequenceFiles
    
  4. Train the Naive Bayes model:
    >bin/mahout trainnb \
        -i 20news-train-vectors -el \
        -o model \
        -li labelindex
    
  5. Test the classification on the test dataset (a quick way to verify the outputs is shown after these steps):
    >bin/mahout testnb \
        -i 20news-test-vectors \
        -m model \
        -l labelindex \
        -o 20news-testing
    
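Once the testnb command completes, it prints the classification accuracy statistics and a confusion matrix for the test dataset to the console. As a quick sanity check, you can also list the generated HDFS directories; the following commands assume the default paths used in the preceding steps:

>$HADOOP_HOME/bin/hadoop fs -ls 20news-train-vectors 20news-test-vectors
>$HADOOP_HOME/bin/hadoop fs -ls 20news-testing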

How it works...

The Mahout split command can be used to split a dataset into a training dataset and a test dataset. It works with text datasets as well as with Hadoop SequenceFile datasets. Following is the usage of the Mahout split command. You can use the --help option with the split command to print all of its options:

>bin/mahout split \
    -i <input data directory> \
    --trainingOutput <HDFS path to store the training dataset> \
    --testOutput <HDFS path to store the test dataset> \
    --randomSelectionPct <percentage to be selected as test data> \
    --sequenceFiles

The --sequenceFiles option specifies that the input dataset is in the Hadoop SequenceFile format.
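
If you want to verify that your input vectors are in the expected format, Mahout's seqdumper utility prints the contents of a SequenceFile to the console. For example, the following command dumps the first few entries (assuming the vector location used in the earlier steps; point it at an individual part file if your Mahout version does not accept a directory):

>bin/mahout seqdumper -i 20news-vectors/tfidf-vectors | head -n 20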

Following is the usage of the Mahout Naive Bayes classifier training command. The -el option instructs Mahout to extract the class labels from the input dataset:

>bin/mahout trainnb \
    -i <HDFS path to the training dataset> \
    -el \
    -o <HDFS path to store the trained classifier model> \
    -li <path to store the label index>
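
The label index written through the -li option is a small SequenceFile that maps each category label to a numeric identifier. It can be inspected with the same seqdumper utility, for example:

>bin/mahout seqdumper -i labelindex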

Following is the usage of the Mahout Naive Bayes classifier testing command:

>bin/mahout testnb \
    -i <HDFS path to the test dataset> \
    -m <HDFS path to the classifier model> \
    -l <path to the label index> \
    -o <path to store the test result>
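
The per-document test results written to the output path are also stored in SequenceFile format, so they too can be examined with seqdumper; for example, with the paths used in the earlier steps:

>bin/mahout seqdumper -i 20news-testing | head -n 20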

See also

The Classification using Naive Bayes classifier recipe in Chapter 8, Classifications, Recommendations, and Finding Relationships, and the Creating TF and TF-IDF vectors for the text data recipe in this chapter.