Classification assigns documents or data items to a predefined set of classes with known properties. Document classification, or categorization, is used when we need to assign documents to one or more categories. This is a frequent use case in information retrieval as well as in library science.
The Classification using Naive Bayes classifier recipe in Chapter 8, Classifications, Recommendations, and Finding Relationships, provides a more detailed description of classification use cases and gives you an overview of the Naive Bayes classifier algorithm. This recipe focuses on the classification support in Apache Mahout for text documents.
Install and deploy Hadoop MapReduce and HDFS. Export the HADOOP_HOME
environment variable to point to your Hadoop installation root folder.
Download and install Apache Mahout. Export the MAHOUT_HOME
environment variable to point to your Mahout installation root folder. Refer to the Installing Mahout recipe of Chapter 5, Hadoop Ecosystem, for more information on installing Mahout.
The following steps use the Apache Mahout Naive Bayes algorithm to classify the 20news dataset:

1. Generate TF-IDF vectors for the 20news dataset. We assume that the TF-IDF vectors are in the 20news-vectors/tfidf-vectors folder of the HDFS.
2. Go to the MAHOUT_HOME directory.
3. Split the dataset into a training dataset and a test dataset:
>bin/mahout split -i 20news-vectors/tfidf-vectors --trainingOutput 20news-train-vectors --testOutput 20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles
4. Train the Naive Bayes model using the training dataset:
>bin/mahout trainnb -i 20news-train-vectors -el -o model -li labelindex
5. Test the classifier using the test dataset:
>bin/mahout testnb -i 20news-test-vectors -m model -l labelindex -o 20news-testing
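The split, trainnb, and testnb commands above form a standard hold-out evaluation pipeline: split the vectors, train a model on the training portion, and classify the held-out portion. The following plain-Python sketch illustrates the core idea of multinomial Naive Bayes with Laplace smoothing over simple term counts; it is a conceptual illustration only, not Mahout's implementation (which operates on TF-IDF weighted SequenceFile vectors):

```python
import math
from collections import defaultdict

# Toy bag-of-words training data: (label, {term: count}) pairs.
# Conceptual sketch of Naive Bayes training and classification,
# not Mahout's actual implementation.
train = [
    ("sport", {"ball": 3, "team": 2}),
    ("sport", {"team": 1, "score": 2}),
    ("tech",  {"code": 2, "bug": 1}),
    ("tech",  {"code": 1, "compile": 2}),
]

vocab = {t for _, doc in train for t in doc}
class_term_counts = defaultdict(lambda: defaultdict(int))
class_doc_counts = defaultdict(int)
for label, doc in train:
    class_doc_counts[label] += 1
    for term, count in doc.items():
        class_term_counts[label][term] += count

def classify(doc):
    best_label, best_score = None, float("-inf")
    for label in class_doc_counts:
        # log prior + sum of log likelihoods with Laplace (add-one) smoothing
        score = math.log(class_doc_counts[label] / len(train))
        total = sum(class_term_counts[label].values())
        for term, count in doc.items():
            p = (class_term_counts[label][term] + 1) / (total + len(vocab))
            score += count * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify({"ball": 1, "team": 2}))  # sport
```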
The Mahout split command can be used to split a dataset into a training dataset and a test dataset. The Mahout split command works with text datasets as well as with Hadoop SequenceFile datasets. Following is the usage of the Mahout split command. You can use the --help option with the split command to print out all the options:
>bin/mahout split -i <input data directory> --trainingOutput <HDFS path to store the training dataset> --testOutput <HDFS path to store the test dataset> --randomSelectionPct <percentage to be selected as test data> --sequenceFiles
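For example, --randomSelectionPct 40 holds out roughly 40 percent of the input records as test data. Conceptually, the selection works like the following plain-Python sketch (an illustration only, not Mahout code):

```python
import random

def split_dataset(records, test_pct, seed=42):
    """Randomly hold out roughly test_pct percent of records as test data."""
    rng = random.Random(seed)
    train, test = [], []
    for record in records:
        # Each record independently lands in the test set with test_pct probability.
        (test if rng.random() * 100 < test_pct else train).append(record)
    return train, test

train, test = split_dataset(list(range(1000)), 40)
print(len(train), len(test))  # roughly a 60/40 split
```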
The sequenceFiles option specifies that the input dataset is in the Hadoop SequenceFile format.
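Hadoop SequenceFiles are binary files that begin with the three-byte magic b'SEQ' followed by a version byte, so a quick format check is possible even without Hadoop. A minimal sketch (it validates only the magic header, not the full file format):

```python
def looks_like_sequence_file(header: bytes) -> bool:
    """Hadoop SequenceFiles start with the 3-byte magic b'SEQ' (then a version byte)."""
    return header[:3] == b"SEQ"

# Example: check the first bytes read from a candidate file.
print(looks_like_sequence_file(b"SEQ\x06..."))   # True
print(looks_like_sequence_file(b"plain text"))   # False
```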
Following is the usage of the Mahout Naive Bayes classifier training command. The -el option informs Mahout to extract the labels from the input dataset:
>bin/mahout trainnb -i <HDFS path to the training dataset> -el -o <HDFS path to store the trained classifier model> -li <Path to store the label index>
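The label index written via -li maps each extracted class label to an integer ID. Conceptually, for a dataset such as 20news whose document keys carry the category name (the newsgroup directory), building such an index looks like the following plain-Python sketch (an assumption-laden illustration, not Mahout's actual index format):

```python
def build_label_index(paths):
    """Map document keys like '/rec.sport.hockey/104' to integer label IDs.
    The label is taken from the first path component (the category directory)."""
    index = {}
    for path in paths:
        label = path.strip("/").split("/")[0]
        index.setdefault(label, len(index))  # assign IDs in order of first appearance
    return index

paths = ["/rec.sport.hockey/104", "/sci.space/7", "/rec.sport.hockey/211"]
print(build_label_index(paths))  # {'rec.sport.hockey': 0, 'sci.space': 1}
```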
Following is the usage of the Mahout Naive Bayes classifier testing command:
>bin/mahout testnb -i <HDFS path to the test dataset> -m <HDFS path to the classifier model> -l <Path to the label index> -o <path to store the test result>
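The testnb command reports how well the trained model classifies the test data, including the fraction of correctly classified instances and a confusion matrix. Conceptually, these figures come from comparing true and predicted labels, as in this plain-Python sketch (an illustration only, not Mahout code):

```python
from collections import Counter

def evaluate(pairs):
    """Compute accuracy and a confusion matrix from (true, predicted) label pairs."""
    confusion = Counter(pairs)  # (true label, predicted label) -> count
    correct = sum(n for (t, p), n in confusion.items() if t == p)
    return correct / len(pairs), confusion

pairs = [("sport", "sport"), ("sport", "tech"), ("tech", "tech"), ("tech", "tech")]
accuracy, confusion = evaluate(pairs)
print(accuracy)  # 0.75
```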