Classification assigns documents or data items to a predefined set of classes with known properties. Document classification, or categorization, is used when we need to assign documents to one or more categories. This is a frequent use case in information retrieval as well as in library science.
The Classification using Naive Bayes classifier recipe in Chapter 8, Classifications, Recommendations, and Finding Relationships, provides a more detailed description of classification use cases and gives you an overview of the Naive Bayes classifier algorithm. This recipe focuses on the classification support in Apache Mahout for text documents.
Install and deploy Hadoop MapReduce and HDFS. Export the HADOOP_HOME
environment variable to point to your Hadoop installation root folder.
Download and install Apache Mahout. Export the MAHOUT_HOME
environment variable to point to your Mahout installation root folder. Refer to the Installing Mahout recipe of Chapter 5, Hadoop Ecosystem, for more information on installing Mahout.
The following steps use the Apache Mahout Naive Bayes algorithm to classify the 20news dataset:

1. Generate TF-IDF vectors for the 20news dataset. We assume that the TF-IDF vectors are in the 20news-vectors/tfidf-vectors folder of the HDFS.
2. Go to the MAHOUT_HOME directory.
3. Split the dataset into a training dataset and a test dataset:
>bin/mahout split -i 20news-vectors/tfidf-vectors --trainingOutput 20news-train-vectors --testOutput 20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles
4. Train the Naive Bayes model using the training dataset:
>bin/mahout trainnb -i 20news-train-vectors -el -o model -li labelindex
5. Test the classifier using the test dataset:
>bin/mahout testnb -i 20news-test-vectors -m model -l labelindex -o 20news-testing
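The split, trainnb, and testnb commands above form a standard hold-out evaluation pipeline: split the vectors, train a model on the training portion, and classify the held-out portion. The following plain-Python sketch illustrates the core idea of multinomial Naive Bayes with Laplace smoothing over simple term counts; it is a conceptual illustration only, not Mahout's implementation (which operates on TF-IDF weighted SequenceFile vectors):

```python
import math
from collections import defaultdict

# Toy bag-of-words training data: (label, {term: count}) pairs.
# Conceptual sketch of Naive Bayes training and classification,
# not Mahout's actual implementation.
train = [
    ("sport", {"ball": 3, "team": 2}),
    ("sport", {"team": 1, "score": 2}),
    ("tech",  {"code": 2, "bug": 1}),
    ("tech",  {"code": 1, "compile": 2}),
]

vocab = {t for _, doc in train for t in doc}
class_term_counts = defaultdict(lambda: defaultdict(int))
class_doc_counts = defaultdict(int)
for label, doc in train:
    class_doc_counts[label] += 1
    for term, count in doc.items():
        class_term_counts[label][term] += count

def classify(doc):
    best_label, best_score = None, float("-inf")
    for label in class_doc_counts:
        # log prior + sum of log likelihoods with Laplace (add-one) smoothing
        score = math.log(class_doc_counts[label] / len(train))
        total = sum(class_term_counts[label].values())
        for term, count in doc.items():
            p = (class_term_counts[label][term] + 1) / (total + len(vocab))
            score += count * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify({"ball": 1, "team": 2}))  # sport
```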
The Mahout split command can be used to split a dataset into a training dataset and a test dataset. The Mahout split command works with text datasets as well as with Hadoop SequenceFile datasets. Following is the usage of the Mahout split command. You can use the --help option with the split command to print out all the options:
>bin/mahout split -i <input data directory> --trainingOutput <HDFS path to store the training dataset> --testOutput <HDFS path to store the test dataset> --randomSelectionPct <percentage to be selected as test data> --sequenceFiles
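For example, --randomSelectionPct 40 holds out roughly 40 percent of the input records as test data. Conceptually, the selection works like the following plain-Python sketch (an illustration only, not Mahout code):

```python
import random

def split_dataset(records, test_pct, seed=42):
    """Randomly hold out roughly test_pct percent of records as test data."""
    rng = random.Random(seed)
    train, test = [], []
    for record in records:
        # Each record independently lands in the test set with test_pct probability.
        (test if rng.random() * 100 < test_pct else train).append(record)
    return train, test

train, test = split_dataset(list(range(1000)), 40)
print(len(train), len(test))  # roughly a 60/40 split
```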
The sequenceFiles option specifies that the input dataset is in the Hadoop SequenceFile format.
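Hadoop SequenceFiles are binary files that begin with the three-byte magic b'SEQ' followed by a version byte, so a quick format check is possible even without Hadoop. A minimal sketch (it validates only the magic header, not the full file format):

```python
def looks_like_sequence_file(header: bytes) -> bool:
    """Hadoop SequenceFiles start with the 3-byte magic b'SEQ' (then a version byte)."""
    return header[:3] == b"SEQ"

# Example: check the first bytes read from a candidate file.
print(looks_like_sequence_file(b"SEQ\x06..."))   # True
print(looks_like_sequence_file(b"plain text"))   # False
```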
Following is the usage of the Mahout Naive Bayes classifier training command. The -el option informs Mahout to extract the labels from the input dataset:
>bin/mahout trainnb -i <HDFS path to the training dataset> -el -o <HDFS path to store the trained classifier model> -li <Path to store the label index>
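The label index written via -li maps each extracted class label to an integer ID. Conceptually, for a dataset such as 20news whose document keys carry the category name (the newsgroup directory), building such an index looks like the following plain-Python sketch (an assumption-laden illustration, not Mahout's actual index format):

```python
def build_label_index(paths):
    """Map document keys like '/rec.sport.hockey/104' to integer label IDs.
    The label is taken from the first path component (the category directory)."""
    index = {}
    for path in paths:
        label = path.strip("/").split("/")[0]
        index.setdefault(label, len(index))  # assign IDs in order of first appearance
    return index

paths = ["/rec.sport.hockey/104", "/sci.space/7", "/rec.sport.hockey/211"]
print(build_label_index(paths))  # {'rec.sport.hockey': 0, 'sci.space': 1}
```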
Following is the usage of the Mahout Naive Bayes classifier testing command:
>bin/mahout testnb -i <HDFS path to the test dataset> -m <HDFS path to the classifier model> -l <Path to the label index> -o <path to store the test result>
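The testnb command reports how well the trained model classifies the test data, including the fraction of correctly classified instances and a confusion matrix. Conceptually, these figures come from comparing true and predicted labels, as in this plain-Python sketch (an illustration only, not Mahout code):

```python
from collections import Counter

def evaluate(pairs):
    """Compute accuracy and a confusion matrix from (true, predicted) label pairs."""
    confusion = Counter(pairs)  # (true label, predicted label) -> count
    correct = sum(n for (t, p), n in confusion.items() if t == p)
    return correct / len(pairs), confusion

pairs = [("sport", "sport"), ("sport", "tech"), ("tech", "tech"), ("tech", "tech")]
accuracy, confusion = evaluate(pairs)
print(accuracy)  # 0.75
```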