Sentiment classification tries to determine a person's propensity to like or dislike certain items. In this recipe, we will use a naive Bayes classifier from Apache Mahout to determine whether the terms found in a movie review indicate that the movie had a positive or negative reception.
You will need to download, compile, and install the following:
Polarity_dataset_v2.0 from http://www.cs.cornell.edu/people/pabo/movie-review-data/
Extract the movie review dataset review_polarity.tar.gz to the folder you are currently working in. You should see a newly created folder named txt_sentoken. Within that folder there should be two more folders named pos and neg. The pos and neg folders hold text files containing the written reviews of movies. The pos folder contains positive movie reviews, and the neg folder contains negative reviews.
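As a quick sanity check after extraction, the folder layout described above can be verified with a few lines of Python. This is a minimal sketch: the function name count_reviews is hypothetical, and it only assumes that txt_sentoken contains the pos and neg folders with one file per review.

```python
from pathlib import Path


def count_reviews(root="txt_sentoken"):
    """Count the review files in the pos and neg folders under root.

    count_reviews is a hypothetical helper, not part of the dataset or Mahout.
    """
    root = Path(root)
    counts = {}
    for label in ("pos", "neg"):
        folder = root / label
        if not folder.is_dir():
            raise FileNotFoundError(f"expected folder {folder} is missing")
        # Every regular file in the class folder is treated as one review.
        counts[label] = sum(1 for p in folder.iterdir() if p.is_file())
    return counts
```

Calling count_reviews() from the folder that contains txt_sentoken returns a dictionary with one count per class; a missing pos or neg folder raises an error instead of silently returning zero.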
1. Run the reorg_data.py script from the folder you are currently working in to transform the data into training and test sets for the Mahout classifier:
$ ./reorg_data.py txt_sentoken train test
2. Prepare the datasets for the Mahout classifier. This application will read and write to the local filesystem, and not HDFS:
$ mahout prepare20newsgroups -p train -o train_formated -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8
$ mahout prepare20newsgroups -p test -o test_formated -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8
3. Copy the train_formated and test_formated folders into HDFS:
$ hadoop fs -put train_formated /user/hadoop/
$ hadoop fs -put test_formated /user/hadoop/
4. Train the naive Bayes classifier using the train_formated dataset:
$ mahout trainclassifier -i /user/hadoop/train_formated -o /user/hadoop/reviews/naive-bayes-model -type bayes -ng 2 -source hdfs
5. Test the classifier using the test_formated dataset:
$ mahout testclassifier -m /user/hadoop/reviews/naive-bayes-model -d /user/hadoop/test_formated -type bayes -ng 2 -source hdfs -method sequential
6. The testclassifier tool should return a similar summary and confusion matrix. The numbers will not be exactly the same as the ones shown in the following:
Summary
-------------------------------------------------------
Correctly Classified Instances     :    285    71.25%
Incorrectly Classified Instances   :    115    28.75%
Total Classified Instances         :    400
=======================================================
Confusion Matrix
-------------------------------------------------------
a      b      <--Classified as
97     103     |  200    a = pos
12     188     |  200    b = neg
The first two steps prepared the data for the Mahout naive Bayes classifier. The reorg_data.py script distributed the positive and negative reviews from the txt_sentoken folder into a training set and a test set: 80 percent of the reviews were placed into the training set, and the remaining 20 percent were used as the test set. Next, we used the prepare20newsgroups tool to convert the training and test datasets into a format compatible with the Mahout classifier. The example dataset included in Mahout has a layout similar to the data produced by the reorg_data.py script, which is why we can use the prepare20newsgroups tool. All that prepare20newsgroups does is combine all of the files in the pos and neg folders into a single file per dataset class (positive or negative). So, instead of a separate file for each review, we now have two files named pos.txt and neg.txt, which contain all of the positive and all of the negative reviews, respectively.
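The 80/20 split described above can be sketched in a few lines of Python. The source of reorg_data.py is not shown here, so this is only an illustration of the idea, not the script itself: the function name split_train_test, the fixed random seed, and the placeholder file names are all assumptions.

```python
import random


def split_train_test(items, train_fraction=0.8, seed=42):
    """Shuffle items reproducibly and split them into (train, test) lists.

    Hypothetical helper; reorg_data.py may split the reviews differently.
    """
    shuffled = sorted(items)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder names standing in for the 1000 review files of one class.
pos_files = [f"pos_{i:04d}.txt" for i in range(1000)]
train, test = split_train_test(pos_files)
print(len(train), len(test))  # 800 200
```

Applying the same split to each class separately keeps the test set balanced, which matches the 200 positive and 200 negative test reviews reported in the confusion matrix.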
Next, we trained a naive Bayes classifier with an n-gram size of 2, specified with the -ng flag, using the train_formated dataset in HDFS. Mahout trains the classifier by launching a series of MapReduce jobs.
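To make the n-gram size concrete, the following sketch shows what word n-grams of size 2 (bigrams) look like for a short sentence. It illustrates the general concept only; the function word_ngrams and the sample sentence are made up for this example, and Mahout's own tokenization and feature generation may differ in detail.

```python
def word_ngrams(tokens, n=2):
    """Return every contiguous run of n tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the plot was not good".split()
print(word_ngrams(tokens, 2))
# ['the plot', 'plot was', 'was not', 'not good']
```

Bigrams such as "not good" carry sentiment that the individual words "not" and "good" would not convey on their own, which is one reason to try an n-gram size larger than 1 for review text.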
Finally, we ran the testclassifier tool to test the classifier we created in step 4 against the test_formated data in HDFS. As we can see from step 6, we correctly classified 71.25 percent of the test data. It is important to note that this statistic does not mean the classifier will be accurate 71.25 percent of the time for every movie review; it describes performance on this particular test set only. There are a number of ways in which classifiers can be trained and validated, but those techniques go beyond the scope of this book.
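The summary figures can be recomputed from the confusion matrix, which also shows how the errors are distributed between the two classes. The sketch below uses the sample numbers from step 6, with rows as the actual class and columns as the predicted class; the function names are illustrative and not part of Mahout.

```python
# Sample confusion matrix from step 6: confusion[actual][predicted].
confusion = {
    "pos": {"pos": 97, "neg": 103},
    "neg": {"pos": 12, "neg": 188},
}


def accuracy(cm):
    """Fraction of all test reviews that were classified correctly."""
    correct = sum(cm[label][label] for label in cm)
    total = sum(sum(row.values()) for row in cm.values())
    return correct / total


def recall(cm, label):
    """Fraction of the reviews that truly belong to label that were found."""
    return cm[label][label] / sum(cm[label].values())


def precision(cm, label):
    """Fraction of the reviews predicted as label that truly belong to it."""
    predicted = sum(cm[actual][label] for actual in cm)
    return cm[label][label] / predicted


print(f"accuracy: {accuracy(confusion):.2%}")  # accuracy: 71.25%
for label in confusion:
    print(f"{label}: recall {recall(confusion, label):.2%}, "
          f"precision {precision(confusion, label):.2%}")
# pos: recall 48.50%, precision 88.99%
# neg: recall 94.00%, precision 64.60%
```

For this sample run, the overall accuracy hides an imbalance: fewer than half of the positive reviews were recognized as positive, while almost all negative reviews were recognized as negative.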
The testclassifier tool we used in step 5 did not run a MapReduce job; it tested the classifier in local mode. To test the classifier using MapReduce, change the -method parameter to mapreduce:
$ mahout testclassifier -m /user/hadoop/reviews/naive-bayes-model -d /user/hadoop/test_formated -type bayes -ng 2 -source hdfs -method mapreduce