Clustering plays an integral role in data mining computations. Clustering groups together similar items of a dataset by using one or more features of the data items, chosen according to the use case. Document clustering is used in many text mining operations such as document organization, topic identification, and information presentation. Document clustering shares many of its mechanisms and algorithms with traditional data clustering. However, document clustering has its own unique challenges when it comes to determining the features to use for clustering and when building vector space models to represent the text documents.
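To make the idea of a vector space model concrete, the following is a minimal Python sketch that turns a few short documents into sparse TF-IDF vectors. It is an illustration only and not how Mahout builds its vectors: the whitespace tokenizer, the tf * log(N / df) weighting (one common variant among several), and the function name tfidf_vectors are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse {term: weight} vector per document (illustrative only)."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        vectors.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return vectors


docs = ["hadoop cluster", "hadoop mahout"]
vectors = tfidf_vectors(docs)
```

A term that occurs in every document, such as "hadoop" here, receives a weight of zero, which is why TF-IDF features tend to emphasize the terms that distinguish one document from another.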
The Running K-Means with Mahout recipe of Chapter 5, Hadoop Ecosystem, focuses on using Mahout K-Means clustering from Java code to cluster statistics data. The Hierarchical clustering and Clustering an Amazon sales dataset recipes of Chapter 8, Classifications, Recommendations, and Finding Relationships, focus on using clustering to identify customers with similar interests. These three recipes provide a more in-depth understanding of using clustering algorithms in general. This recipe explores two of the several clustering algorithms available in Apache Mahout for document clustering.
Install and deploy Hadoop MapReduce and HDFS. Export the HADOOP_HOME environment variable to point to your Hadoop installation root folder.
Download and install Apache Mahout. Export the MAHOUT_HOME environment variable to point to your Mahout installation root folder. Refer to the Installing Mahout recipe of Chapter 5, Hadoop Ecosystem, for more information on installing Mahout.
The following steps use the Apache Mahout K-Means clustering algorithm to cluster the 20news dataset:
1. Generate TF-IDF vectors for the 20news dataset. We assume that the TF-IDF vectors are in the 20news-vector/tfidf-vectors folder of the HDFS.
2. Go to MAHOUT_HOME.
3. Execute the following command to run the K-Means clustering computation:
>bin/mahout kmeans --input 20news-vector/tfidf-vectors --clusters 20news-seed/clusters --output 20news-km-clusters --distanceMeasure org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -k 10 --maxIter 20 --clustering
4. Execute the following commands to dump the clustering results to a local text file and inspect them:
>bin/mahout clusterdump -i 20news-km-clusters/clusters-*-final -o 20news-clusters-dump -d 20news-vector/dictionary.file-0 -dt sequencefile --pointsDir 20news-km-clusters/clusteredPoints
>cat 20news-clusters-dump
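To show what the K-Means computation does with the number of clusters, the iteration limit, and the squared Euclidean distance, here is a minimal single-machine sketch in Python. It is not Mahout's MapReduce implementation; the function names, the toy two-dimensional points, and the seeding by random sampling are assumptions made for illustration.

```python
import random


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=20, seed=0):
    """Plain K-Means: returns (centroids, cluster index per point)."""
    rng = random.Random(seed)
    # Assumption: seed the centroids with k randomly chosen input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


# Toy data: two obvious groups of two points each.
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
centroids, assignments = kmeans(points, k=2)
```

The loop stops either when no point changes cluster or when max_iter is reached, which mirrors the role of an iteration cap such as --maxIter 20 in the command above.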
The following steps use the Apache Mahout MinHash clustering algorithm to cluster the 20news dataset:
1. Execute the following command to run MinHash clustering on the already vectorized 20news data:
>bin/mahout minhash --input 20news-vector/tfidf-vectors --output minhashout
2. Go to HADOOP_HOME and execute the following command to inspect the MinHash clustering results:
>bin/hadoop dfs -cat minhashout/part*
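The idea behind MinHash clustering can be sketched in a few lines of Python: each document is reduced to a short signature of minimum hash values, and documents with matching signature values are likely to share many terms. The sketch below is a simplified illustration, not Mahout's implementation; the hash family (a * t + b) mod p, the number of hash functions, and the toy term-ID sets are assumptions.

```python
import random
from collections import defaultdict

PRIME = 2_147_483_647  # 2**31 - 1, used as the modulus of the hash family


def make_hash_functions(n, seed=42):
    """Create n hash functions h(t) = (a * t + b) mod PRIME as (a, b) pairs."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]


def minhash_signature(term_ids, hash_functions):
    """Signature = the minimum hash value of the term set under each function."""
    return tuple(
        min((a * t + b) % PRIME for t in term_ids) for a, b in hash_functions
    )


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


hash_functions = make_hash_functions(16)
# Hypothetical documents, each represented as a set of term IDs.
docs = {
    "doc_a": {1, 2, 3, 4},
    "doc_b": {1, 2, 3, 4},
    "doc_c": {10, 11, 12},
}
signatures = {name: minhash_signature(t, hash_functions) for name, t in docs.items()}

# Group documents whose full signatures collide into the same cluster.
clusters = defaultdict(list)
for name, sig in signatures.items():
    clusters[sig].append(name)
```

Grouping on the full signature is the strictest possible choice; a practical implementation would typically group on a few signature values at a time so that similar, not only identical, documents land in the same cluster.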
The following is the usage of the Mahout K-Means algorithm:
>bin/mahout kmeans --input <tfidf vector input> --clusters <seed clusters> --output <HDFS path for output> --distanceMeasure <distance measure> -k <number of clusters> --maxIter <maximum number of iterations> --clustering
Mahout will generate random seed clusters when an empty HDFS folder path is given to the --clusters option. Mahout supports several different distance calculation methods, such as Euclidean, Cosine, and Manhattan.
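The distance measures mentioned above differ in how they compare two vectors. The following Python sketch shows the textbook definitions for dense vectors; it is meant to clarify the formulas and is not Mahout's code, and the function names are assumptions.

```python
import math


def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; depends on direction only, not vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


d_euclidean = euclidean([0.0, 0.0], [3.0, 4.0])
d_manhattan = manhattan([0.0, 0.0], [3.0, 4.0])
d_cosine_orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
d_cosine_parallel = cosine_distance([1.0, 2.0], [2.0, 4.0])
```

Because the cosine distance ignores vector length, a long document and a short document that use the same terms in the same proportions are treated as near-identical, which is often a desirable property for TF-IDF vectors.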
The following is the usage of the Mahout clusterdump command:
>bin/mahout clusterdump -i <HDFS path to clusters> -o <local path for text output> -d <dictionary mapping for the vector data points> -dt <dictionary file type (sequencefile or text)> --pointsDir <directory containing the input vectors to clusters mapping>
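The dictionary is what makes the dump readable: cluster centroids are stored as weights keyed by term index, and the dictionary maps each index back to its term. The following Python sketch illustrates that lookup for the highest-weighted terms of a single centroid. The data, the function name top_terms, and the in-memory dictionaries are hypothetical; this is not the clusterdump implementation or its file format.

```python
def top_terms(centroid, dictionary, n=3):
    """Return the n highest-weighted (term, weight) pairs of a centroid."""
    ranked = sorted(centroid.items(), key=lambda item: item[1], reverse=True)
    return [(dictionary[index], weight) for index, weight in ranked[:n]]


# Hypothetical centroid (term index -> weight) and dictionary (index -> term).
centroid = {0: 0.1, 1: 0.9, 2: 0.5}
dictionary = {0: "the", 1: "hockey", 2: "team"}
best = top_terms(centroid, dictionary, n=2)
```

Reading the top terms of each cluster in this way is a quick sanity check on whether the clusters correspond to recognizable topics.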
The following is the usage of the Mahout MinHash clustering algorithm:
>bin/mahout minhash --input <tfidf vector input> --output <HDFS path for output>