Clustering the text data

Clustering plays an integral role in data mining computations. Clustering groups together similar items of a dataset by using one or more features of the data items, chosen according to the use case. Document clustering is used in many text mining operations such as document organization, topic identification, information presentation, and so on. Document clustering shares many mechanisms and algorithms with traditional data clustering. However, document clustering has its own challenges when it comes to determining the features to use for clustering and when building vector space models to represent the text documents.

The Running K-Means with Mahout recipe of Chapter 5, Hadoop Ecosystem, focuses on using Mahout K-Means clustering from Java code to cluster a statistics dataset. The Hierarchical clustering and Clustering an Amazon sales dataset recipes of Chapter 8, Classifications, Recommendations, and Finding Relationships, focus on using clustering to identify customers with similar interests. These three recipes provide a more in-depth understanding of using clustering algorithms in general. This recipe explores two of the several clustering algorithms available in Apache Mahout for document clustering.

Getting ready

Install and deploy Hadoop MapReduce and HDFS. Export the HADOOP_HOME environment variable to point to your Hadoop installation root folder.

Download and install Apache Mahout. Export the MAHOUT_HOME environment variable to point to your Mahout installation root folder. Refer to the Installing Mahout recipe of Chapter 5, Hadoop Ecosystem, for more information on installing Mahout.

How to do it...

The following steps use the Apache Mahout K-Means clustering algorithm to cluster the 20news dataset:

  1. Follow the Creating TF and TF-IDF vectors for the text data recipe in this chapter and generate TF-IDF vectors for the 20news dataset. We assume that the TF-IDF vectors are in the 20news-vector/tfidf-vectors folder of the HDFS.
  2. Go to the MAHOUT_HOME directory.
  3. Execute the following command to run the K-Means clustering computation:
    >bin/mahout kmeans 
      --input 20news-vector/tfidf-vectors 
      --clusters 20news-seed/clusters 
      --output 20news-km-clusters 
      --distanceMeasure org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure 
      -k 10 
      --maxIter 20 
      --clustering
    
  4. Execute the following command to convert the clusters to text:
    >bin/mahout clusterdump 
      -i 20news-km-clusters/clusters-*-final 
      -o 20news-clusters-dump 
      -d 20news-vector/dictionary.file-0 
      -dt sequencefile 
      --pointsDir 20news-km-clusters/clusteredPoints
    
    >cat 20news-clusters-dump
    

The following steps use the Apache Mahout MinHash clustering algorithm to cluster the 20news dataset:

  1. Execute the following command to run MinHash clustering on the already vectorized 20news dataset:
    >bin/mahout minhash 
      --input 20news-vector/tfidf-vectors 
      --output minhashout
    
  2. Go to HADOOP_HOME and execute the following command to inspect the MinHash clustering results:
    >bin/hadoop dfs -cat minhashout/part*
    

How it works...

The following is the usage of the Mahout K-Means algorithm:

>bin/mahout kmeans
  --input <tfidf vector input>
  --clusters <seed clusters>
  --output <HDFS path for output>
  --distanceMeasure <distance measure>
  -k <number of clusters>
  --maxIter <maximum number of iterations>
  --clustering

Mahout will generate random seed clusters when an empty HDFS folder path is given to the --clusters option. Mahout supports several different distance calculation methods such as Euclidean, Cosine, Manhattan, and so on.
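
To make the roles of the distance measure, the number of clusters, and the maximum iteration count more concrete, the following is a minimal in-memory sketch of K-Means in Python. It is not Mahout's implementation, which runs as MapReduce jobs over vectors stored in HDFS. The function names, the toy points, and the choice to define cosine distance as one minus the cosine similarity are assumptions made for this illustration.

```python
import math
import random


def squared_euclidean(a, b):
    # Sum of squared coordinate differences (no square root taken).
    return sum((x - y) ** 2 for x, y in zip(a, b))


def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    # Assumed here to be 1 - cosine similarity; zero vectors get distance 1.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def kmeans(points, k, distance, max_iter=20, seed=0):
    # Pick k random input points as the seed centroids.
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # Converged: no point changed its cluster.
        assignment = new_assignment
        # Move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


# Hypothetical toy data: two well-separated groups of 2D points.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids, assignment = kmeans(points, k=2, distance=squared_euclidean)
print(assignment)
```

Passing the distance as a function mirrors the way the --distanceMeasure option selects a measure by class name. Swapping in manhattan or cosine_distance changes only how nearness is computed, not the loop itself.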

The following is the usage of the Mahout clusterdump command:

>bin/mahout clusterdump
  -i <HDFS path to clusters>
  -o <local path for text output>
  -d <dictionary mapping for the vector data points>
  -dt <dictionary file type (sequencefile or text)>
  --pointsDir <directory containing the input vectors to
                  clusters mapping>

The following is the usage of the Mahout MinHash clustering algorithm:

>bin/mahout minhash
     --input  <tfidf vector input>
     --output <HDFS path for output>
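
This recipe does not describe how MinHash works internally, so the following Python sketch illustrates only the general technique: each document is reduced to a short signature of minimum hash values, and the fraction of matching signature positions estimates the Jaccard similarity of the documents' term sets. It is not Mahout's implementation. The hash family, the number of hash functions, and the integer term IDs are all assumptions chosen for this example.

```python
import random


def make_hash_functions(n, seed=42, prime=2_147_483_647):
    # Build n hash functions of the form h(x) = (a * x + b) mod prime.
    rng = random.Random(seed)
    params = [(rng.randrange(1, prime), rng.randrange(0, prime)) for _ in range(n)]
    return [lambda x, a=a, b=b: (a * x + b) % prime for a, b in params]


def minhash_signature(term_ids, hash_functions):
    # For every hash function, keep the smallest hash value over the terms.
    return tuple(min(h(t) for t in term_ids) for h in hash_functions)


def estimated_jaccard(sig_a, sig_b):
    # Fraction of signature positions on which the two documents agree.
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def jaccard(a, b):
    # Exact Jaccard similarity of two sets, for comparison.
    return len(a & b) / len(a | b)


# Hypothetical documents represented as sets of integer term IDs.
doc_a = {1, 2, 3, 4}
doc_b = {2, 3, 4, 5}
hashes = make_hash_functions(100)
sig_a = minhash_signature(doc_a, hashes)
sig_b = minhash_signature(doc_b, hashes)
print("exact:", jaccard(doc_a, doc_b))
print("estimated:", estimated_jaccard(sig_a, sig_b))
```

Documents whose signatures agree on many positions are likely to share many terms, which is what makes hashing-based grouping of similar documents possible without comparing every pair directly.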

See also

  • Running K-Means with Mahout in Chapter 5, Hadoop Ecosystem.