Running K-means with Mahout

K-means is a clustering algorithm. A clustering algorithm takes data points defined in an N-dimensional space and groups them into clusters based on the distances between those data points. A cluster is a set of data points such that the distances between the data points inside the cluster are much smaller than the distances from those data points to data points outside the cluster. More details about K-means clustering can be found in lecture 4 (http://www.youtube.com/watch?v=1ZDybXl212Q) of the Cluster Computing and MapReduce lecture series by Google.

In this recipe, we will use a data set from the Human Development Report (HDR), organized by country. The HDR describes different countries based on several human development measures. You can find the data set at http://hdr.undp.org/en/statistics/data/.

This recipe will use K-means to cluster countries based on the HDR dimensions.

Getting ready

This recipe needs a Mahout installation. Follow the previous recipe if you have not already installed Mahout.

How to do it...

This section demonstrates how to use the Mahout K-means algorithm to process a dataset.

  1. Unzip the sample code distribution. We will call this SAMPLE5_DIR.
  2. Set the mahout.home property in the build.xml file of the sample distribution to your MAHOUT_HOME directory.
  3. The chapter5.KMeanSample.java class shows sample code for running the K-means algorithm using our own dataset:
    public final class KMean extends AbstractJob {

    The following code initializes the K-means algorithm with the right values:

    public static void main(String[] args) throws Exception
    {
      Path output = new Path("output");
      Configuration conf = new Configuration();
      HadoopUtil.delete(conf, output);
      run(conf, new Path("testdata"), output, 
        new EuclideanDistanceMeasure(), 6, 0.5, 10);
    }

    The following code shows how to set up K-means from Java code:

    public static void run(Configuration conf, Path input, Path output,
        DistanceMeasure measure, int k, double convergenceDelta,
        int maxIterations) throws Exception {
      // Convert the text input into Mahout vectors.
      Path directoryContainingConvertedInput = new Path(output,
          DIRECTORY_CONTAINING_CONVERTED_INPUT);
      log.info("Preparing Input");
      InputDriver.runJob(input, directoryContainingConvertedInput,
          "org.apache.mahout.math.RandomAccessSparseVector");

      // Pick k random points as the initial cluster centers.
      log.info("Running random seed to get initial clusters");
      Path clusters = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);
      clusters = RandomSeedGenerator.buildRandom(conf,
          directoryContainingConvertedInput, clusters, k, measure);

      log.info("Running KMeans");
      KMeansDriver.run(conf, directoryContainingConvertedInput,
          clusters, output, measure, convergenceDelta, maxIterations,
          true, false);

      // run ClusterDumper
      ClusterDumper clusterDumper = new ClusterDumper(
          finalClusterPath(conf, output, maxIterations),
          new Path(output, "clusteredPoints"));
      clusterDumper.printClusters(null);
    }
    ...
    }
  4. Compile the sample by running the following command:
    >ant mahout-build
    
  5. From samples, copy the file resources/chapter5/countries4Kmean.data to the MAHOUT_HOME/testdata directory.
  6. Run the sample by running the following command:
    >ant kmeans-run
    

How it works...

The preceding sample shows how you can configure and use the Mahout K-means implementation from Java. When we run the code, it initializes the K-means MapReduce job and executes it using the MapReduce framework.
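
To make the roles of the k, convergenceDelta, and maxIterations parameters more concrete, the following is a minimal single-machine sketch of the K-means iteration in plain Java. It is not Mahout's implementation and does not use MapReduce. The class name KMeansSketch, its method names, and the tiny two-dimensional data set are assumptions made for illustration only. Here k is simply the number of initial centers passed in, and the loop stops when no center moves farther than convergenceDelta or when maxIterations is reached.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of Mahout or the sample code.
public class KMeansSketch {

    // Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Index of the center closest to the given point.
    static int nearest(double[] point, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (distance(point, centers[c]) < distance(point, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    // Runs K-means from the given initial centers and returns the final centers.
    static double[][] cluster(double[][] points, double[][] initialCenters,
                              double convergenceDelta, int maxIterations) {
        int k = initialCenters.length;
        int dim = points[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = initialCenters[c].clone();
        }

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: add each point to the sum of its nearest center.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {
                int c = nearest(p, centers);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }

            // Update step: move each center to the mean of its assigned points.
            boolean converged = true;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                double[] updated = new double[dim];
                for (int d = 0; d < dim; d++) {
                    updated[d] = sums[c][d] / counts[c];
                }
                if (distance(updated, centers[c]) > convergenceDelta) {
                    converged = false;
                }
                centers[c] = updated;
            }
            if (converged) {
                break;
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {{1.0, 1.0}, {1.0, 2.0}, {9.0, 9.0}, {9.0, 10.0}};
        double[][] initial = {{0.0, 0.0}, {10.0, 10.0}};
        double[][] centers = cluster(points, initial, 0.5, 10);
        System.out.println(Arrays.deepToString(centers));
        // prints [[1.0, 1.5], [9.0, 9.5]]
    }
}
```

In this sketch the initial centers are supplied by the caller, whereas the recipe's code selects them randomly with RandomSeedGenerator. A larger convergenceDelta stops the loop earlier at the cost of less precise centers.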
