K-means is a clustering algorithm. A clustering algorithm takes data points defined in an N-dimensional space and groups them into clusters based on the distances between those points. A cluster is a set of data points in which the distance between points inside the cluster is much smaller than the distance from those points to points outside the cluster. More details about K-means clustering can be found in lecture 4 (http://www.youtube.com/watch?v=1ZDybXl212Q) of the Cluster Computing and MapReduce lecture series by Google.
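To make the idea concrete, the following is a minimal plain-Java sketch of a single K-means iteration: each point is assigned to its nearest centroid by Euclidean distance, and each centroid is then moved to the mean of its assigned points. It does not use Mahout. The class name KMeansSketch, the sample points, and the starting centroids are made up for illustration only.

```java
import java.util.Arrays;

class KMeansSketch {

    // Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Index of the centroid closest to the given point.
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (distance(point, centroids[c]) < distance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    // One K-means iteration: assign every point to its nearest centroid,
    // then move each centroid to the mean of the points assigned to it.
    static double[][] iterate(double[][] points, double[][] centroids) {
        int k = centroids.length;
        int dim = points[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (double[] p : points) {
            int c = nearest(p, centroids);
            counts[c]++;
            for (int i = 0; i < dim; i++) {
                sums[c][i] += p[i];
            }
        }
        double[][] updated = new double[k][];
        for (int c = 0; c < k; c++) {
            if (counts[c] == 0) {
                // An empty cluster keeps its previous centroid.
                updated[c] = centroids[c].clone();
            } else {
                updated[c] = new double[dim];
                for (int i = 0; i < dim; i++) {
                    updated[c][i] = sums[c][i] / counts[c];
                }
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        // Hypothetical 2-D points forming two visually obvious groups.
        double[][] points = {{1, 1}, {1, 2}, {2, 1}, {8, 8}, {9, 8}, {8, 9}};
        double[][] centroids = {{0, 0}, {10, 10}}; // arbitrary starting guesses
        for (int i = 0; i < 5; i++) {
            centroids = iterate(points, centroids);
        }
        System.out.println(Arrays.deepToString(centroids));
    }
}
```

With these sample points, the centroids settle near the centers of the two groups after the first iteration. Real data usually needs more iterations and a stopping rule.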
In this recipe, we will use a dataset from the Human Development Report (HDR), organized by country. The HDR describes countries in terms of several human development measures. You can find the dataset at http://hdr.undp.org/en/statistics/data/.
This recipe will use K-means to cluster countries based on the HDR dimensions.
This recipe needs a Mahout installation. Follow the previous recipe if you have not already installed it.
This section demonstrates how to use the Mahout K-means algorithm to process a dataset.
The directory of the sample distribution is referred to as SAMPLE5_DIR.
Set the mahout.home property of the build.xml file in the sample distribution to MAHOUT_HOME.
The chapter5.KMeanSample.java class shows sample code for running the K-means algorithm using our own dataset.
public final class KMean extends AbstractJob {
The following code initializes the K-means algorithm with the appropriate values:
    public static void main(String[] args) throws Exception {
        Path output = new Path("output");
        Configuration conf = new Configuration();
        // Remove the output of any previous run.
        HadoopUtil.delete(conf, output);
        // k = 6 clusters, convergenceDelta = 0.5, maxIterations = 10
        run(conf, new Path("testdata"), output,
            new EuclideanDistanceMeasure(), 6, 0.5, 10);
    }
The following code shows how to set up K-means from Java code:
    public static void run(Configuration conf, Path input, Path output,
            DistanceMeasure measure, int k, double convergenceDelta,
            int maxIterations) throws Exception {
        Path directoryContainingConvertedInput =
            new Path(output, DIRECTORY_CONTAINING_CONVERTED_INPUT);

        // Convert the text input into Mahout vectors.
        log.info("Preparing Input");
        InputDriver.runJob(input, directoryContainingConvertedInput,
            "org.apache.mahout.math.RandomAccessSparseVector");

        // Pick k random points as the initial cluster centers.
        log.info("Running random seed to get initial clusters");
        Path clusters = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);
        clusters = RandomSeedGenerator.buildRandom(conf,
            directoryContainingConvertedInput, clusters, k, measure);

        log.info("Running KMeans");
        KMeansDriver.run(conf, directoryContainingConvertedInput, clusters,
            output, measure, convergenceDelta, maxIterations, true, false);

        // run ClusterDumper
        ClusterDumper clusterDumper = new ClusterDumper(
            finalClusterPath(conf, output, maxIterations),
            new Path(output, "clusteredPoints"));
        clusterDumper.printClusters(null);
    }
    ...
}
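The convergenceDelta and maxIterations arguments bound how long the algorithm keeps iterating. As a rough illustration, the plain-Java sketch below treats a run as converged once no cluster center has moved farther than the delta between two consecutive iterations. This is an assumption about the general technique, not a copy of Mahout's internal check, and the class name ConvergenceSketch and the sample centers are hypothetical.

```java
class ConvergenceSketch {

    // Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // True when no center moved farther than delta between two iterations.
    static boolean hasConverged(double[][] previous, double[][] updated, double delta) {
        for (int c = 0; c < previous.length; c++) {
            if (distance(previous[c], updated[c]) > delta) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical cluster centers before and after one iteration.
        double[][] before = {{1.0, 1.0}, {8.0, 8.0}};
        double[][] after = {{1.2, 1.1}, {8.0, 8.3}};
        System.out.println(hasConverged(before, after, 0.5));
        System.out.println(hasConverged(before, after, 0.25));
    }
}
```

A driver loop would typically stop at whichever comes first: this check returning true, or the iteration count reaching the maximum. A larger delta ends the run sooner at the cost of less settled clusters.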
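Build the sample by running the following command:
>ant mahout-build
Copy the resources/chapter5/countries4Kmean.data file to the MAHOUT_HOME/testdata directory.
Run the sample by running the following command:
>ant kmeans-run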
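If you want to try the sample with your own data, the input preparation step matters. Mahout's InputDriver converts text files of space-delimited floating point numbers into vectors, so the sketch below assumes one data point per line with its values separated by spaces. The layout of countries4Kmean.data itself is not shown here, so treat that as an assumption. The class name InputLineSketch and the sample lines are hypothetical, and the parsing is a plain-Java approximation rather than Mahout's own code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class InputLineSketch {

    // Parses one line of space-separated numbers into a dense vector.
    static double[] parseLine(String line) {
        String[] tokens = line.trim().split("\\s+");
        double[] vector = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            vector[i] = Double.parseDouble(tokens[i]);
        }
        return vector;
    }

    // Parses every non-blank line; blank lines are skipped.
    static List<double[]> parseAll(List<String> lines) {
        List<double[]> vectors = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                vectors.add(parseLine(line));
            }
        }
        return vectors;
    }

    public static void main(String[] args) {
        // Hypothetical input lines; real values would come from the data file.
        List<String> lines = Arrays.asList("0.5 1.25 3", "", "2 4.5 6");
        for (double[] v : parseAll(lines)) {
            System.out.println(Arrays.toString(v));
        }
    }
}
```

Checking a file with a parser like this before running the job can catch stray headers or non-numeric columns early.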