Hadoop provides a framework for implementing large-scale data processing applications. Users often implement their applications directly on MapReduce from scratch, or write them using a higher-level programming model such as Pig or Hive.
However, implementing some algorithms using MapReduce can be very complex. For example, algorithms such as collaborative filtering, clustering, and recommendations require complex code. This is further aggravated by the need to maximize parallel execution.
Mahout is an effort to implement well-known machine learning and data mining algorithms using the MapReduce framework, so that users can reuse them in their data processing applications without having to rewrite them from scratch. This recipe explains how to install Mahout.
Download the Mahout distribution and unpack it. We will refer to the unpacked directory as MAHOUT_HOME:

>tar xvf mahout-distribution-0.6.tar.gz
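The later steps refer to the unpacked directory as MAHOUT_HOME. As a convenience, you can record that location in an environment variable; the following is a minimal sketch, assuming a bash shell and that the tarball unpacks into a directory named mahout-distribution-0.6:

```shell
# Unpack the distribution (assumes the tarball is in the current directory)
tar xvf mahout-distribution-0.6.tar.gz

# Point MAHOUT_HOME at the unpacked directory and add its bin/ directory
# to the PATH, so the mahout command can be run from anywhere
export MAHOUT_HOME="$PWD/mahout-distribution-0.6"
export PATH="$MAHOUT_HOME/bin:$PATH"
```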
You can run and verify the Mahout installation by carrying out the following steps:
Download the sample input dataset (the synthetic control data used by this example) and copy it to the MAHOUT_HOME/testdata directory, from which the example job reads its input.
>bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
If all goes well, it will process the data and print out the resulting clusters:
12/06/19 21:18:15 INFO kmeans.Job: Running with default arguments
12/06/19 21:18:15 INFO kmeans.Job: Preparing Input
12/06/19 21:18:15 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
.....
12/06/19 21:19:38 INFO clustering.ClusterDumper: Wrote 6 clusters
12/06/19 21:19:38 INFO driver.MahoutDriver: Program took 83559 ms (Minutes: 1.39265)