Installing Mahout

Hadoop provides a framework for implementing large-scale data processing applications. Users often implement their applications directly on MapReduce from scratch, or write them using a higher-level programming model such as Pig or Hive.

However, implementing some algorithms using MapReduce can be very complex. For example, algorithms for collaborative filtering, clustering, and recommendations need complex code. The need to maximize parallel execution aggravates this further.

Mahout is an effort to implement well-known machine learning and data mining algorithms using the MapReduce framework, so that users can reuse them in their data processing without having to rewrite them from scratch. This recipe explains how to install Mahout.

How to do it...

This section demonstrates how to install Mahout.

  1. Download Mahout from https://cwiki.apache.org/confluence/display/MAHOUT/Downloads.
  2. Unpack the Mahout distribution by running the following command. We will call the resulting folder MAHOUT_HOME.
    >tar xvf mahout-distribution-0.6.tar.gz
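
The unpacking step can also be scripted. The following is a minimal sketch, not part of the Mahout distribution: `unpack_mahout` is a hypothetical helper name, and it assumes the archive is a gzip-compressed tarball that contains a single top-level folder.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Mahout): unpack a Mahout archive
# into a target directory and print the path of the extracted folder.
# Assumption: the archive contains exactly one top-level folder.
# Usage: unpack_mahout <archive.tar.gz> <target-dir>
unpack_mahout() {
    archive="$1"
    target="$2"

    # Refuse to continue if the archive is missing
    [ -f "$archive" ] || { echo "archive not found: $archive" >&2; return 1; }

    # Read the top-level folder name from the first entry in the archive
    top=$(tar tzf "$archive" | head -n 1 | cut -d/ -f1)
    [ -n "$top" ] || { echo "could not read archive: $archive" >&2; return 1; }

    mkdir -p "$target" || return 1
    tar xzf "$archive" -C "$target" || return 1

    # This path is what the recipe calls MAHOUT_HOME
    echo "$target/$top"
}
```

A possible use, assuming the archive from step 1 is in the current directory and `$HOME/opt` is where you want it installed, is `MAHOUT_HOME=$(unpack_mahout mahout-distribution-0.6.tar.gz "$HOME/opt")` followed by `export MAHOUT_HOME`.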
    

You can run and verify the Mahout installation by carrying out the following steps:

  1. Download the input data from http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data and copy it to MAHOUT_HOME/testdata.
  2. Run the K-means sample with the following command:
    >bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
    

    If all goes well, the job processes the input data and prints the clusters:

    12/06/19 21:18:15 INFO kmeans.Job: Running with default arguments
    12/06/19 21:18:15 INFO kmeans.Job: Preparing Input
    12/06/19 21:18:15 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    .....
    12/06/19 21:19:38 INFO clustering.ClusterDumper: Wrote 6 clusters
    12/06/19 21:19:38 INFO driver.MahoutDriver: Program took 83559 ms (Minutes: 1.39265)
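
The verification steps above can be sketched as two small shell helpers. Both names, `stage_testdata` and `clusters_written`, are hypothetical and not part of Mahout. The second helper assumes the job output contains a `ClusterDumper: Wrote N clusters` line like the one shown in the sample output.

```shell
#!/bin/sh
# Hypothetical helper: copy a downloaded data file into MAHOUT_HOME/testdata.
# Usage: stage_testdata <data-file> <mahout-home>
stage_testdata() {
    data="$1"
    mahout_home="$2"
    [ -f "$data" ] || { echo "data file not found: $data" >&2; return 1; }
    mkdir -p "$mahout_home/testdata" || return 1
    cp "$data" "$mahout_home/testdata/"
}

# Hypothetical helper: read job output on stdin and print the cluster count
# from the last "ClusterDumper: Wrote N clusters" line.
# Returns non-zero if no such line is found.
clusters_written() {
    count=$(sed -n 's/.*ClusterDumper: Wrote \([0-9][0-9]*\) clusters.*/\1/p' | tail -n 1)
    [ -n "$count" ] || return 1
    echo "$count"
}
```

For example, after `stage_testdata synthetic_control.data "$MAHOUT_HOME"`, you could run `bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job 2>&1 | clusters_written` from MAHOUT_HOME. The `2>&1` is there because the log lines may be written to standard error rather than standard output.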
    

How it works...

Mahout is a collection of MapReduce jobs, and you can run them using the mahout command. The preceding instructions installed Mahout and verified the installation by running a K-means sample that comes with the Mahout distribution.
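
Because every job is started through the same mahout command, a thin wrapper can save retyping the path. The following is only a sketch: `run_mahout` is a hypothetical name, and it assumes the MAHOUT_HOME environment variable points at the unpacked distribution.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of Mahout): run a Mahout program through
# the bin/mahout script of the distribution that MAHOUT_HOME points to.
# Usage: run_mahout <program-or-class> [args...]
run_mahout() {
    [ -n "$MAHOUT_HOME" ] || { echo "MAHOUT_HOME is not set" >&2; return 1; }
    [ -x "$MAHOUT_HOME/bin/mahout" ] || {
        echo "no executable found at $MAHOUT_HOME/bin/mahout" >&2
        return 1
    }
    # Run inside a subshell so the caller's working directory is unchanged;
    # the command is started from MAHOUT_HOME, as in the recipe.
    ( cd "$MAHOUT_HOME" && bin/mahout "$@" )
}
```

With this in place, the K-means sample from the recipe could be started from any directory with `run_mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job`.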
