The Hadoop distribution comes with several benchmarks. We can use them to verify our Hadoop installation and measure Hadoop's performance. This recipe introduces these benchmarks and explains how to run them.
Start the Hadoop cluster. You can run these benchmarks either on a cluster setup or on a pseudo-distributed setup.
Let us run the sort benchmark. The sort benchmark consists of two jobs. First, we generate some random data using the randomwriter Hadoop job, and then sort it using the sort sample.

Change the directory to HADOOP_HOME.

Run the randomwriter Hadoop job using the following command:

>bin/hadoop jar hadoop-examples-1.0.0.jar randomwriter -Dtest.randomwrite.bytes_per_map=100 -Dtest.randomwriter.maps_per_host=10 /data/unsorted-data
Here, the two parameters, test.randomwrite.bytes_per_map and test.randomwriter.maps_per_host, specify the amount of data generated by each map task and the number of map tasks run on each host, respectively.
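As a rough illustration (not part of the recipe itself), the total amount of data randomwriter produces can be estimated from these two parameters and the number of hosts in the cluster:

```shell
# Rough sizing sketch: total data written by randomwriter is approximately
# bytes_per_map * maps_per_host * number_of_hosts.
# HOSTS=1 assumes a pseudo-distributed (single-node) setup.
BYTES_PER_MAP=100
MAPS_PER_HOST=10
HOSTS=1
TOTAL=$((BYTES_PER_MAP * MAPS_PER_HOST * HOSTS))
echo "randomwriter will write roughly $TOTAL bytes"
```

With the tiny values used in this recipe that is only about 1 KB, which is convenient for a quick installation check; for a real performance measurement you would raise these values substantially.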
Sort the generated data using the sort sample with the following command:

>bin/hadoop jar hadoop-examples-1.0.0.jar sort /data/unsorted-data /data/sorted-data
Verify the sorted results by running the testmapredsort job with the following command:

>bin/hadoop jar hadoop-test-1.0.0.jar testmapredsort -sortInput /data/unsorted-data -sortOutput /data/sorted-data
Finally, when everything is successful, the following message will be displayed:
The job took 66 seconds. SUCCESS! Validated the MapReduce framework's 'sort' successfully.
First, the randomwriter application runs a Hadoop job to generate random data that can be used by the second sort program. Then, we verify the results through the testmapredsort job. If your computer has more capacity, you may run the initial randomwriter step with increased output sizes.
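The three steps of this recipe can be collected into a single wrapper script. The following is only a sketch: the JAR names and HDFS paths are the ones used in this recipe, and the HADOOP variable defaults to a dry run that merely prints each command; set HADOOP=bin/hadoop and run it from HADOOP_HOME to execute the benchmark for real.

```shell
#!/bin/sh
# Sketch of the full sort benchmark sequence from this recipe.
# HADOOP defaults to a dry run that only prints the commands;
# override it (HADOOP=bin/hadoop) to actually run them from HADOOP_HOME.
HADOOP=${HADOOP:-echo bin/hadoop}

# Command lines as used in the recipe.
GEN="jar hadoop-examples-1.0.0.jar randomwriter \
  -Dtest.randomwrite.bytes_per_map=100 \
  -Dtest.randomwriter.maps_per_host=10 /data/unsorted-data"
SORT="jar hadoop-examples-1.0.0.jar sort /data/unsorted-data /data/sorted-data"
CHECK="jar hadoop-test-1.0.0.jar testmapredsort \
  -sortInput /data/unsorted-data -sortOutput /data/sorted-data"

$HADOOP $GEN    # step 1: generate random input data
$HADOOP $SORT   # step 2: sort the generated data
$HADOOP $CHECK  # step 3: validate the sorted output
```

Running the steps from one script makes it easy to repeat the benchmark after changing the randomwriter parameters.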
Hadoop includes several other benchmarks, such as TeraSort, TestDFSIO, nnbench, and mrbench. More information about these benchmarks can be found at http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/.