Benchmarking HDFS

Running benchmarks is a good way to verify that your HDFS cluster is set up properly and performs as expected. DFSIO is a benchmark that ships with Hadoop and can be used to analyze the I/O performance of an HDFS cluster. This recipe shows how to use DFSIO to benchmark the read and write performance of an HDFS cluster.

Getting ready

You must set up and deploy HDFS and Hadoop MapReduce prior to running these benchmarks. Export the HADOOP_HOME environment variable to point to your Hadoop installation root directory:

>export HADOOP_HOME=/../hadoop-1.0.4

The benchmark programs are in the $HADOOP_HOME/hadoop-test-*.jar file.

How to do it...

The following steps show you how to run the write performance benchmark:

  1. To run the write performance benchmark, execute the following command from the $HADOOP_HOME directory. The -nrFiles parameter specifies the number of files to write and the -fileSize parameter specifies the size of each file in MB.
    >bin/hadoop jar $HADOOP_HOME/hadoop-test-*.jar TestDFSIO -write -nrFiles 5 -fileSize 100
    
  2. The benchmark writes its results to the console and also appends them to a file named TestDFSIO_results.log. You can specify your own result file name using the -resFile parameter.

The following steps show you how to run the read performance benchmark:

  1. The read performance benchmark uses the files written by the write benchmark, so run the write benchmark first and make sure the files it wrote still exist in HDFS before running the read benchmark.
  2. Execute the following command to run the read benchmark. Like the write benchmark, it prints the results to the console and appends them to a logfile:
    >bin/hadoop jar $HADOOP_HOME/hadoop-test-*.jar TestDFSIO -read -nrFiles 5 -fileSize 100
    
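Before running the read benchmark, you can confirm that the write benchmark's data files are still in HDFS. This is a minimal sketch; the path below assumes TestDFSIO's default base directory (/benchmarks/TestDFSIO), so adjust it if your installation is configured differently:

```shell
# Hedged sketch: list the data files produced by the write benchmark.
# /benchmarks/TestDFSIO is an assumed default base directory; verify it
# against your own cluster's configuration.
bin/hadoop fs -ls /benchmarks/TestDFSIO/io_data
```

If the listing is empty or the directory does not exist, rerun the write benchmark before attempting the read benchmark.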

To clean the files generated by these benchmarks, use the following command:

>bin/hadoop jar $HADOOP_HOME/hadoop-test-*.jar TestDFSIO -clean

How it works...

DFSIO executes a MapReduce job in which the map tasks write and read the files in parallel, while the reduce tasks collect and summarize the performance numbers.
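After several runs, the appended results log can be summarized with standard text tools. The following is a minimal sketch that assumes each run appends a line of the form "Throughput mb/sec: <value>" to the log; the sample line created here is illustrative, not real benchmark output:

```shell
# Hedged sketch: pull the throughput figures out of a DFSIO results log.
# The sample line below stands in for a real TestDFSIO_results.log entry.
printf 'Throughput mb/sec: 31.5\n' > sample_results.log

# Extract the value after the colon from every throughput line.
grep 'Throughput' sample_results.log | awk -F': ' '{print $2}'
```

Comparing these throughput figures across runs (for example, before and after a configuration change) makes it easier to spot regressions.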

There's more...

Running these tests together with monitoring systems can help you identify bottlenecks much more easily.

See also

  • The Running benchmarks to verify the Hadoop installation recipe in Chapter 3, Advanced Hadoop MapReduce Administration.