Plotting the Hadoop results using GNU Plot

Although Hadoop jobs can generate interesting analytics, making sense of those results often requires us to see the overall trends in the data, and we usually do that by plotting it.

The human eye is remarkably good at detecting patterns, so a plot often gives us a deeper understanding than the raw numbers do. Therefore, we often plot the results of Hadoop jobs using a plotting program.

This recipe explains how to use GNU Plot, which is a free and powerful plotting program, to plot Hadoop results.

Getting ready

  • This recipe assumes that you have followed the previous recipe, Calculating frequency distributions and sorting using MapReduce. If you have not done so, please follow that recipe first.
  • We will use the HADOOP_HOME variable to refer to the Hadoop installation folder.
  • Install the GNU Plot plotting program by following the instructions at http://www.gnuplot.info/.
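
Before moving on, it can help to confirm that the plotting program is reachable from the shell. The following is a minimal Python sketch of such a check. It assumes the executable is named gnuplot and is on your PATH, and the helper name find_gnuplot is made up for illustration.

```python
import shutil


def find_gnuplot(executable="gnuplot"):
    """Return the full path of the executable, or None if it is not on PATH."""
    # Assumption: the installed binary is called "gnuplot".
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_gnuplot()
    if path is None:
        print("gnuplot was not found on PATH; install it before continuing")
    else:
        print("gnuplot found at", path)
```

If the check reports that nothing was found, revisit the installation instructions before running the steps that follow.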


How to do it...

The following steps show how to plot Hadoop job results using GNU Plot.

  1. Download the results of the last recipe to a local computer by running the following command from HADOOP_HOME:
    > bin/hadoop dfs -get /data/output3/part-r-00000 2.data
    
  2. Copy all the *.plot files from CHAPTER_6_SRC to HADOOP_HOME.
  3. Generate the plot by running the following command from HADOOP_HOME:
    > gnuplot httpfreqdist.plot
    
  4. This generates a file called freqdist.png, which will look like the following:
    [Figure: freqdist.png, the frequency distribution of hits by URL, plotted on log-log axes]

The preceding plot uses a log-log scale. The first part of the distribution follows the Zipf (power law) distribution, which is common on the web. The last few links, which are the most popular ones, have much higher hit counts than a Zipf distribution would predict.
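
To see why a power law shows up as a straight line on log-log axes, note that if hits = C * rank^(-s), then log(hits) = log(C) - s * log(rank), which is a line with slope -s. The following Python sketch estimates that slope by ordinary least squares. It is only an illustration: the function name loglog_slope and the hit counts are made up, and the data is synthetic rather than the output of the previous recipe.

```python
import math


def loglog_slope(values):
    """Least-squares slope of log(value) against log(rank), with rank starting at 1."""
    xs = [math.log(rank) for rank in range(1, len(values) + 1)]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic, made-up hit counts that follow hits = 1000 / rank (exponent 1).
synthetic_hits = [1000.0 / rank for rank in range(1, 51)]
slope = loglog_slope(synthetic_hits)
print("estimated log-log slope:", round(slope, 3))
```

For this synthetic input the estimated slope comes out very close to -1. Real data would only approximate a line, and points that sit well above it correspond to items that are more popular than the power law predicts.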

A detailed discussion of this distribution is beyond the scope of this book. However, this plot demonstrates the kind of insight we can get by plotting analytical results. In most of the following recipes, we will use GNU Plot to plot and analyze the results.

How it works...

The following steps describe how plotting with GNU Plot works:

  1. You can find the source for the GNU Plot file in src/chapter6/resources/httpfreqdist.plot. The source for the plot looks like the following:
    set terminal png
    set output "freqdist.png"
    
    set title "Frequency Distribution of Hits by URL";
    set ylabel "Number of Hits";
    set xlabel "URLs (Sorted by hits)";
    set key left top
    set log y
    set log x
    
    plot "2.data" using 2 title "Frequency" with linespoints
  2. The first two lines define the output format. This example uses PNG, but GNU Plot supports many other output formats (terminals), such as on-screen display, PDF, and EPS.
  3. The next four lines define the title, the axis labels, and the position of the legend (key).
  4. The next two lines define the scale of each axis; this plot uses a log scale for both.
  5. The last line defines the plot. It asks GNU Plot to read the data from the 2.data file, take the y values from the second column (using 2), and draw them as points joined by lines (with linespoints). Because only one column is named, GNU Plot uses the row number as the x value. Columns must be separated by whitespace.
  6. If you want to plot one column against another, for example, column 1 on the x axis and column 2 on the y axis, write using 1:2 instead of using 2.
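
The difference between using 2 and using 1:2 can be sketched in Python as follows. This is not how GNU Plot is implemented; it only mimics which (x, y) pairs each form selects. The function names are hypothetical, and the sample rows are made up, with two numeric columns, so the real 2.data file may look different.

```python
def read_columns(lines):
    """Split each non-empty line on whitespace, as the plot file expects."""
    return [line.split() for line in lines if line.strip()]


def points_using_single(rows, col):
    """Mimic `using col`: y comes from the column, x is the row index from 0."""
    return [(index, float(row[col - 1])) for index, row in enumerate(rows)]


def points_using_pair(rows, xcol, ycol):
    """Mimic `using xcol:ycol`: both coordinates are read from the file."""
    return [(float(row[xcol - 1]), float(row[ycol - 1])) for row in rows]


# Made-up sample: two whitespace-separated numeric columns.
sample = ["1 3", "2 7", "3 42"]
rows = read_columns(sample)
print(points_using_single(rows, 2))   # x values are 0, 1, 2
print(points_using_pair(rows, 1, 2))  # x values come from column 1
```

With a single column the x values are generated, starting from 0, while with a pair they come from the data. One likely consequence, which is worth verifying on your own data, is that the first row has no position on a logarithmic x axis when the x value is the generated index 0.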

There's more...

You can learn more about GNU Plot at http://www.gnuplot.info/.
