Although Hadoop jobs can generate interesting analytics, making sense of those results and understanding the data in detail often requires seeing the overall trends. We often do that by plotting the data.
The human eye is remarkably good at detecting patterns, so a plot often gives a deeper understanding of the data than the raw numbers do. Therefore, we often plot the results of Hadoop jobs using a plotting program.
This recipe explains how to use GNU Plot, which is a free and powerful plotting program, to plot Hadoop results.
We will use the HADOOP_HOME variable to refer to the Hadoop installation folder.
The following steps show how to plot Hadoop job results using GNU Plot.
Copy the job output from HDFS to a local file named 2.data by running the following command from HADOOP_HOME:
> bin/hadoop dfs -get /data/output3/part-r-00000 2.data
Copy the *.plot files from CHAPTER_6_SRC to HADOOP_HOME.
Generate the plot by running the following command from HADOOP_HOME:
> gnuplot httpfreqdist.plot
This generates a file named freqdist.png that contains the plot.
The plot uses a log-log scale, and the first part of the distribution follows the Zipf (power law) distribution, which is commonly seen on the web. The last few most popular links have much higher rates than expected from a Zipf distribution.
A detailed discussion of this distribution is outside the scope of this book. However, this plot demonstrates the kind of insights we can get by plotting analytical results. In most of the later recipes, we will use GNU Plot to plot and analyze the results.
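As a rough illustration of why a power law is easy to spot on a log-log plot, the following Python sketch fits a straight line to log(frequency) against log(rank). The data is synthetic and hypothetical (frequency = 1000 / rank), and the loglog_slope helper is written only for this example. It also assumes that rank 1 is the most popular item, which may differ from the sort order used in the plot above.

```python
import math


def loglog_slope(ranks, freqs):
    """Least-squares slope of log(freq) against log(rank)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Hypothetical, synthetic data: an exact power law with exponent 1.
ranks = list(range(1, 101))
freqs = [1000.0 / r for r in ranks]

slope = loglog_slope(ranks, freqs)
print("log-log slope:", round(slope, 3))
```

For an exact power law the points fall on a straight line in log-log scale, and the slope of that line is the negative of the exponent. Real hit counts only follow the line approximately, which is why deviations such as unusually popular links stand out.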
The following steps describe how plotting with GNU plot works:
The source for the plot is in src/chapter6/resources/httpfreqdist.plot. It will look like the following:

set terminal png
set output "freqdist.png"
set title "Frequency Distribution of Hits by Url";
set ylabel "Number of Hits";
set xlabel "Urls (Sorted by hits)";
set key left top
set log y
set log x
plot "2.data" using 2 title "Frequency" with linespoints

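The plot command in the script tells GNU Plot to read the data from the 2.data file, to use the data in the second column of the file via using 2, and to plot it as a line with point markers (with linespoints). Columns must be separated by whitespace.
To plot one column against another, for example column 1 against column 2, you should write using 1:2 instead of using 2.
You can learn more about GNU Plot from http://www.gnuplot.info/.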
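To make the difference between using 2 and using 1:2 concrete, the following Python sketch mimics how the column selection behaves. The select_columns helper and the sample rows are hypothetical and written only for illustration; they are not part of GNU Plot or of the recipe's source. The sketch assumes numeric columns and, as GNU Plot does when only one column is given, takes the 0-based row number as the x value.

```python
def select_columns(lines, using):
    """Return (x, y) points from whitespace-separated rows.

    using="2"   -> y from column 2, x is the 0-based row number
    using="1:2" -> x from column 1, y from column 2
    """
    cols = [int(c) for c in using.split(":")]
    rows = [line for line in lines if line.strip()]  # skip blank lines
    points = []
    for index, line in enumerate(rows):
        fields = line.split()  # splits on any run of spaces or tabs
        if len(cols) == 1:
            points.append((float(index), float(fields[cols[0] - 1])))
        else:
            x = float(fields[cols[0] - 1])
            y = float(fields[cols[1] - 1])
            points.append((x, y))
    return points


# Hypothetical sample rows; spaces and tabs both work as separators.
sample = ["1 120", "2\t45", "3   7"]

single = select_columns(sample, "2")
paired = select_columns(sample, "1:2")
print(single)
print(paired)
```

With using 2, the x values come from the position of each row in the file, so the result depends on how the file is sorted. With using 1:2, the x values come from the data itself.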