This chapter provides a basic understanding of the Big Data ecosystem and a worked example of analyzing data stored in the Hadoop framework using Pentaho. By the end of this chapter, you will know how to turn diverse raw data sets into meaningful ones using Hadoop/Hive.
In this chapter we will cover the following topics:
Today, Big Data (http://en.wikipedia.org/wiki/Big_data) and Hadoop (http://hadoop.apache.org) have almost become synonymous with each other. Strictly speaking, the former is a concept, whereas the latter is a software platform.
Whenever we think of massive amounts of data, Google immediately pops into our heads. In fact, Big Data was first recognized in its true sense by Google, which published white papers on the Google File System (GFS) and MapReduce; two years later, in 2006, Hadoop was born. Similarly, after Google published papers on Sawzall and BigTable, the Pig, Hive, and HBase projects followed. Google will likely continue to drive this story forward.
Big Data is a combination of data management technologies that have evolved over a period of time. Big Data is a term used to describe a large collection of data (or data sets) that can be structured, unstructured, or mixed, and that grows so quickly that it becomes difficult to manage using conventional databases or statistical tools. Another way to define the term is as any data source that shares the following three characteristics (known as the 3Vs):

- Extremely large volumes of data (Volume)
- A high speed of data flowing in and out (Velocity)
- A wide range of data types and sources (Variety)
Sometimes, two more Vs are added for variability and value. Some interesting statistics of data explosion are as follows:
Interestingly, 80 percent of Big Data is unstructured, and businesses now need fast, reliable, and deeper data insight.
Hadoop, an open source project from the Apache Software Foundation, has become the de facto standard for storing, processing, and analyzing hundreds of terabytes, even petabytes, of data. The framework was originally developed by Doug Cutting and Mike Cafarella in 2005, and Cutting named it after his son's toy elephant. Written in Java, the framework is optimized to handle massive amounts of structured and unstructured data in parallel, using MapReduce on top of a distributed filesystem inspired by GFS, all running on inexpensive commodity hardware.
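To make the MapReduce idea concrete, here is a minimal sketch of its data flow using a word count, the canonical example. This is plain Python, not Hadoop's Java API; the function names and the sequential driver are illustrative only. In a real cluster, the map and reduce phases run in parallel across many nodes.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in an input split."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all counts for one word into a total."""
    return (key, sum(values))

def run_job(documents):
    # Illustrative driver: Hadoop would distribute these steps; here we
    # run them sequentially to show how data moves between the phases.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

counts = run_job(["big data big ideas", "data drives ideas"])
print(counts)  # {'big': 2, 'data': 2, 'ideas': 2, 'drives': 1}
```

The key point is that each phase sees only independent chunks of data, which is what lets Hadoop scale the same job from one machine to thousands.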
Hadoop is a highly scalable model and supports near-linear scaling: adding nodes adds capacity proportionally. It runs on commodity hardware, which costs roughly one tenth as much as enterprise-grade hardware, and it uses open source software, so it can deliver comparable throughput at a fraction of the cost. It is also distributed and reliable: by default, it keeps three replicas of every data block, and this replication factor can be configured.
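The replication factor mentioned above is set per cluster (or per file) through the `dfs.replication` property in `hdfs-site.xml`; the snippet below shows the default value of 3 made explicit:

```xml
<!-- hdfs-site.xml: dfs.replication controls how many copies HDFS
     keeps of each data block; 3 is the default. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```

Raising this value increases fault tolerance at the cost of disk space; lowering it trades safety for capacity.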
Over a period of time, Hadoop has grown into a full-fledged ecosystem with the addition of many new open source projects such as Hive, Pig, HBase, and ZooKeeper.
Many Internet and social networking companies, such as Yahoo!, Facebook, Amazon, eBay, Twitter, and LinkedIn, use Hadoop. Yahoo! Search Webmap was the largest Hadoop application when it went into production in 2008, running on a Linux cluster with more than 10,000 cores. Today, Yahoo! has more than 40,000 nodes running across more than 20 Hadoop clusters.
Facebook's Hadoop clusters include the largest known single HDFS (Hadoop Distributed File System) cluster, with more than 100 PB of physical disk space in a single HDFS filesystem.