Chapter 3. Churning Big Data with Pentaho

This chapter provides a basic understanding of the Big Data ecosystem and an example of analyzing data stored in the Hadoop framework using Pentaho. By the end of this chapter, you will know how to turn diverse raw data into meaningful data sets using Hadoop and Hive.

In this chapter we will cover the following topics:

  • Overview of Big Data and Hadoop
  • Hadoop architecture
  • Big Data capabilities of Pentaho Data Integration (PDI)
  • Working with PDI and Hortonworks Data Platform, a Hadoop distribution
  • Loading data from HDFS to Hive using PDI
  • Querying data using Hive's SQL-like language

An overview of Big Data and Hadoop

Today, Big Data (http://en.wikipedia.org/wiki/Big_data) and Hadoop (http://hadoop.apache.org) have almost become synonymous. Strictly speaking, however, the former is a concept, whereas the latter is a software framework.

Big Data

Whenever we think of massive amounts of data, Google immediately pops into our heads. In fact, Google was the first to confront Big Data in its true sense: it published white papers on the Google File System (GFS) in 2003 and on MapReduce in 2004, and two years later, Hadoop was born. Similarly, after Google published papers describing Sawzall and BigTable, Pig, Hive, and HBase were born. In all likelihood, Google will continue to drive this story forward.

Big Data is a combination of data management technologies that have evolved over time. The term describes a large collection of data (or data sets), whether structured, unstructured, or mixed, that grows so large and so quickly that it becomes difficult to manage using conventional databases or statistical tools. Another way to define the term is as any data source that shares all three of the following characteristics (known as the 3Vs):

  • Extremely large volume of data
  • Extremely high velocity of data
  • Extremely wide variety of data

Sometimes, two more Vs, variability and value, are added. Some interesting statistics illustrating this data explosion are as follows:

  • There are 2 billion Internet users in the world
  • There were 6.8 billion mobile phone subscriptions by the end of 2012 (more people have cellphones than toilets!)
  • Twitter processes 8 TB of data every day (this translates to roughly 100 MB per second, whereas a typical hard disk writes at about 80 MB per second)
  • Facebook processes more than 500 TB of data every day!
  • 90 percent of the world's data has been generated over the last two years

Interestingly, about 80 percent of Big Data is unstructured, and businesses now need fast, reliable, and deep insights into this data.

Hadoop

Hadoop, an open source project from the Apache Software Foundation, has become the de facto standard for storing, processing, and analyzing hundreds of terabytes, and even petabytes, of data. The framework was originally developed by Doug Cutting and Mike Cafarella in 2005; Cutting named it after his son's toy elephant. Written in Java, Hadoop is optimized to handle massive amounts of structured and unstructured data in parallel, using a MapReduce engine on top of a GFS-inspired distributed filesystem (HDFS), all running on inexpensive commodity hardware.

Tip

Hadoop is used for Big Data; it complements OLTP and OLAP. It is certainly not a replacement for a relational database.
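
To make the MapReduce idea concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API (the standard introductory example, not specific to this chapter's Pentaho workflow). Mappers tokenize each line of their input split and emit (word, 1) pairs; reducers then sum the counts for each distinct word:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits a (word, 1) pair for every word in its input split
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts collected for each distinct word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a JAR, the job is submitted with hadoop jar wordcount.jar WordCount <input path> <output path>; Hadoop splits the input across the cluster, runs the mappers in parallel, shuffles the intermediate pairs to the reducers, and retries failed tasks automatically.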

Hadoop is a highly scalable model and scales out nearly linearly as nodes are added. It runs on commodity hardware, which costs roughly one-tenth as much as enterprise-grade hardware, and it uses open source software; so, for the same cost, it can offer roughly ten times the processing power. It is also distributed and reliable: by default, it keeps three copies of each block of data, a replication factor that can be configured.
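
This replication factor is an ordinary HDFS setting. As a minimal sketch, the dfs.replication property in hdfs-site.xml controls how many copies of each block HDFS keeps (the shipped default is 3):

<property>
  <!-- Number of copies HDFS keeps of each data block; the default is 3 -->
  <name>dfs.replication</name>
  <value>3</value>
</property>

Raising the value buys extra durability and read locality at the cost of disk space, and the factor can also be changed for individual files or directories with the hadoop fs -setrep command.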

Over a period of time, Hadoop has grown into a full-fledged ecosystem, picking up open source companions such as Hive, Pig, HBase, and ZooKeeper.

Many Internet and social networking companies, such as Yahoo!, Facebook, Amazon, eBay, Twitter, and LinkedIn, use Hadoop. Yahoo! Search Webmap was the largest Hadoop application when it went into production in 2008, running on a Linux cluster with more than 10,000 cores. As of today, Yahoo! has more than 40,000 nodes running in more than 20 Hadoop clusters.

Facebook's Hadoop clusters include the largest known single HDFS (Hadoop Distributed File System) cluster, with more than 100 PB of physical disk space in a single HDFS filesystem.
