Hadoop is a large-scale, batch-oriented data processing system that uses MapReduce for computation and HDFS for data storage. HDFS is a highly reliable distributed filesystem with a configurable replication mechanism, designed to be deployed on low-cost commodity hardware.
HDFS breaks files into blocks of 64 MB by default, and each block is replicated three times. Both the block size and the replication factor are configurable and should be tuned to balance storage overhead against fault tolerance for your data. The following diagram depicts a typical two-node Hadoop cluster set up on two bare-metal machines, although you can use virtual machines as well.
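As a concrete, minimal sketch of these settings, the following Java snippet uses the standard org.apache.hadoop.fs.FileSystem API to override the replication factor and block size for a single file; the /data/sample.txt path is purely illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Per-file override of the cluster defaults:
            // 64 MB blocks, replication factor 2 (path is illustrative)
            FSDataOutputStream out = fs.create(
                    new Path("/data/sample.txt"),
                    true,                // overwrite if the file exists
                    4096,                // I/O buffer size in bytes
                    (short) 2,           // replication factor for this file
                    64L * 1024 * 1024);  // block size: 64 MB
            out.writeUTF("hello HDFS");
            out.close();
        }
    }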
One of these machines is the master node, and the other is the worker/slave node. The master node runs the JobTracker and the NameNode. In the preceding diagram, the master node also acts as a slave because there are only two nodes in the illustration. There can be multiple slave nodes, but we have taken a single node for illustration purposes. A slave node, also known as a worker node, can act as both a DataNode and a TaskTracker, though you can configure data-only worker nodes for storage-intensive operations and compute-only worker nodes for CPU-intensive operations. Starting this Hadoop cluster is a two-step process: first start the HDFS daemons (NameNode and DataNode), then start the MapReduce daemons (JobTracker and TaskTracker), conventionally with the start-dfs.sh and start-mapred.sh scripts.
In a big cluster, nodes have dedicated roles; for example, HDFS is managed by a dedicated server (the NameNode) that hosts the filesystem metadata: an edits logfile that is merged into the fsimage file when the NameNode starts up. A secondary NameNode (SNN) merges fsimage with edits at regular, configurable intervals, a process known as checkpointing. The primary NameNode is a single point of failure for the cluster; the SNN reduces the risk by keeping a recent checkpoint, which minimizes downtime and data loss after a failure.
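The checkpoint trigger is an ordinary Hadoop configuration setting. The following minimal Java sketch reads the two Hadoop 1 properties involved, assuming the stock defaults of one hour and 64 MB of edits:

    import org.apache.hadoop.conf.Configuration;

    public class CheckpointSettings {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Seconds between two consecutive SNN checkpoints (default: 1 hour)
            long period = conf.getLong("fs.checkpoint.period", 3600);
            // Size of the edits log (in bytes) that forces an early checkpoint
            long size = conf.getLong("fs.checkpoint.size", 64L * 1024 * 1024);
            System.out.printf("Checkpoint every %d s or every %d bytes of edits%n",
                    period, size);
        }
    }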
Similarly, a standalone JobTracker server manages the scheduling of jobs submitted to the cluster; it is also a single point of failure. The JobTracker monitors all the TaskTracker nodes and, if a task fails, automatically relaunches it, possibly on a different TaskTracker node.
For the effective scheduling of tasks, every Hadoop-supported filesystem should provide location awareness, meaning that it should expose the name of the rack (more precisely, of the network switch) where a worker node resides. Hadoop applications can use this information when executing work on a node. HDFS uses this approach to place replicas efficiently across different racks, so that data can still be served from another rack even when an entire rack goes down.
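Rack awareness is normally supplied through a topology script (topology.script.file.name), but it can also be coded in Java by implementing the org.apache.hadoop.net.DNSToSwitchMapping interface and registering the class through topology.node.switch.mapping.impl. Here is a minimal sketch; the host-naming pattern it parses is a made-up assumption you would adapt to your own hosts:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.net.DNSToSwitchMapping;

    // Maps hostnames such as "host-2-7" to the rack path "/rack2".
    // The naming convention here is hypothetical.
    public class NamePatternRackMapping implements DNSToSwitchMapping {
        @Override
        public List<String> resolve(List<String> names) {
            List<String> racks = new ArrayList<String>(names.size());
            for (String name : names) {
                String[] parts = name.split("-");
                // Fall back to a default rack when the name does not match
                racks.add(parts.length == 3 ? "/rack" + parts[1] : "/default-rack");
            }
            return racks;
        }
    }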
The filesystem sits at the bottom and the MapReduce engine is stacked above it. The MapReduce engine consists of a single JobTracker, which acts like an order taker. Clients submit their MapReduce job requests to the JobTracker, which in turn distributes the work to available TaskTrackers in the cluster, with the intention of keeping the work as close to the data as possible. If work cannot be started on the worker node where the data resides, priority is given to nodes in the same rack, with the intention of reducing network traffic. If a TaskTracker fails or times out, that part of the job is rescheduled.
To ensure reliability, the TaskTracker itself remains a lightweight process: it spawns a separate process for each task it runs, so a failing task cannot bring down the TaskTracker. The TaskTracker sends periodic heartbeats to the JobTracker to report its status. The current status and information of the JobTracker and TaskTrackers can be viewed from a web browser (by default, the JobTracker UI listens on port 50030 and the NameNode UI on port 50070).
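To illustrate the client side of this flow, here is a minimal sketch that submits a job through the classic Hadoop 1 mapred API; it keeps the default identity mapper and reducer purely for brevity, and the /input and /output paths are hypothetical:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class SubmitJob {
        public static void main(String[] args) throws Exception {
            JobConf job = new JobConf(SubmitJob.class);
            job.setJobName("pass-through");
            // Defaults apply: TextInputFormat plus the identity mapper and
            // reducer, so the output types match the (offset, line) input records
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.setInputPaths(job, new Path("/input"));
            FileOutputFormat.setOutputPath(job, new Path("/output"));
            // Submits to the JobTracker and blocks until completion, printing
            // the progress reported through the TaskTrackers' heartbeats
            JobClient.runJob(job);
        }
    }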
At the time of writing, Hadoop 2 was still in alpha, but its significant new enhancements are worth mentioning here. Hadoop 2 has three major enhancements, namely HDFS failover, NameNode federation, and MapReduce 2.0 (MRv2), also known as YARN. A few distributions, such as CDH4 (Cloudera's Distribution of Hadoop) and HDP2 (Hortonworks Data Platform), bundle Hadoop 2.0. Everywhere else, Hadoop still refers to Hadoop 1.
Hadoop, together with its companion projects, makes up a complete ecosystem. This ecosystem is continuously evolving, driven by a large number of open source contributors. The following diagram gives a high-level overview of the Hadoop ecosystem:
The Hadoop ecosystem is logically divided into five layers that are self-explanatory. Some of the ecosystem components are explained as follows:
While setting up the Hadoop ecosystem, you can either do the setup on your own or use third-party distributions from vendors such as Amazon, MapR, Cloudera, Hortonworks, and others. Third-party distributions may cost you a little extra, but they take away the complexity of maintaining and supporting the system, letting you focus on the business problem.
For the purpose of learning the basics of Pentaho and its Big Data features, we will use a Hortonworks Hadoop distribution virtual machine. Hortonworks Sandbox is a single-node implementation of the enterprise-ready Hortonworks Data Platform (HDP). HDP combines the most useful and stable version of Apache Hadoop and its related projects into a single tested and certified package. Included in this implementation are some good tutorials and easy-to-use tools accessible via a web browser.
We will use this VM as our working Hadoop framework throughout the book. Once you are proficient with this tool, you can apply your skills to a large-scale Hadoop cluster.
See Appendix A, Big Data Sets, for the complete installation, configuration, and sample data preparation guide.