The Hadoop architecture

Hadoop is a large-scale, batch-oriented data processing system that uses MapReduce for computation and HDFS for data storage. HDFS is a highly reliable distributed filesystem with a configurable replication mechanism, designed to be deployed on low-cost commodity hardware.

HDFS breaks files into blocks of 64 MB by default, and each block is replicated three times. The replication factor is configurable and should be tuned to balance durability against storage and network overhead for your data. The following diagram depicts a typical two-node Hadoop cluster set up on two bare metal machines, although you can use virtual machines as well.
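
As a quick illustration (not part of the original text), the following minimal Java sketch uses the standard Hadoop FileSystem API to write a file with an explicit replication factor and block size. The path is hypothetical, and the values simply mirror the 64 MB block and three-replica defaults mentioned above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();           // picks up core-site.xml/hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);                // connects to the configured default filesystem (the NameNode)

        Path file = new Path("/tmp/replication-demo.txt");   // hypothetical path
        short replication = 3;                               // the default replication factor
        long blockSize = 64L * 1024 * 1024;                  // 64 MB, the classic HDFS default block size

        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(file, true, 4096, replication, blockSize);
        out.writeUTF("each block of this file is replicated three times");
        out.close();

        System.out.println("Replication: " + fs.getFileStatus(file).getReplication());
    }
}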

Figure: The Hadoop architecture (a typical two-node cluster)

One of these nodes is the master node, and the other is the worker/slave node. The master node runs the JobTracker and the NameNode. In the preceding diagram, the master node also acts as a slave because there are only two nodes in the illustration. There can be many slave nodes, but we use a single one for illustration purposes. A slave node, also known as a worker node, can act as both a DataNode and a TaskTracker, though you can configure data-only worker nodes for data-intensive operations and compute-only worker nodes for CPU-intensive operations. Starting this Hadoop cluster is done in two steps: first start the HDFS daemons (NameNode and DataNode), and then start the MapReduce daemons (JobTracker and TaskTracker).

In a big cluster, nodes have dedicated roles; for example, HDFS is managed by a dedicated NameNode server that hosts the filesystem metadata, including the edits logfile, which is merged with the fsimage file when the NameNode starts up. A secondary NameNode (SNN) merges fsimage with edits at regular, configurable intervals (checkpoints or snapshots). The primary NameNode is a single point of failure for the cluster; the SNN is not a hot standby, but it reduces the risk by minimizing downtime and loss of data.

Similarly, a standalone JobTracker server manages job scheduling whenever a job is submitted to the cluster. It is also a single point of failure. It monitors all the TaskTracker nodes and, if a task fails, it relaunches that task automatically, possibly on a different TaskTracker node.

For the effective scheduling of tasks, every Hadoop-supported filesystem should provide location awareness, meaning that it should expose the name of the rack (more precisely, of the network switch) where each worker node sits. Hadoop applications can use this information when executing work on the respective node. HDFS uses this approach to replicate data efficiently by keeping copies on different racks, so that even if one rack goes down, the data can still be served from another rack.

The filesystem is at the bottom and the MapReduce engine is stacked above it. The MapReduce engine consists of a single JobTracker, which acts like an order taker. Clients submit their MapReduce job requests to the JobTracker, which in turn hands the work to available TaskTrackers in the cluster, with the intention of keeping the work as close to the data as possible. If work cannot be started on the worker node where the data resides, priority is given to nodes in the same rack, with the intention of reducing network traffic. If a TaskTracker server fails or times out for some reason, that part of the job is rescheduled. The TaskTracker itself stays a lightweight process to ensure reliability; it achieves this by spawning separate child processes for the tasks that arrive to be processed on the respective node. The TaskTracker sends periodic heartbeats to the JobTracker to update its status. The current status and information of the JobTracker and TaskTrackers can be viewed from a web browser.
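
To make this flow concrete, here is a minimal, classic WordCount driver written against the original org.apache.hadoop.mapred API from the JobTracker/TaskTracker era described above. It is only a sketch: the class name is arbitrary, and the input and output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
    // The map phase: emit (word, 1) for every token in the input line.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                output.collect(word, ONE);
            }
        }
    }

    // The reduce phase: sum the counts for each word.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, java.util.Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);   // client-side description of the job
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        // Submits the job to the JobTracker and waits for completion.
        JobClient.runJob(conf);
    }
}

Packaged into a JAR and run with the hadoop jar command, this driver submits the job to the JobTracker, which then schedules the map tasks on TaskTrackers that hold, or are at least in the same rack as, the input blocks.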

At the time of writing, Hadoop 2 was still in alpha, but it is worth mentioning its significant new enhancements here. Hadoop 2 has three major enhancements, namely HDFS failover, NameNode federation, and MapReduce 2.0 (MRv2), also known as YARN. A few distributions, such as CDH4 (Cloudera's Distribution of Hadoop) and HDP2 (Hortonworks Data Platform), bundle Hadoop 2.0. Everywhere else in this book, Hadoop refers to Hadoop 1.

The Hadoop ecosystem

Hadoop, along with its community of related projects, forms a complete ecosystem. This ecosystem is continuously evolving with a large number of open source contributors. The following diagram gives a high-level overview of the Hadoop ecosystem:

Figure: The Hadoop ecosystem

The Hadoop ecosystem is logically divided into five layers that are self-explanatory. Some of the ecosystem components are explained as follows:

  • Data Storage: This is where the raw data resides. Hadoop supports multiple filesystems, and connectors are also available for data warehouses (DW) and relational databases, as shown here:
    • HDFS: This is the distributed filesystem that comes with the Hadoop framework. It uses the TCP/IP layer for communication. An advantage of using HDFS is its data intelligence: the master knows which data resides on which worker node.
    • Amazon S3: This is a filesystem from Amazon Web Services (AWS), which is Internet-based storage. As it is fully controlled by AWS in their cloud, data intelligence is not possible for the Hadoop master, and efficiency can be lower because of network traffic.
    • MapR-FS: This provides higher availability, transactionally correct snapshots, and higher performance than HDFS. MapR-FS comes with MapR's Hadoop distribution.
    • HBase: This is a column-oriented, multidimensional database modeled on Google's BigTable. Built on top of HDFS, it maintains data in sorted partitions and can therefore serve data efficiently in sorted order (see the client sketch after this list).
  • Data Access: This layer helps in accessing the data from various data stores, shown as follows:
    • Hive: This is a data warehouse infrastructure with SQL-like querying capabilities on Hadoop data sets. Its power lies in its SQL interface, which makes it quick to check and validate data and has made it quite popular in the developer community.
    • Pig: This is a data flow engine and multiprocess execution framework. Its scripting language is called Pig Latin. The Pig interpreter translates these scripts into MapReduce jobs, so even business users can execute the scripts and study the resulting analysis of the data in the Hadoop cluster.
    • Avro: This is a data serialization system that provides rich data structures, a container file to store persistent data, remote procedure calls, and so on. It uses JSON to define data types, and data is serialized into a compact binary format.
    • Mahout: This is machine learning software with core algorithms for (user- and item-based) recommendation or batch-based collaborative filtering, classification, and clustering. The core algorithms are implemented on top of Apache Hadoop using the MapReduce paradigm, though Mahout can also be used outside the Hadoop world as a math library focused on linear algebra and statistics.
    • Sqoop: This is designed to transfer bulk data efficiently between Apache Hadoop and structured data stores such as relational databases; you could also call it an ETL tool for Hadoop. Sqoop became a top-level Apache project in March 2012. It uses MapReduce to import and export data, which gives it parallel processing as well as fault tolerance.
  • Management layer: This comprises tools that assist in administering the Hadoop infrastructure, shown as follows:
    • Oozie: This is a workflow scheduler system to manage Apache Hadoop jobs. It is a server-based workflow engine, where a workflow is a collection of actions, such as MapReduce, Pig, Hive, and Sqoop jobs, arranged in a control-dependency DAG (Directed Acyclic Graph). Oozie is a scalable, reliable, and extensible system.
    • Elastic MapReduce (EMR): This is Amazon's hosted Hadoop offering; it automates provisioning the Hadoop cluster, running and terminating jobs, and handling data transfers between EC2 and S3.
    • Chukwa: This is an open source data collection system for monitoring large, distributed systems. Chukwa is built on top of HDFS and the MapReduce framework, and inherits Hadoop's scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring, and analyzing results to make the best use of the collected data.
    • Flume: This is a distributed service comprising multiple agents, which essentially sit outside the Hadoop cluster, to collect and aggregate streaming data (for example, log data) efficiently. It has a fault-tolerant mechanism, which lets it act as a reliable streaming data feed into HDFS for real-time analysis.
    • ZooKeeper: This is another Apache Software Foundation project, which provides open source distributed coordination and synchronization services as well as a naming registry for large distributed systems. ZooKeeper's architecture supports high availability through redundant services. It uses a hierarchical, filesystem-like namespace, is fault tolerant and high performing, and facilitates loose coupling. ZooKeeper is already used by many Apache projects, such as HDFS and HBase, and is run in production by Yahoo!, Facebook, and Rackspace.
  • Data Analytics: This is the area where a lot of third-party vendors provide various proprietary as well as open source tools. A few of them are as follows:
    • Pentaho: This provides data integration (Kettle), analytics, reporting, dashboards, and predictive analytics directly from the Hadoop nodes. It is available with enterprise support as well as a community edition.
    • Storm: This is a free and open source distributed, fault tolerant, and real-time computation system for unbounded streams of data.
    • Splunk: This is an enterprise application that can perform real-time and historical searches, reporting, and statistical analysis. It is also available in a cloud-based flavor, Splunk Storm.
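
As promised in the HBase item above, here is a minimal Java client sketch (not from the original text) that stores and reads back a single cell. It is written against the older HTable-style HBase client API, which matches the Hadoop 1 era covered here; newer HBase releases use Connection and Table instead. The table name, column family, and row key are all hypothetical, and the table is assumed to exist already.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // picks up hbase-site.xml from the classpath
        HTable table = new HTable(conf, "weblogs");          // hypothetical, pre-created table

        // Rows are keyed and kept in sorted order; values live in column families.
        Put put = new Put(Bytes.toBytes("2013-06-01|host42"));
        put.add(Bytes.toBytes("hits"), Bytes.toBytes("count"), Bytes.toBytes("17"));
        table.put(put);

        // Read the cell back by row key, family, and qualifier.
        Get get = new Get(Bytes.toBytes("2013-06-01|host42"));
        Result result = table.get(get);
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("hits"), Bytes.toBytes("count"))));

        table.close();
    }
}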

Note

The real advantages of Hadoop are its scalability, reliability, open source licensing, distributed data, the principle that more data beats complex algorithms, and schema on read. It also has non-batch components such as HBase.

While setting up the Hadoop ecosystem, you can either do the setup on your own or use third-party distributions from vendors such as Amazon, MapR, Cloudera, Hortonworks, and others. Third-party distributions may cost you a little extra, but they take away the complexity of maintaining and supporting the system so that you can focus on the business problem.

Hortonworks Sandbox

For the purpose of learning the basics of Pentaho and its Big Data features, we will use a Hortonworks Hadoop distribution virtual machine. Hortonworks Sandbox is a single-node implementation of the enterprise-ready Hortonworks Data Platform (HDP). HDP combines the most useful and stable version of Apache Hadoop and its related projects into a single tested and certified package. Included in this implementation are some good tutorials and easy-to-use tools accessible via a web browser.

We will use this VM as our working Hadoop framework throughout the book. Once you are proficient with this tool, you can apply your skills to a large-scale Hadoop cluster.

See Appendix A, Big Data Sets, for the complete installation, configuration, and sample data preparation guide.
