Chapter 1. Getting Hadoop Up and Running in a Cluster

In this chapter, we will cover:

  • Setting up Hadoop on your machine
  • Writing the WordCount MapReduce sample, bundling it, and running it using standalone Hadoop
  • Adding the combiner step to the WordCount MapReduce program
  • Setting up HDFS
  • Using the HDFS monitoring UI
  • HDFS basic command-line file operations
  • Setting Hadoop in a distributed cluster environment
  • Running the WordCount program in a distributed cluster environment
  • Using the MapReduce monitoring UI

Introduction

For many years, users who wanted to store and analyze data would store it in a database and process it via SQL queries. The Web has changed most of the assumptions of this era. Web data is unstructured and large, and databases can neither capture it in a schema nor scale to store and process it.

Google was one of the first organizations to face this problem: it wanted to download the whole of the Internet and index it to support search queries. Google built a framework for large-scale data processing, borrowing from the "map" and "reduce" functions of the functional programming paradigm, and called the paradigm MapReduce.
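The functional idea behind the paradigm can be sketched in plain Java, outside Hadoop: a map step turns each input word into a (word, 1) pair, and a reduce step sums the values that share a key. This is the word-count computation used throughout this chapter; the class and method names here are illustrative, not part of the Hadoop API.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

// A plain-Java sketch of the map/reduce idea (not the Hadoop API).
public class MapReduceSketch {
    public static Map<String, Integer> wordCount(String text) {
        return Arrays.stream(text.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                // map phase: each word becomes a (word, 1) pair;
                // reduce phase: counts for the same word are summed
                .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum));
    }

    public static void main(String[] args) {
        System.out.println(wordCount("to be or not to be"));
    }
}
```

Hadoop distributes exactly this pattern: the map calls run in parallel over splits of the input, and the framework groups the pairs by key before the reduce calls run.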

Hadoop is the most widely known and widely used implementation of the MapReduce paradigm. This chapter introduces Hadoop, describes how to install Hadoop, and shows you how to run your first MapReduce job with Hadoop.

A Hadoop installation consists of four types of nodes: a NameNode, DataNodes, a JobTracker, and TaskTrackers. The HDFS nodes (the NameNode and DataNodes) provide a distributed filesystem, while the JobTracker manages jobs and the TaskTrackers run tasks that perform parts of each job. Users submit MapReduce jobs to the JobTracker, which runs the Map and Reduce parts of the job on TaskTrackers, collects the results, and finally emits them.

Hadoop provides three installation choices:

  • Local mode: This is an unzip-and-run mode that gets you started right away, where all parts of Hadoop run within the same JVM
  • Pseudo-distributed mode: This mode runs the different parts of Hadoop as different Java processes, but within a single machine
  • Distributed mode: This is the real setup that spans multiple machines

We will discuss the local mode in the first three recipes, and the pseudo-distributed and distributed modes in the last three recipes.
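Moving between these modes is largely a matter of configuration: the pseudo-distributed and distributed modes point Hadoop at running daemons instead of the in-process defaults used by the local mode. A minimal sketch of the relevant entries follows; the host and port values shown are commonly used defaults for a pseudo-distributed setup, not requirements.

```xml
<!-- conf/core-site.xml: use an HDFS daemon instead of the local filesystem -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: submit jobs to a JobTracker instead of running
     them in-process -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```

In a fully distributed cluster, `localhost` is replaced with the hostnames of the machines running the NameNode and JobTracker, as covered in the later recipes.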
