In this chapter, we will cover:
For many years, users who wanted to store and analyze data would load it into a database and process it with SQL queries. The Web has changed most of the assumptions of that era. Web data is unstructured and very large, and conventional databases can neither capture it in a fixed schema nor scale to store and process it.
Google was one of the first organizations to face this problem: it wanted to download the whole of the Internet and index it to support search queries. To do so, it built a framework for large-scale data processing, borrowing the "map" and "reduce" functions from the functional programming paradigm, and called the paradigm MapReduce.
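The map and reduce idea can be sketched in a few lines of plain Java. The following is a minimal, single-process illustration using the canonical word-count example; the class and method names are hypothetical, and everything Hadoop would distribute across machines (splitting input, shuffling intermediate pairs, running reducers in parallel) happens here in one JVM.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// A single-process sketch of the MapReduce flow: map -> shuffle -> reduce.
public class WordCountSketch {

    // "Map" phase: emit a (word, 1) pair for every word in a line of input.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // "Shuffle" and "reduce" phases: group the pairs by key, then sum each group.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] input = {"the quick brown fox", "the lazy dog", "the fox"};
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : input) {
            intermediate.addAll(map(line)); // map each input line independently
        }
        System.out.println(reduce(intermediate));
        // {brown=1, dog=1, fox=2, lazy=1, quick=1, the=3}
    }
}
```

Because each call to `map` depends only on its own line, and each key's reduction depends only on that key's pairs, both phases can be parallelized across machines, which is precisely what Hadoop automates.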
Hadoop is the most widely known and widely used implementation of the MapReduce paradigm. This chapter introduces Hadoop, describes how to install Hadoop, and shows you how to run your first MapReduce job with Hadoop.
A Hadoop installation consists of four types of nodes: a NameNode, DataNodes, a JobTracker, and TaskTrackers. The HDFS nodes (the NameNode and DataNodes) provide a distributed filesystem, while the JobTracker manages jobs and the TaskTrackers run the tasks that perform parts of a job. Users submit MapReduce jobs to the JobTracker, which runs the Map and Reduce parts of each job on TaskTrackers, collects the results, and finally emits them.
Hadoop provides three installation choices:
- Local mode: everything runs inside a single Java process, with no HDFS and no daemons
- Pseudo distributed mode: all the Hadoop daemons run, but on a single machine
- Distributed mode: the daemons run across a cluster of machines
We will discuss the local mode in the first three recipes, and the pseudo distributed and distributed modes in the last three recipes.
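To give a flavor of what separates the modes, a pseudo distributed setup points both HDFS and the JobTracker at the local machine. The following is a minimal sketch of the two relevant configuration files, assuming the classic Hadoop 1.x layout and default ports; your installation's paths and port numbers may differ.

```xml
<!-- conf/core-site.xml: point the filesystem at a local HDFS instance -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: run the JobTracker on the same machine -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```

In local mode these files carry no such entries, so Hadoop falls back to the local filesystem and an in-process job runner; in fully distributed mode the `localhost` values are replaced by the hostnames of the NameNode and JobTracker machines.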