Chapter 2. HDFS

In this chapter we will cover:

Reading and writing data to HDFS
Compressing data using LZO
Reading and writing data to SequenceFiles
Using Apache Avro to serialize data
Using Apache Thrift to serialize data
Using Protocol Buffers to serialize data
Setting the replication factor for HDFS
Setting the block size for HDFS

Introduction

The Hadoop Distributed File System (HDFS) is a fault-tolerant distributed filesystem designed to run on "off-the-shelf" hardware. It is optimized for streaming reads of large files, favoring I/O throughput over low latency. In addition, HDFS uses a simple data consistency model in which a file can be written only once.

HDFS treats disk failure as inevitable and uses a concept called block replication to replicate data across nodes in the cluster. HDFS uses a much larger block size than desktop filesystems; the default block size for HDFS is 64 MB. When a file is placed into HDFS, it is divided into one or more data blocks, which are distributed to nodes in the cluster. Copies of each data block are also made and distributed to other nodes, so that the data remains available if a disk fails. The number of copies HDFS keeps of each data block is determined by the replication factor setting. The default replication factor is 3, meaning three replicas of each data block are distributed across the nodes in the cluster.
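The block and replica arithmetic described above can be made concrete with a small calculation. The following is a minimal sketch in plain Java, not Hadoop code: the class and method names are hypothetical, and the 200 MB file size is an arbitrary example. It assumes the 64 MB block size and replication factor of 3 given in the text, and that the final partial block occupies only its actual length on disk.

```java
// Illustrative arithmetic only. HdfsBlockMath and its methods are
// hypothetical names and are not part of the Hadoop API.
final class HdfsBlockMath {
    static final long MB = 1024L * 1024L;

    // Number of blocks a file is split into: ceiling of fileSize / blockSize.
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes < 0 || blockSizeBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative, block size positive");
        }
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Total number of block replicas stored across the cluster.
    static long replicaCount(long fileSizeBytes, long blockSizeBytes, int replicationFactor) {
        return blockCount(fileSizeBytes, blockSizeBytes) * replicationFactor;
    }

    // Raw bytes consumed, assuming a partial last block only uses its real length.
    static long rawBytes(long fileSizeBytes, int replicationFactor) {
        return fileSizeBytes * replicationFactor;
    }

    public static void main(String[] args) {
        long fileSize = 200 * MB;   // example file size (an assumption)
        long blockSize = 64 * MB;   // default block size from the text
        int replication = 3;        // default replication factor from the text

        System.out.println("blocks   = " + blockCount(fileSize, blockSize));
        System.out.println("replicas = " + replicaCount(fileSize, blockSize, replication));
        System.out.println("raw MB   = " + rawBytes(fileSize, replication) / MB);
    }
}
```

With these inputs, the 200 MB file is split into 4 blocks (three full 64 MB blocks and one 8 MB block), stored as 12 block replicas that consume 600 MB of raw disk space across the cluster.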

Finally, applications using HDFS can achieve high throughput because the Hadoop framework was designed to move computation to the data. In other words, applications can run on the nodes where the data resides instead of moving the data to the application. This concept is known as data locality.
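The idea of data locality can be sketched as a simple placement rule: given the hosts that hold replicas of a block and the nodes that currently have free capacity, prefer a node that already stores the data. This is a toy illustration, not the scheduler Hadoop actually uses (which considers more than this single rule); the class name, method name, and node names are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Toy model of locality-aware placement. LocalityPicker is a hypothetical
// name and does not correspond to any class in Hadoop.
final class LocalityPicker {

    // Prefer a free node that already holds a replica of the block.
    // Fall back to the first free node (data must then move over the network).
    static Optional<String> pickNode(List<String> replicaHosts, List<String> freeNodes) {
        for (String node : freeNodes) {
            if (replicaHosts.contains(node)) {
                return Optional.of(node); // data-local: computation moves to the data
            }
        }
        return freeNodes.isEmpty() ? Optional.empty() : Optional.of(freeNodes.get(0));
    }

    public static void main(String[] args) {
        List<String> replicaHosts = Arrays.asList("node2", "node5", "node7"); // example hosts
        List<String> freeNodes = Arrays.asList("node1", "node5");             // example free nodes
        System.out.println(pickNode(replicaHosts, freeNodes).orElse("none"));
    }
}
```

In this example the task is placed on node5, because it is the only free node that already stores a replica of the block.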

HDFS consists of three services:

HDFS Application	Purpose
NameNode	This maintains a catalog of all block locations in the cluster.
Secondary NameNode	This periodically synchronizes with the NameNode block index. During the synchronizing process, the Secondary NameNode retrieves the current NameNode image and edit logs, merges them together, and then sends the merged image back to the NameNode. The Secondary NameNode is not a "hot" backup of the NameNode. It cannot be used in the event of a NameNode failure.
DataNode	This stores and manages the data blocks assigned to it. It has no knowledge of the overall cluster layout and takes its instructions from the NameNode.
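The checkpoint performed by the Secondary NameNode can be pictured as replaying a log of changes onto a snapshot. The sketch below is a deliberately simplified model in plain Java: the image is reduced to a set of file paths and the edit log to a list of create and delete operations. The real image and edit log formats hold far more than this, and every name in the code is hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Conceptual model of a checkpoint merge. CheckpointSketch, Op, and Edit are
// hypothetical names; this is not how Hadoop represents its image or edits.
final class CheckpointSketch {
    enum Op { CREATE, DELETE }

    // One entry in the (simplified) edit log.
    static final class Edit {
        final Op op;
        final String path;

        Edit(Op op, String path) {
            this.op = op;
            this.path = path;
        }
    }

    // Apply the edits, in order, to a copy of the image and return the merged image.
    static SortedSet<String> merge(Set<String> image, List<Edit> edits) {
        SortedSet<String> merged = new TreeSet<>(image); // leave the input image untouched
        for (Edit e : edits) {
            if (e.op == Op.CREATE) {
                merged.add(e.path);
            } else {
                merged.remove(e.path);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Set<String> image = new HashSet<>(Arrays.asList("/a", "/b"));
        List<Edit> edits = Arrays.asList(
                new Edit(Op.CREATE, "/c"),
                new Edit(Op.DELETE, "/a"));
        System.out.println(merge(image, edits)); // merged image sent back to the NameNode
    }
}
```

Starting from an image containing /a and /b, replaying the two edits yields a merged image containing /b and /c. Once a merged image exists, the edits it already includes no longer need to be replayed.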

This chapter will use the FileSystem API, MapReduce, and advanced serialization libraries to efficiently write and store data in HDFS.

Note

Hadoop version 0.20.x does not support append operations.
