In this chapter we will cover:
Hadoop Distributed File System (HDFS) is a fault-tolerant distributed filesystem designed to run on "off-the-shelf" hardware. It has been optimized for streaming reads on large files whereas I/O throughput is favored over low latency. In addition, HDFS uses a simple model for data consistency where files can only be written to once.
HDFS assumes disk failure as an eventuality and uses a concept called
block replication to replicate data across nodes in the cluster. HDFS uses a much larger block size when compared to desktop filesystems. For example, the default block size for HDFS is 64 MB. Once a file has been placed into HDFS, the file is divided into one or more data blocks and is distributed to nodes in the cluster. In addition, copies of the data blocks are made, which again are distributed to nodes in the cluster to ensure high data availability in case of a disk failure. The number of copies HDFS makes of each data block is determined by the
replication factor setting. The default replication factor is 3
, meaning three replicas of a data block will be distributed across the nodes in the cluster.
Finally, applications using HDFS can achieve high throughput because the Hadoop framework was designed to move computation to the data. In other words, applications can run on the nodes where the data resides instead of moving the data to the application. This concept is known as data locality.
HDFS consists of three services:
This chapter will use the FileSystem API, MapReduce, and advanced serialization libraries to efficiently write and store data in HDFS.