Chapter 2. Advanced HDFS

In this chapter, we will cover:

  • Benchmarking HDFS
  • Adding a new DataNode
  • Decommissioning DataNodes
  • Using multiple disks/volumes and limiting HDFS disk usage
  • Setting HDFS block size
  • Setting the file replication factor
  • Using HDFS Java API
  • Using HDFS C API (libhdfs)
  • Mounting HDFS (Fuse-DFS)
  • Merging files in HDFS

Introduction

Hadoop Distributed File System (HDFS) is a block-structured, distributed filesystem that is designed to run on low-cost commodity hardware. HDFS supports storing massive amounts of data and provides high-throughput access to that data. HDFS stores file data across multiple nodes with redundancy to ensure fault tolerance and high aggregate bandwidth.

HDFS is the default distributed filesystem used by Hadoop MapReduce computations. Hadoop supports data-locality-aware processing of the data stored in HDFS. However, HDFS can be used as a general-purpose distributed filesystem as well. The HDFS architecture consists mainly of a centralized NameNode, which handles the filesystem metadata, and DataNodes, which store the actual data blocks. HDFS data blocks are much coarser grained than those of typical filesystems, and HDFS performs better when storing and processing large files.
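As a quick illustration of this architecture, the following minimal sketch uses the HDFS Java API (covered in detail in the Using HDFS Java API recipe later in this chapter) to ask the NameNode for the block size and replication factor of a file. The path /user/hadoop/sample.txt and the NameNode address are hypothetical placeholders; the sketch assumes a client-side Hadoop configuration on the classpath that points at your cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsBlockInfo {
        public static void main(String[] args) throws Exception {
            // Loads core-site.xml/hdfs-site.xml from the classpath;
            // fs.default.name (fs.defaultFS in newer releases) should
            // point at your NameNode, for example hdfs://namenode:9000.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical file path, used purely for illustration.
            FileStatus status =
                fs.getFileStatus(new Path("/user/hadoop/sample.txt"));

            // Block size and replication factor are per-file metadata
            // kept by the NameNode; the DataNodes hold the blocks themselves.
            System.out.println("Block size  : " + status.getBlockSize() + " bytes");
            System.out.println("Replication : " + status.getReplication());
        }
    }

Note that this metadata query never touches a DataNode; it is answered entirely by the NameNode, which is why HDFS keeps the filesystem namespace on a single centralized node while spreading the block data across the cluster.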

The Setting up HDFS recipe and other related recipes in Chapter 1, Getting Hadoop Up and Running in a Cluster, show how to deploy HDFS and give an overview of its basic operation. In this chapter, you will be introduced to a selected set of advanced HDFS operations that are useful when performing large-scale data processing with Hadoop MapReduce, as well as when using HDFS as a standalone distributed filesystem for non-MapReduce use cases.
