Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is a fault-tolerant distributed filesystem designed to run on low-cost hardware and to handle very large datasets (on the order of hundreds of petabytes to exabytes). Although HDFS requires a fast network connection to transfer data across nodes, its latency cannot be as low as that of a classic filesystem (it may be on the order of seconds); HDFS has therefore been designed for batch processing and high throughput. Each HDFS node contains a part of the filesystem's data, and the same data is replicated on other nodes; this replication provides both high-throughput access and fault tolerance.

HDFS has a master-slave architecture. The master is called the NameNode; if it fails, a standby node is ready to take control (note that the Secondary NameNode, despite its name, is a checkpointing helper rather than a failover node). All the other instances are slaves (DataNodes); if one of them fails, it's not a problem, as HDFS has been designed with this in mind: no data is lost (it is redundantly replicated) and operations are quickly redistributed to the surviving nodes. DataNodes contain blocks of data: each file saved in HDFS is broken up into chunks (or blocks), typically 64 MB each (128 MB in recent Hadoop versions), which are then distributed and replicated across a set of DataNodes. For example, with a 64 MB block size and the default replication factor of 3, a 200 MB file is split into four blocks (three full ones and an 8 MB remainder) and stored as twelve block replicas in total. The NameNode stores only the metadata of the files in the distributed filesystem; it doesn't store any actual data, just the information needed to locate each file's blocks on the DataNodes it manages.
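To see how this block metadata is exposed, here is a minimal sketch using the Hadoop Java FileSystem API, which asks the NameNode for the block locations of one file; the cluster address hdfs://namenode:8020 and the file path are hypothetical placeholders:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlocks {
        public static void main(String[] args) throws Exception {
            // Connect to the cluster; the NameNode address is a placeholder.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

            // Ask the NameNode for the metadata of one file.
            FileStatus status = fs.getFileStatus(new Path("/data/input.csv"));
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

            // Each entry is a block (offset, length) plus the DataNodes hosting it.
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }

Note that this program never reads a byte of file content: the entire answer comes from the NameNode's metadata.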

A client asking to read a file must first contact the NameNode, which replies with a table containing an ordered list of the file's blocks and their locations (that is, the DataNodes that host them). At this point, the client contacts those DataNodes directly, downloading the blocks and reconstructing the file by appending them in order.
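This whole read path is hidden behind a single open() call in the Java FileSystem API. The following sketch streams a file to standard output; the file path is again a placeholder:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class ReadFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

            // open() contacts the NameNode for the block list, then streams
            // the blocks from the DataNodes transparently, in order.
            try (FSDataInputStream in = fs.open(new Path("/data/input.csv"))) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
            fs.close();
        }
    }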

To write a file, the client again contacts the NameNode first; the NameNode decides how to handle the request, updates its records, and replies with an ordered list of DataNodes indicating where each block of the file should be written. The client then uploads the blocks to those DataNodes, as reported in the NameNode's reply. Namespace queries (for example, listing a directory's contents, creating a folder, and so on) are instead handled entirely by the NameNode from its metadata, without involving any DataNode.
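A corresponding write, sketched with the same API and placeholder paths: create() obtains the target DataNodes from the NameNode, while the namespace operations at the end never touch a DataNode at all.

    import java.net.URI;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

            // create() asks the NameNode where to place the blocks;
            // the client then streams the data to those DataNodes.
            try (FSDataOutputStream out = fs.create(new Path("/data/output.txt"))) {
                out.write("hello, HDFS\n".getBytes(StandardCharsets.UTF_8));
            }

            // Namespace queries are answered by the NameNode alone.
            fs.mkdirs(new Path("/data/archive"));
            for (FileStatus s : fs.listStatus(new Path("/data"))) {
                System.out.println(s.getPath() + "\t" + s.getLen());
            }
            fs.close();
        }
    }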

The NameNode is also responsible for detecting DataNode failures (a DataNode is marked as dead if no heartbeat packets are received from it within a configurable timeout) and for re-replicating the failed node's data to other nodes.
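The replication factor and the heartbeat timing are configurable in hdfs-site.xml; the values below are illustrative and correspond to the usual defaults:

    <!-- hdfs-site.xml (illustrative values; these are the common defaults) -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value>   <!-- copies kept of each block -->
      </property>
      <property>
        <name>dfs.heartbeat.interval</name>
        <value>3</value>   <!-- seconds between DataNode heartbeats -->
      </property>
    </configuration>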

Although these operations are complex and hard to implement robustly, they're completely transparent to the user, thanks to the many client libraries and the HDFS shell. The way you operate on HDFS is quite similar to the way you operate on a local filesystem, and this is a great benefit of Hadoop: hiding the complexity and letting users work with it simply.
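For instance, the HDFS shell mirrors familiar Unix commands; the paths and filenames below are placeholders:

    hdfs dfs -mkdir -p /user/alice/data                # create a directory tree
    hdfs dfs -put local.csv /user/alice/data           # upload a local file
    hdfs dfs -ls /user/alice/data                      # list a directory
    hdfs dfs -cat /user/alice/data/local.csv           # print a file's content
    hdfs dfs -get /user/alice/data/local.csv copy.csv  # download a file

Each of these commands triggers the NameNode/DataNode interactions described above, but from the user's point of view they behave just like mkdir, cp, ls, and cat.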
