54 | Big Data Simplied
will then look up in its internal table of contents to see where the first block (‘Block 1’) of
‘File1’ is stored. In this case, it happens to be DataNode 1.
The NameNode then forwards this request to the DataNode 1 to read the actual contents from
Block 1, and it is this content that is returned back to the client (Figure 3.6).
For example, three files are in HDFS with the size of a.txt (256 MB), b.txt (289 MB) and c.txt
(370 MB). Thus, HDFS will allocate a total of 8 blocks (default size of a block 128 MB) for these
three files. Here, a.txt will consume 2 blocks, b.txt and c.txt will absorb 3 blocks, respectively.
3.4.3 Fault Tolerance with Replication
Let us assume that a single le has been distributed across multiple nodes in the cluster as
discussed in the last section. Remember that each of these DataNodes in the cluster is commodity
hardware, which means it is prone to failure. There are two challenges to distributed storage.
First, how to manage the failure of DataNodes? Second, how to manage failure of the NameNode?
The fault tolerance strategy, which HDFS uses, is based on a replication factor. It is already
seen that a file is broken up into huge number of blocks distributed across DataNodes in a clus-
ter. No single machine holds the entire data for a single file. Further, every block of a file that is
stored in the cluster is replicated across multiple machines. Replication factor is a configuration
property that can be set. Based on this replication factor, the blocks which belong to a certain
file are replicated or copied to other nodes. The number of copies will depend on the replication
factor has been set.
In the example given in Figure 3.7, note that Block 1 and Block 2 of a file are stored on both
DataNode 1 and DataNode 2. This is an example where the replication factor is set to 2, i.e., each
block needs to have two replicas. Once these replicas are in place, how to get the information
about which DataNodes contain the replicas of a certain block? Well, the replica locations are also
stored in the NameNode. Just like the NameNode stores the mapping of file blocks, it also has a
mapping of the replicas of these blocks.
In the example given in Figure 3.8, the NameNode would also have entries for the replicas of
Block 1 and Block 2 on DataNode 2. Now, while choosing the locations for these replicas, two
FIGURE 3.6
Step 2 of reading a file in HDFS, such as client reads actual content of
Block1 from DataNode 1
DataNode 1
Block 1 Block 2
NameNode
File 1Block 1 DataNode 1
File 1Block 2 DataNode 1
File 1Block 3 DataNode 2
File 1Block 4 DataNode 2
File 1Block 5 DataNode 3
Client reads contents of Block 1
from DataNode 1
M03 Big Data Simplified XXXX 01.indd 54 5/10/2019 9:57:31 AM