Setting the file replication factor

HDFS stores files across the cluster by breaking them down into coarse-grained, fixed-size blocks. These data blocks are replicated across different DataNodes, mainly for fault tolerance. Data block replication can also improve the data locality of MapReduce computations and increase the total data access bandwidth. Reducing the replication factor saves storage space in HDFS.
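The storage trade-off can be estimated with simple arithmetic: the raw space a file occupies is roughly its size multiplied by its replication factor. The following is a minimal sketch of that calculation. The function name raw_storage_mb and the 1024 MB file size are hypothetical, chosen only for illustration, and the estimate ignores block metadata and other overheads.

```shell
#!/bin/sh
# raw_storage_mb SIZE_MB REPLICATION
# Prints the approximate raw cluster storage (in MB) consumed by a file of
# SIZE_MB megabytes stored with the given replication factor.
# (Hypothetical helper for illustration; not part of Hadoop.)
raw_storage_mb() {
  size_mb=$1
  replication=$2
  echo $(( size_mb * replication ))
}

# A hypothetical 1024 MB file at replication 3 versus replication 2.
raw_storage_mb 1024 3   # prints 3072
raw_storage_mb 1024 2   # prints 2048
```

Under these assumptions, lowering the replication factor from 3 to 2 frees about one third of the raw space the file occupied.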

The HDFS replication factor is a file-level property that can be set on a per-file basis. This recipe shows how to change the default replication factor of an HDFS deployment, which affects only the files created afterwards; how to specify a custom replication factor at the time of file creation in HDFS; and how to change the replication factor of files that already exist in HDFS.

How to do it...

  1. To set the file replication factor using the NameNode configuration, add or modify the dfs.replication property in $HADOOP_HOME/conf/hdfs-site.xml. This change does not affect the replication factor of files that are already in HDFS. Only files copied after the change will have the new replication factor.
    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>
    
  2. To set the file replication factor when uploading files, specify the replication factor from the command line, as follows:
    >bin/hadoop fs -D dfs.replication=1 -copyFromLocal non-critical-file.txt /user/foo
    
  3. The setrep command can be used to change the replication factor of files or file paths that are already in HDFS.
    >bin/hadoop fs -setrep 2 non-critical-file.txt
    Replication 2 set: hdfs://myhost:9000/user/foo/non-critical-file.txt
    

How it works...

The setrep command syntax is as follows:

hadoop fs -setrep [-R] <rep> <path>

The <rep> parameter is the new replication factor, and the <path> parameter specifies the HDFS path where the replication factor has to be changed. The -R option recursively sets the replication factor for files and directories within a directory.
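To make the argument order concrete, the following is a minimal sketch of a wrapper that checks the replication factor and prints the setrep command line it would run, without executing it. The function name setrep_cmd and its "recursive" keyword argument are hypothetical and not part of Hadoop. Only the -setrep command and its -R option come from the recipe.

```shell
#!/bin/sh
# setrep_cmd REP PATH [recursive]
# Prints (does not run) the setrep command line for the given arguments.
# Returns non-zero if REP is not a positive integer.
# (Hypothetical helper for illustration; not part of Hadoop.)
setrep_cmd() {
  rep=$1
  path=$2
  recursive=${3:-}
  # Reject empty values and anything containing a non-digit character.
  case $rep in
    ''|*[!0-9]*)
      echo "replication factor must be a positive integer" >&2
      return 1
      ;;
  esac
  # Reject zero.
  if [ "$rep" -lt 1 ]; then
    echo "replication factor must be at least 1" >&2
    return 1
  fi
  if [ "$recursive" = "recursive" ]; then
    echo "hadoop fs -setrep -R $rep $path"
  else
    echo "hadoop fs -setrep $rep $path"
  fi
}

# Single file, then every file under a directory.
setrep_cmd 2 /user/foo/non-critical-file.txt
setrep_cmd 2 /user/foo recursive
```

Printing the command rather than running it keeps the sketch usable on a machine without a Hadoop installation. To run it against a cluster, the printed line could be executed instead.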

There's more...

The replication factor of a file is displayed when listing the files using the ls command.

>bin/hadoop fs -ls
Found 1 item
-rw-r--r--   2 foo supergroup ... /user/foo/non-critical-file.txt
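
In this listing format, the replication factor is the second whitespace-separated column, directly after the permission string. The following is a minimal sketch of pulling that value out of a single listing line with awk. The function name replication_of is hypothetical, and the size and date fields in the sample line are made-up placeholders. Directories show a dash in this column, because the replication factor applies only to files.

```shell
#!/bin/sh
# replication_of LS_LINE
# Prints the second whitespace-separated field of one `hadoop fs -ls`
# output line, which holds the replication factor for files.
# (Hypothetical helper for illustration; not part of Hadoop.)
replication_of() {
  printf '%s\n' "$1" | awk '{ print $2 }'
}

# Sample line with placeholder size and date values.
replication_of "-rw-r--r--   2 foo supergroup 1024 2012-01-01 10:00 /user/foo/non-critical-file.txt"   # prints 2
```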

The replication factor of files is displayed when browsing files in the HDFS monitoring UI.

See also

  • The Setting HDFS block size recipe in this chapter.