HDFS stores files across the cluster by breaking them down into coarse-grained, fixed-size blocks. These data blocks are replicated to different DataNodes, primarily for fault tolerance. Block replication can also improve the data locality of MapReduce computations and increase the total data access bandwidth. Reducing the replication factor helps save storage space in HDFS.
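As a rough illustration of the storage trade-off (a back-of-the-envelope sketch, not part of HDFS itself), the raw cluster storage consumed by a file is approximately its size multiplied by its replication factor:

```python
def raw_storage_gb(file_size_gb, replication_factor):
    """Approximate raw cluster storage consumed by one HDFS file.

    HDFS does not pad the final partial block, so the estimate is
    simply the file size multiplied by the replication factor.
    """
    return file_size_gb * replication_factor

# A 10 GB file at the default replication factor of 3 consumes
# about 30 GB of raw storage; lowering the factor to 2 reclaims
# about 10 GB.
print(raw_storage_gb(10, 3) - raw_storage_gb(10, 2))  # 10
```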
The HDFS replication factor is a file-level property that can be set on a per-file basis. This recipe shows how to change the default replication factor of an HDFS deployment, affecting files created afterwards; how to specify a custom replication factor at file-creation time; and how to change the replication factor of files already in HDFS.
Set the dfs.replication
property in $HADOOP_HOME/conf/hdfs-site.xml
to change the default replication factor of an HDFS deployment. This change does not affect the replication factor of files that are already in HDFS; only files created after the change will use the new replication factor.

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
To specify a custom replication factor when creating a file, set the dfs.replication property on the command line:

>bin/hadoop fs -D dfs.replication=1 -copyFromLocal non-critical-file.txt /user/foo
The setrep
command can be used to change the replication factor of files or file paths that are already in HDFS:

> bin/hadoop fs -setrep 2 non-critical-file.txt
Replication 2 set: hdfs://myhost:9000/user/foo/non-critical-file.txt
The setrep
command syntax is as follows:

hadoop fs -setrep [-R] <rep> <path>
The <rep>
parameter of the setrep
command specifies the new replication factor, and the <path>
parameter specifies the HDFS file or directory whose replication factor is to be changed. The -R
option recursively sets the replication factor for the files and directories within a directory.
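For example, the following command (using a hypothetical /user/foo/non-critical-data directory) would recursively set the replication factor of every file under that directory to 2:

```shell
# /user/foo/non-critical-data is an illustrative path; requires a running HDFS cluster
> bin/hadoop fs -setrep -R 2 /user/foo/non-critical-data
```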
The replication factor of a file is displayed when listing files using the ls
command:
> bin/hadoop fs -ls
Found 1 items
-rw-r--r--   2 foo supergroup ... /user/foo/non-critical-file.txt
The replication factor of files is displayed when browsing files in the HDFS monitoring UI.