Hadoop Distributed Copy (distcp) is a tool for efficiently copying large amounts of data within or between clusters. It uses the MapReduce framework to do the copying, which brings the benefits of parallelism, error handling, recovery, logging, and reporting. The distcp command is useful when moving data between development, research, and production cluster environments.
The source and destination clusters must be able to reach each other.
The source cluster should have speculative execution turned off for map tasks. In the mapred-site.xml configuration file, set mapred.map.tasks.speculative.execution to false. This prevents undefined behavior in the case where a map task fails and a speculative copy of it is launched.
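The property follows the standard Hadoop configuration format in mapred-site.xml:

```xml
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
```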
The source and destination cluster must use the same RPC protocol. Typically, this means that the source and destination cluster should have the same version of Hadoop installed.
Complete the following steps to copy a folder from one cluster to another:

1. Copy the weblogs folder from cluster A to cluster B:

hadoop distcp hdfs://namenodeA/data/weblogs hdfs://namenodeB/data/weblogs

2. Copy the weblogs folder from cluster A to cluster B, overwriting any files that already exist at the destination:

hadoop distcp -overwrite hdfs://namenodeA/data/weblogs hdfs://namenodeB/data/weblogs

3. Synchronize the weblogs folder between cluster A and cluster B, copying only the files that are missing or have changed:

hadoop distcp -update hdfs://namenodeA/data/weblogs hdfs://namenodeB/data/weblogs
On the source cluster, the contents of the folder being copied are treated as a large temporary file. A map-only MapReduce job is created, which will do the copying between clusters. By default, each mapper is given a 256-MB block of the temporary file. For example, if the weblogs folder was 10 GB in size, 40 mappers would each get roughly 256 MB to copy. distcp also has an option to specify the number of mappers:
hadoop distcp -m 10 hdfs://namenodeA/data/weblogs hdfs://namenodeB/data/weblogs
In the previous example, 10 mappers would be used. If the weblogs folder was 10 GB in size, then each mapper would be given roughly 1 GB to copy.
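The split arithmetic can be sketched as follows. This is illustrative only: real distcp assigns whole files to mappers, so the per-mapper amounts are approximate, not exact block sizes.

```python
# Illustrative sketch of how distcp divides the copy work among mappers.
# Note: actual distcp splits on whole-file boundaries, so these numbers
# are rough averages, not guarantees.

GB = 1024 ** 3
MB = 1024 ** 2

total_size = 10 * GB  # size of the weblogs folder from the example

# Default behavior: roughly one mapper per 256 MB of input.
default_mappers = total_size // (256 * MB)
print(default_mappers)  # 40

# With -m 10, the same data is spread across 10 mappers,
# so each mapper copies roughly 1 GB.
per_mapper_gb = total_size / 10 / GB
print(per_mapper_gb)  # 1.0
```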
When copying between two clusters that are running different versions of Hadoop, it is generally recommended to use HftpFileSystem as the source. HftpFileSystem is a read-only filesystem, so the distcp command has to be run from the destination cluster:
hadoop distcp hftp://namenodeA:port/data/weblogs hdfs://namenodeB/data/weblogs
In the preceding command, port is defined by the dfs.http.address property in the hdfs-site.xml configuration file.
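For example, hdfs-site.xml might contain an entry like the following. The port 50070 shown here is the common NameNode HTTP default; your cluster's value may differ, so check the property on the source cluster.

```xml
<property>
  <name>dfs.http.address</name>
  <value>namenodeA:50070</value>
</property>
```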