Mastering Hadoop 3

Distributed copy

You may need to copy data from one cluster to another, for example when an old cluster is being decommissioned or when the same data is required elsewhere for reporting or processing. The distcp command copies data from one HDFS-compatible file system to another.

distcp uses a MapReduce job to perform data distribution, error handling, recovery, and reporting. It expands the list of source files and directories into input for a number of map tasks, each of which copies a subset of the files to the destination cluster:

hadoop distcp hdfs://198.20.87.78:8020/user/packt/dir1 \
hdfs://198.89.76.34:8020/user/packt/dir2
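Because the copy is spread across map tasks, the degree of parallelism can be capped with the -m option, which sets the maximum number of simultaneous copies. The following is a minimal sketch: the RUN and MAX_MAPS variables and the value 10 are assumptions made for illustration, and the script only prints the command unless RUN is set to an empty string on a machine with a configured Hadoop client.

```shell
#!/bin/sh
# Dry-run by default: the command is printed, not executed.
# Run with RUN="" to actually invoke hadoop (requires a configured client).
RUN="${RUN-echo}"

# Illustrative cap on the number of map tasks; not a recommendation.
MAX_MAPS=10

$RUN hadoop distcp -m "$MAX_MAPS" \
  hdfs://198.20.87.78:8020/user/packt/dir1 \
  hdfs://198.89.76.34:8020/user/packt/dir2
```

A lower cap reduces the load on both clusters at the cost of a longer copy; a suitable value depends on the data volume and the available bandwidth.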

You can also specify multiple sources for the copy; the last path given is the destination:


hadoop distcp hdfs://198.20.87.78:8020/user/packt/dir1 \
hdfs://198.20.87.78:8020/user/packt/dir2 \
hdfs://198.89.76.34:8020/user/packt/dir3

When multiple sources are specified, distcp aborts the operation with an error message if two of the sources collide. By default, if a file already exists at the destination, it is skipped and the existing copy is left untouched, but we can use options such as -update or -overwrite to replace the destination file.
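The two options mentioned above can be sketched as follows. This is a hedged illustration rather than a recommended procedure: the RUN, SRC, and DST variables are assumptions, the paths are reused from the earlier examples, and the script only prints the commands unless RUN is set to an empty string.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Run with RUN="" to actually invoke hadoop (requires a configured client).
RUN="${RUN-echo}"

SRC="hdfs://198.20.87.78:8020/user/packt/dir1"
DST="hdfs://198.89.76.34:8020/user/packt/dir2"

# -update: copy files that are missing at the destination
# or that differ from the source version.
$RUN hadoop distcp -update "$SRC" "$DST"

# -overwrite: replace files that already exist at the destination.
$RUN hadoop distcp -overwrite "$SRC" "$DST"
```

Note that with either option, distcp copies the contents of the source directory into the destination rather than the source directory itself, so the resulting layout can differ from a plain copy.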