How to copy data from one HDFS to another HDFS?

日久生厌 2021-01-30 11:30

I have two HDFS setups and want to copy (not migrate or move) some tables from HDFS1 to HDFS2. How can I copy data from one HDFS to another HDFS? Is it possible via Sqoop or some other command?

6 Answers
  •  无人共我  2021-01-30 12:21

    Hadoop comes with a useful program called distcp for copying large amounts of data to and from Hadoop filesystems in parallel. The canonical use case for distcp is transferring data between two HDFS clusters. If the clusters are running identical versions of Hadoop, the hdfs scheme is appropriate to use.

    $ hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
    

    The data in the /foo directory of namenode1 will be copied to the /bar directory of namenode2. If the /bar directory does not exist, distcp will create it. We can also specify multiple source paths, as shown below.
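
    For example, two source directories can be copied into one destination in a single run (the extra /baz path is purely illustrative):

    $ hadoop distcp hdfs://namenode1/foo hdfs://namenode1/baz hdfs://namenode2/bar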

    Similar to the rsync command, distcp by default skips files that already exist at the destination. The -overwrite option overwrites existing files in the destination directory, and the -update option copies only the files that have changed; both are shown below.

    $ hadoop distcp -update hdfs://namenode1/foo hdfs://namenode2/bar/foo
    
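    To force replacement of files that already exist at the destination, -overwrite can be used in the same way (the paths here mirror the earlier illustrative example):

    $ hadoop distcp -overwrite hdfs://namenode1/foo hdfs://namenode2/bar/foo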

    distcp is implemented as a MapReduce job in which the work of copying is done by map tasks that run in parallel across the cluster. There are no reducers.
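
    The maximum number of simultaneous copies (map tasks) can be tuned with the -m option; the value 20 below is only illustrative:

    $ hadoop distcp -m 20 hdfs://namenode1/foo hdfs://namenode2/bar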

    If we try to copy data between two HDFS clusters that are running different versions, the copy will fail, since the RPC systems are incompatible. In that case we need to use the read-only, HTTP-based HFTP filesystem to read from the source, and the job has to run on the destination cluster.

    $ hadoop distcp hftp://namenode1:50070/foo hdfs://namenode2/bar
    

    50070 is the default port number for the namenode's embedded web server.
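
    On newer Hadoop releases, where HFTP has been removed, the webhdfs scheme can be used in much the same way for copies between different versions; the port below assumes the Hadoop 2 default NameNode HTTP port (newer releases default to 9870):

    $ hadoop distcp webhdfs://namenode1:50070/foo hdfs://namenode2/bar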
