Question
I'm trying to copy 193 GB of data from S3 to HDFS. I'm running the following commands for s3-dist-cp and hadoop distcp:
s3-dist-cp --src s3a://PathToFile/file1 --dest hdfs:///user/hadoop/S3CopiedFiles/
hadoop distcp s3a://PathToFile/file1 hdfs:///user/hadoop/S3CopiedFiles/
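For reference, hadoop distcp also takes standard tuning flags; the values below are only illustrative (a sketch, not what I actually ran), where -m caps the number of simultaneous map tasks and -bandwidth limits the MB/s per map:
hadoop distcp -m 20 -bandwidth 100 s3a://PathToFile/file1 hdfs:///user/hadoop/S3CopiedFiles/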
I'm running these on the master node and also keeping a check on the amount of data being transferred. The copy takes about an hour; after the data is copied over, everything gets erased, disk usage shows as 99.8% on the 4 core instances in my cluster, and the Hadoop job runs forever. As soon as I run the command,
16/07/18 18:43:55 INFO mapreduce.Job: map 0% reduce 0%
16/07/18 18:44:02 INFO mapreduce.Job: map 100% reduce 0%
16/07/18 18:44:08 INFO mapreduce.Job: map 100% reduce 14%
16/07/18 18:44:11 INFO mapreduce.Job: map 100% reduce 29%
16/07/18 18:44:13 INFO mapreduce.Job: map 100% reduce 86%
16/07/18 18:44:18 INFO mapreduce.Job: map 100% reduce 100%
This gets printed immediately, then data is copied over for an hour, and then it starts all over again:
16/07/18 19:52:45 INFO mapreduce.Job: map 0% reduce 0%
16/07/18 18:52:53 INFO mapreduce.Job: map 100% reduce 0%
Am I missing anything here? Any help is appreciated.
Also, where can I find the log files on the master node, so I can check whether the job is failing and therefore looping? Thanks.
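For anyone looking, on EMR the relevant logs are usually in a few standard places (the paths and port below are EMR/YARN defaults, so treat them as assumptions for your setup):
yarn logs -applicationId <application_id>    # aggregated container logs, once the application has finished
ls /mnt/var/log/hadoop/steps/                # EMR step logs on the master node
The ResourceManager web UI on the master node (port 8088 by default) also shows running and failed application attempts.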
Answer 1:
In my case, I copy a single large compressed file from HDFS to S3, and hadoop distcp is much faster than s3-dist-cp.
When I check the logs, the multipart-upload part takes a very long time in the reduce step: uploading one block (134 MB) takes 20 seconds with s3-dist-cp, while it takes only 4 seconds with hadoop distcp.
The difference between distcp and s3-dist-cp is that distcp creates its temporary files on S3 (the destination file system), while s3-dist-cp creates its temporary files on HDFS.
I am still investigating why the multipart-upload performance differs so much between distcp and s3-dist-cp; I hope someone with good insight can contribute here.
Answer 2:
If you can pick up Hadoop 2.8.0 for your investigation and use the s3a:// filesystem, you can grab the many filesystem statistics it now collects.
A real performance killer is rename(), which the S3 clients mimic by doing a copy and then a delete: if either distcp run is trying to do an atomic commit via renames, that adds a delay of about 1 second for every 6-10 MB of data. That 134 MB block with a 16-second post-upload delay would fit the "it's a rename" explanation.
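To put rough numbers on it: at 1 second per 6-10 MB, renaming a 134 MB block would cost roughly 134/10 ≈ 13 to 134/6 ≈ 22 seconds, which brackets the 16-second gap (20 s vs 4 s) reported in the first answer.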
Source: https://stackoverflow.com/questions/38462480/s3-dist-cp-and-hadoop-distcp-job-infinitely-loopin-in-emr