S3 and EMR data locality

Question

Data locality is very important with MapReduce and HDFS (the same goes for Spark and HBase). I've been researching AWS and the two options for deploying a cluster in their cloud:

EC2
EMR + S3

The second option seems more appealing for several reasons. The most interesting is the ability to scale storage and processing separately and to shut down processing when you don't need it (or, more precisely, to turn it on only when needed). This is an example explaining the advantages of using S3.

What bugs me is the issue of data locality. If the data is stored in S3, it will need to be pulled to HDFS every time a job is run. My question is: how big can this issue be, and is it still worth it?

What comforts me is that I'll be pulling the data only the first time, and all subsequent jobs will have the intermediate results locally.

I'm hoping for an answer from someone with practical experience with this. Thank you.

Answer 1:

EMR does not pull data from S3 to HDFS. It uses EMRFS, its own implementation of the Hadoop file system interface on top of S3, so jobs read from and write to S3 directly, as if they were operating on an actual HDFS. See https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html
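To illustrate the point above, here is a minimal PySpark sketch of a job that reads its input straight from an `s3://` path, with no copy step into HDFS first. It assumes it runs on an EMR cluster with Spark installed, where the `s3://` scheme is handled by the file system described in the linked documentation. The bucket name, key prefix, and helper function names are hypothetical and only for illustration.

```python
def s3_uri(bucket: str, key: str) -> str:
    """Build an s3:// URI for the given bucket and key prefix."""
    return f"s3://{bucket}/{key.lstrip('/')}"


def count_lines(spark, uri: str) -> int:
    """Read text files directly from `uri` and count the lines.

    There is no explicit copy to HDFS here: Spark reads the data
    through whatever file system is registered for the URI scheme.
    """
    return spark.read.text(uri).count()


def main() -> None:
    # Imported here so the helpers above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-direct-read").getOrCreate()
    # Hypothetical bucket and prefix; replace with your own.
    uri = s3_uri("my-example-bucket", "logs/2017/")
    print(count_lines(spark, uri))
    spark.stop()
```

On a cluster you would call `main()` from a script submitted with `spark-submit`. Pointing the same code at an `hdfs://` path only requires changing the URI, which is what "as if you are operating on an actual HDFS" means in practice.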

As for data locality, S3 is RACK_LOCAL to EMR Spark clusters.

Source: https://stackoverflow.com/questions/44304104/s3-and-emr-data-locality

Tags

amazon-web-services

Hadoop

amazon-s3

amazon-ec2

amazon-emr
