Question
In big data systems, the code is pushed towards the data for execution. This makes sense, since the data is huge and the code to be executed is relatively small. On AWS EMR, the data can live either in HDFS or in S3. With S3, the data has to be pulled over the network to the core/task nodes for execution, which can add overhead compared to data already in HDFS.
Recently, I noticed that while an MR job was executing there was huge latency in getting the log files into S3. Sometimes it took a couple of minutes for the log files to appear, even after the job had completed.
Any thoughts on this? Does anyone have metrics comparing MR job completion times with data in HDFS vs. S3?
Answer 1:
That's problematic on a different level.
S3 has only eventual consistency. You can't necessarily read an object immediately after your code has written it (e.g. after a close() or flush()), because the write becomes visible with a delay. I think this might be due to the allocation of free resources for the data you write. So it is not a problem of performance, but of the consistency you really want/need.
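The practical workaround for that read-after-write delay is to poll with backoff until the object becomes visible. A minimal sketch of the pattern, with a hypothetical stand-in for the HEAD request (with boto3 this would be something like s3.head_object(Bucket=..., Key=...), which fails with a 404 until the write is visible):

```python
import time

def wait_until_visible(check, attempts=5, delay=0.01):
    """Poll check() until it returns True, backing off between tries.

    Returns True once the object is visible, False if we give up.
    """
    for i in range(attempts):
        if check():
            return True
        time.sleep(delay * (2 ** i))  # exponential backoff

    return False

# Hypothetical stand-in for an S3 HEAD request: here the object
# only "becomes visible" on the third poll.
calls = {"n": 0}
def fake_head_object():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_until_visible(fake_head_object))  # True
```

This only papers over the delay for a single known key; it doesn't help with list-after-write inconsistencies between chained jobs, which is why staging intermediate data in HDFS is the safer choice.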
What do I do on EMR? I start up a Hadoop cluster and put everything the job(s) need into HDFS. Reads are much more expensive in time on S3, and the eventual consistency makes it basically useless for buffering items between jobs.
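Staging the input from S3 into HDFS at cluster start can be done with s3-dist-cp, the distributed copy tool that ships with EMR. A sketch, where the bucket and paths are placeholders:

```shell
# Copy input data from S3 into the cluster's HDFS before running jobs.
# s3-dist-cp ships with EMR; the bucket name and paths below are placeholders.
s3-dist-cp --src s3://my-bucket/input/ --dest hdfs:///data/input/
```

This runs as a MapReduce job itself, so the copy is parallelized across the cluster rather than funneled through a single node.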
However S3 is great when backing up files from your HDFS or making them available for other instances or services (e.g. CloudFront).
Answer 2:
In terms of performance, HDFS is better than S3.
HDFS is better if your requirement is long-term, demands high performance, and you want to run iterative machine-learning algorithms.
S3 is better if your load is variable and you need high durability and persistence at lower cost.
For more information, see http://www.nithinkanil.com/2015/05/hdfs-vs-s3.html
Answer 3:
You must use S3 if you want to terminate the EMR cluster, because once the cluster is terminated, its HDFS data is deleted.
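For the same reason, any job output you want to keep has to be persisted out of HDFS before termination. A sketch using s3-dist-cp in the other direction (paths are placeholders; on EMR this can also run as a final step of the cluster):

```shell
# Back up HDFS output to S3 before terminating the cluster,
# since terminating the cluster destroys its HDFS data.
s3-dist-cp --src hdfs:///data/output/ --dest s3://my-bucket/output/
```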
Source: https://stackoverflow.com/questions/20143216/aws-emr-performance-hdfs-vs-s3