Spark AWS emr checkpoint location

六月ゝ 毕业季﹏ 提交于 2019-12-25 09:10:45

问题


I'm running a spark job on EMR but need to create a checkpoint. I tried using s3 but got this error message

17/02/24 14:34:35 ERROR ApplicationMaster: User class threw exception: 
java.lang.IllegalArgumentException: Wrong FS: s3://spark-
jobs/checkpoint/31d57e4f-dbd8-4a50-ba60-0ab1d5b7b14d/connected-
components-e3210fd6/2, expected: hdfs://ip-172-18-13-18.ec2.internal:8020
java.lang.IllegalArgumentException: Wrong FS: s3://spark-
jobs/checkpoint/31d57e4f-dbd8-4a50-ba60-0ab1d5b7b14d/connected-
components-e3210fd6/2, expected: hdfs://ip-172-18-13-18.ec2.internal:8020

Here is my sample code

...
val sparkConf = new SparkConf().setAppName("spark-job")
  .set("spark.default.parallelism", (CPU * 3).toString)
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Member], classOf[GraphVertex], classOf[GraphEdge]))
  .set("spark.dynamicAllocation.enabled", "true")


implicit val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()

sparkSession.sparkContext.setCheckpointDir("s3://spark-jobs/checkpoint")
....

How can I checkpoint on AWS EMR?


回答1:


There's a now fixed bug for Spark which meant you could only checkpoint to the default FS, not any other one (like S3). It's fixed in master, don't know about backports.

if it makes you feel any better, the way checkpointing works: write then rename() is slow enough on the object store you may find yourself off better checkpointing locally then doing the upload to s3 yourself.




回答2:


There is a fix in the master branch for this to allow checkpoint to s3 too. I was able to build against it and it worked so this should be part of next release.




回答3:


Try something with AWS authenticaton like:

val hadoopConf: Configuration = new Configuration()
  hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
  hadoopConf.set("fs.s3n.awsAccessKeyId", "id-1")
  hadoopConf.set("fs.s3n.awsSecretAccessKey", "secret-key")

  sparkSession.sparkContext.getOrCreate(checkPointDir, () => 
      { createStreamingContext(checkPointDir, config) }, hadoopConf)


来源:https://stackoverflow.com/questions/42441870/spark-aws-emr-checkpoint-location

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!