问题
I'm running a spark job on EMR but need to create a checkpoint. I tried using s3 but got this error message
17/02/24 14:34:35 ERROR ApplicationMaster: User class threw exception:
java.lang.IllegalArgumentException: Wrong FS: s3://spark-
jobs/checkpoint/31d57e4f-dbd8-4a50-ba60-0ab1d5b7b14d/connected-
components-e3210fd6/2, expected: hdfs://ip-172-18-13-18.ec2.internal:8020
java.lang.IllegalArgumentException: Wrong FS: s3://spark-
jobs/checkpoint/31d57e4f-dbd8-4a50-ba60-0ab1d5b7b14d/connected-
components-e3210fd6/2, expected: hdfs://ip-172-18-13-18.ec2.internal:8020
Here is my sample code
...
val sparkConf = new SparkConf().setAppName("spark-job")
.set("spark.default.parallelism", (CPU * 3).toString)
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.registerKryoClasses(Array(classOf[Member], classOf[GraphVertex], classOf[GraphEdge]))
.set("spark.dynamicAllocation.enabled", "true")
implicit val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
sparkSession.sparkContext.setCheckpointDir("s3://spark-jobs/checkpoint")
....
How can I checkpoint on AWS EMR?
回答1:
There's a now fixed bug for Spark which meant you could only checkpoint to the default FS, not any other one (like S3). It's fixed in master, don't know about backports.
if it makes you feel any better, the way checkpointing works: write then rename() is slow enough on the object store you may find yourself off better checkpointing locally then doing the upload to s3 yourself.
回答2:
There is a fix in the master branch for this to allow checkpoint to s3 too. I was able to build against it and it worked so this should be part of next release.
回答3:
Try something with AWS authenticaton like:
val hadoopConf: Configuration = new Configuration()
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", "id-1")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "secret-key")
sparkSession.sparkContext.getOrCreate(checkPointDir, () =>
{ createStreamingContext(checkPointDir, config) }, hadoopConf)
来源:https://stackoverflow.com/questions/42441870/spark-aws-emr-checkpoint-location