I'm running a Spark job using Scala; as expected, all jobs finish on time, but somehow some INFO logs are printed for 20-25 minutes before the job stops.
Posting a few of the logs:
As I put in a comment, I recommend using the spark-csv package instead of sc.saveAsTextFile; there are no problems with writing directly to S3 using that package :)
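For example, on Spark 1.5 with spark-csv, a DataFrame can be written straight to S3 along these lines (a sketch only: it assumes the com.databricks:spark-csv package is on the classpath, `df` is an existing DataFrame, and the bucket/path are placeholders):

```scala
// Sketch: assumes the com.databricks:spark-csv package is on the classpath
// and that `df` is an existing DataFrame; the S3 path is a placeholder.
df.write
  .format("com.databricks.spark.csv")
  .option("header", "true")               // include a header row in the output
  .save("s3://my-bucket/output/result-csv")
```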
I don't know if you use s3 or s3n, but maybe try switching. I experienced problems with s3a on Spark 1.5.2 (EMR-4.2) where writes timed out all the time, and switching back to s3 solved the problem, so it's worth a try.
A couple of other things that should speed up writes to S3: use the DirectOutputCommitter, which writes output directly to its final location instead of staging it in a temporary directory first:
conf.set("spark.hadoop.mapred.output.committer.class","com.appsflyer.spark.DirectOutputCommitter")
and disabling generation of _SUCCESS files:
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
Note that disabling the _SUCCESS files has to be set on the Hadoop configuration of the SparkContext, and not on the SparkConf.
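Putting the two settings together, a minimal configuration sketch (assuming the appsflyer DirectOutputCommitter jar is on the classpath; the app name is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: assumes com.appsflyer.spark.DirectOutputCommitter is on the classpath.
val conf = new SparkConf()
  .setAppName("s3-write-example")  // placeholder app name
  // Commit task output directly to the destination instead of renaming it
  // out of a _temporary directory, which is slow on S3.
  .set("spark.hadoop.mapred.output.committer.class",
       "com.appsflyer.spark.DirectOutputCommitter")

val sc = new SparkContext(conf)

// The _SUCCESS marker must be disabled on the SparkContext's Hadoop
// configuration, not on the SparkConf above.
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
```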
I hope this helps.