Spark jobs finish but the application takes time to close

难免孤独 2020-12-10 13:45

Running a Spark job written in Scala. As expected, all jobs finish on time, but somehow some INFO logs keep printing for 20-25 minutes before the application stops.

Posting a few of them:

3 Answers
  • 2020-12-10 14:01

    I ended up upgrading my Spark version, and the issue was resolved.

  • 2020-12-10 14:10

    As I put in a comment, I recommend using the spark-csv package instead of sc.saveAsTextFile; there are no problems with writing directly to S3 using that package :)
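
    For reference, a minimal sketch of what that could look like on Spark 1.x with spark-csv on the classpath (the bucket and paths are placeholders, not from the question):

    import org.apache.spark.sql.SQLContext

    // Assumes spark-csv is available, e.g. --packages com.databricks:spark-csv_2.10:1.5.0
    val sqlContext = new SQLContext(sc)

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("s3n://my-bucket/input/")    // placeholder input path

    df.write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("s3n://my-bucket/output/")   // writes CSV straight to S3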

    I don't know if you use s3 or s3n, but maybe try to switch. I have experienced problems with using s3a on Spark 1.5.2 (EMR-4.2) where writes timed out all the time and switching back to s3 solved the problem, so it's worth a try.
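
    If you try that, the only change is the scheme in the output URI; for example (bucket name is a placeholder):

    val rdd = sc.parallelize(Seq("a", "b", "c"))

    // rdd.saveAsTextFile("s3a://my-bucket/output/")   // the connector that kept timing out for me
    rdd.saveAsTextFile("s3://my-bucket/output/")       // EMR's s3 filesystem worked reliably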

    A couple of other things that should speed up writes to S3 are to use the DirectOutputCommitter:

    conf.set("spark.hadoop.mapred.output.committer.class","com.appsflyer.spark.DirectOutputCommitter")
    

    and disabling generation of _SUCCESS files:

    sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
    

    Note that disabling _SUCCESS files has to be set on the hadoop configuration of the SparkContext and not on the SparkConf.
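
    Putting both settings together, a minimal Spark 1.x sketch could look like this (assuming the appsflyer DirectOutputCommitter jar is on the classpath; the output path is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    // The committer class can go on the SparkConf (spark.hadoop.* properties are
    // copied into the Hadoop configuration when the SparkContext is created)...
    val conf = new SparkConf()
      .setAppName("s3-write")
      .set("spark.hadoop.mapred.output.committer.class",
           "com.appsflyer.spark.DirectOutputCommitter")

    val sc = new SparkContext(conf)

    // ...while the _SUCCESS marker must be disabled on the SparkContext's
    // Hadoop configuration, as noted above.
    sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

    sc.parallelize(Seq("a", "b", "c")).saveAsTextFile("s3://my-bucket/output/")  // placeholder path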

    I hope this helps.

  • 2020-12-10 14:18

    I had the same kind of problem when writing files to S3. I use Spark 2.0, so here is updated code for the accepted answer.

    In Spark 2.0 you can use:

    val spark = SparkSession.builder().master("local[*]").appName("App_name").getOrCreate()
    
    spark.conf.set("spark.hadoop.mapred.output.committer.class","com.appsflyer.spark.DirectOutputCommitter")
    spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
    

    This solved my problem of the job getting stuck.
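
    If spark.conf.set does not propagate the Hadoop properties in your setup, an alternative sketch (my assumption, not verified on the asker's cluster) is to set them directly on the SparkContext's Hadoop configuration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("App_name").getOrCreate()

    // Set the committer and _SUCCESS-marker properties on the underlying
    // Hadoop configuration rather than on the runtime conf.
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    hadoopConf.set("mapred.output.committer.class", "com.appsflyer.spark.DirectOutputCommitter")
    hadoopConf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")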
