Spark Streaming checkpoint recovery is very, very slow

隐瞒了意图╮ 2021-02-12 14:26
  • Goal: Read from Kinesis and store the data into S3 in Parquet format via Spark Streaming (a minimal sketch of such a pipeline follows below).
  • Situation: The application runs fine initially, running batches of 1 hour, and th…
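
For context, here is a minimal sketch of the kind of pipeline described, assuming the spark-streaming-kinesis-asl connector; the stream name, region, and S3 path are placeholders, not taken from the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.{KinesisInitialPositions, KinesisInputDStream}

object KinesisToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("kinesis-to-parquet").getOrCreate()
    import spark.implicits._

    // 1-hour batches, matching the interval described in the question
    val ssc = new StreamingContext(spark.sparkContext, Seconds(3600))

    // Receiver-based Kinesis stream; names and region are hypothetical
    val stream = KinesisInputDStream.builder
      .streamingContext(ssc)
      .streamName("my-stream")
      .endpointUrl("https://kinesis.us-east-1.amazonaws.com")
      .regionName("us-east-1")
      .initialPosition(new KinesisInitialPositions.TrimHorizon)
      .checkpointAppName("kinesis-to-parquet")
      .checkpointInterval(Seconds(3600))
      .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
      .build()

    // Append each batch to S3 as Parquet
    stream.foreachRDD { rdd =>
      spark.createDataset(rdd.map(bytes => new String(bytes, "UTF-8")))
        .write.mode("append").parquet("s3a://my-bucket/events/") // hypothetical path
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```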
3 Answers
  • 2021-02-12 14:29

    Raised a Jira issue: https://issues.apache.org/jira/browse/SPARK-19304

    The issue is that we read more data per iteration than is required and then discard it. This can be avoided by adding a limit to the getRecords AWS call.

    Fix: https://github.com/apache/spark/pull/16842
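
    To illustrate the idea behind the fix: cap how many records each GetRecords request pulls back instead of draining the shard and discarding the excess. A sketch using the AWS SDK for Java v1 directly; the stream name, shard id, and the limit of 1000 are placeholders:

    ```scala
    import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
    import com.amazonaws.services.kinesis.model.{GetRecordsRequest, GetShardIteratorRequest}

    val client = AmazonKinesisClientBuilder.defaultClient()

    // Obtain a shard iterator for a hypothetical stream and shard
    val iteratorResult = client.getShardIterator(
      new GetShardIteratorRequest()
        .withStreamName("my-stream")
        .withShardId("shardId-000000000000")
        .withShardIteratorType("TRIM_HORIZON"))

    // Without a limit, each call may return far more records than the
    // recovering batch needs, all of which get thrown away; the limit
    // bounds the read to roughly what is actually required.
    val result = client.getRecords(
      new GetRecordsRequest()
        .withShardIterator(iteratorResult.getShardIterator)
        .withLimit(1000))
    ```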

  • 2021-02-12 14:39

    I had similar issues before; my application kept getting slower and slower.

    Try to release memory after using an RDD by calling rdd.unpersist(): https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html#unpersist(boolean)

    Or set spark.streaming.backpressure.enabled to true:

    http://spark.apache.org/docs/latest/streaming-programming-guide.html#setting-the-right-batch-interval

    http://spark.apache.org/docs/latest/streaming-programming-guide.html#requirements

    Also, check your locality settings; maybe too much data is moving around.
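
    A sketch combining both suggestions; the socket source, batch interval, and initial rate are placeholders, while the two config keys are real Spark Streaming settings:

    ```scala
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("backpressure-demo")
      // Let Spark adapt the ingestion rate to the processing rate
      .set("spark.streaming.backpressure.enabled", "true")
      // Optional cap on the rate while backpressure warms up
      .set("spark.streaming.backpressure.initialRate", "1000")

    val ssc = new StreamingContext(conf, Seconds(60))

    // Placeholder source standing in for the real input stream
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      rdd.cache()
      // ... use the RDD more than once here ...
      rdd.count()
      // Release the cached blocks eagerly instead of waiting for GC
      rdd.unpersist(blocking = false)
    }

    ssc.start()
    ssc.awaitTermination()
    ```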

  • 2021-02-12 14:43

    When a failed driver is restarted, the following occurs:

    1. Recover computation – The checkpointed information is used to restart the driver, reconstruct the contexts and restart all the receivers.
    2. Recover block metadata – The metadata of all the blocks that will be necessary to continue the processing will be recovered.
    3. Re-generate incomplete jobs – For the batches with processing that has not completed due to the failure, the RDDs and corresponding jobs are regenerated using the recovered block metadata.
    4. Read the block saved in the logs – When those jobs are executed, the block data is read directly from the write ahead logs. This recovers all the necessary data that were reliably saved to the logs.
    5. Resend unacknowledged data – The buffered data that was not saved to the log at the time of failure will be sent again by the source, as it had not been acknowledged by the receiver.

    Since all these steps are performed at the driver, your batch of 0 events takes so much time. This should happen with the first batch only; after that, things will be back to normal.
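
    These steps only kick in when the context is rebuilt from the checkpoint. A minimal sketch of the recovery-aware setup, assuming a hypothetical checkpoint directory, with the write-ahead log enabled so that step 4 has something to read:

    ```scala
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "s3a://my-bucket/checkpoints/" // hypothetical

    def createContext(): StreamingContext = {
      val conf = new SparkConf()
        .setAppName("recoverable-app")
        // Persist received blocks to a WAL so step 4 can replay them
        .set("spark.streaming.receiver.writeAheadLog.enable", "true")
      val ssc = new StreamingContext(conf, Seconds(3600))
      ssc.checkpoint(checkpointDir)
      // ... define the input streams and output operations here ...
      ssc
    }

    // On a clean start this calls createContext(); after a driver
    // failure it rebuilds the context from the checkpoint instead,
    // triggering the recovery steps 1-5 described above.
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
    ssc.start()
    ssc.awaitTermination()
    ```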

    Reference here.
