Spark Streaming checkpoint recovery is very, very slow

隐瞒了意图╮ 2021-02-12 14:26
  • Goal: Read from Kinesis and store the data into S3 in Parquet format via Spark Streaming (a minimal sketch of such a pipeline follows below).
  • Situation: The application runs fine initially, running batches of 1 hour, and th…
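
For context, here is a minimal sketch of the kind of pipeline described, assuming the spark-streaming-kinesis-asl connector; the stream name, region, and S3 path are placeholders, not taken from the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.{KinesisInitialPositions, KinesisInputDStream}

object KinesisToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("kinesis-to-parquet").getOrCreate()
    import spark.implicits._

    // 1-hour batches, matching the interval described in the question
    val ssc = new StreamingContext(spark.sparkContext, Seconds(3600))

    // Receiver-based Kinesis stream; names and region are hypothetical
    val stream = KinesisInputDStream.builder
      .streamingContext(ssc)
      .streamName("my-stream")
      .endpointUrl("https://kinesis.us-east-1.amazonaws.com")
      .regionName("us-east-1")
      .initialPosition(new KinesisInitialPositions.TrimHorizon)
      .checkpointAppName("kinesis-to-parquet")
      .checkpointInterval(Seconds(3600))
      .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
      .build()

    // Append each batch to S3 as Parquet
    stream.foreachRDD { rdd =>
      spark.createDataset(rdd.map(bytes => new String(bytes, "UTF-8")))
        .write.mode("append").parquet("s3a://my-bucket/events/") // hypothetical path
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```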
3 Answers
  • 2021-02-12 14:29

    Raised a Jira issue: https://issues.apache.org/jira/browse/SPARK-19304

    The issue is that we read more data per iteration than is required and then discard it. This can be avoided by adding a limit to the getRecords AWS call.

    Fix: https://github.com/apache/spark/pull/16842
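
    To illustrate the idea behind the fix: cap how many records each GetRecords request pulls back instead of draining the shard and discarding the excess. A sketch using the AWS SDK for Java v1 directly; the stream name, shard id, and the limit of 1000 are placeholders:

    ```scala
    import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
    import com.amazonaws.services.kinesis.model.{GetRecordsRequest, GetShardIteratorRequest}

    val client = AmazonKinesisClientBuilder.defaultClient()

    // Obtain a shard iterator for a hypothetical stream and shard
    val iteratorResult = client.getShardIterator(
      new GetShardIteratorRequest()
        .withStreamName("my-stream")
        .withShardId("shardId-000000000000")
        .withShardIteratorType("TRIM_HORIZON"))

    // Without a limit, each call may return far more records than the
    // recovering batch needs, all of which get thrown away; the limit
    // bounds the read to roughly what is actually required.
    val result = client.getRecords(
      new GetRecordsRequest()
        .withShardIterator(iteratorResult.getShardIterator)
        .withLimit(1000))
    ```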

  • 2021-02-12 14:39

    I had similar issues before; my application kept getting slower and slower.

    Try to release memory after using an RDD by calling rdd.unpersist(): https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html#unpersist(boolean)

    Or set spark.streaming.backpressure.enabled to true:

    http://spark.apache.org/docs/latest/streaming-programming-guide.html#setting-the-right-batch-interval

    http://spark.apache.org/docs/latest/streaming-programming-guide.html#requirements

    Also, check your locality settings; maybe too much data is moving around.
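
    A sketch combining both suggestions; the socket source, batch interval, and initial rate are placeholders, while the two config keys are real Spark Streaming settings:

    ```scala
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("backpressure-demo")
      // Let Spark adapt the ingestion rate to the processing rate
      .set("spark.streaming.backpressure.enabled", "true")
      // Optional cap on the rate while backpressure warms up
      .set("spark.streaming.backpressure.initialRate", "1000")

    val ssc = new StreamingContext(conf, Seconds(60))

    // Placeholder source standing in for the real input stream
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      rdd.cache()
      // ... use the RDD more than once here ...
      rdd.count()
      // Release the cached blocks eagerly instead of waiting for GC
      rdd.unpersist(blocking = false)
    }

    ssc.start()
    ssc.awaitTermination()
    ```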

  • 2021-02-12 14:43

    When a failed driver is restarted, the following occurs:

    1. Recover computation – The checkpointed information is used to restart the driver, reconstruct the contexts and restart all the receivers.
    2. Recover block metadata – The metadata of all the blocks that will be necessary to continue the processing will be recovered.
    3. Re-generate incomplete jobs – For the batches with processing that has not completed due to the failure, the RDDs and corresponding jobs are regenerated using the recovered block metadata.
    4. Read the block saved in the logs – When those jobs are executed, the block data is read directly from the write ahead logs. This recovers all the necessary data that were reliably saved to the logs.
    5. Resend unacknowledged data – The buffered data that was not saved to the log at the time of failure will be sent again by the source, as it had not been acknowledged by the receiver.

    Since all these steps are performed at the driver, your batch of 0 events takes so much time. This should happen with the first batch only; after that, things will be back to normal.
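
    These steps only kick in when the context is rebuilt from the checkpoint. A minimal sketch of the recovery-aware setup, assuming a hypothetical checkpoint directory, with the write-ahead log enabled so that step 4 has something to read:

    ```scala
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "s3a://my-bucket/checkpoints/" // hypothetical

    def createContext(): StreamingContext = {
      val conf = new SparkConf()
        .setAppName("recoverable-app")
        // Persist received blocks to a WAL so step 4 can replay them
        .set("spark.streaming.receiver.writeAheadLog.enable", "true")
      val ssc = new StreamingContext(conf, Seconds(3600))
      ssc.checkpoint(checkpointDir)
      // ... define the input streams and output operations here ...
      ssc
    }

    // On a clean start this calls createContext(); after a driver
    // failure it rebuilds the context from the checkpoint instead,
    // triggering the recovery steps 1-5 described above.
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
    ssc.start()
    ssc.awaitTermination()
    ```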

    Reference here.
