Spark Streaming checkpoint recovery is very, very slow

隐瞒了意图╮ 2021-02-12 14:26
  • Goal: Read from Kinesis and store the data in S3 in Parquet format via Spark Streaming.
  • Situation: Application runs fine initially, running batches of 1 hour and th
3 Answers
  •  深忆病人
    2021-02-12 14:29

    Raised a Jira issue: https://issues.apache.org/jira/browse/SPARK-19304

    The slowdown occurs because we read more data per iteration than is required and then discard the excess. This can be avoided by adding a limit to the AWS Kinesis getRecords call.

    Fix: https://github.com/apache/spark/pull/16842
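
The effect described in this answer can be illustrated with a toy simulation. This is only a sketch under stated assumptions: FakeShard, its default_batch size, and recover_range are hypothetical names made up for illustration, and none of this is Spark's or the AWS SDK's actual code. It only shows why capping each fetch at the number of records still needed avoids reading data that is then thrown away.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FakeShard:
    """In-memory stand-in for a stream shard (hypothetical, not the AWS client)."""
    records: List[int]            # sequence numbers in ascending order
    default_batch: int = 10_000   # assumed batch size when no limit is passed

    def get_records(self, start: int, limit: Optional[int] = None) -> List[int]:
        # Return up to `limit` records starting at index `start`.
        n = self.default_batch if limit is None else limit
        return self.records[start:start + n]


def recover_range(shard: FakeShard, first: int, last: int,
                  per_call_limit: Optional[int]) -> Tuple[List[int], int]:
    """Re-read records first..last (inclusive) during recovery.

    Returns the records kept and the total number of records fetched.
    """
    kept: List[int] = []
    fetched = 0
    pos = first
    while pos <= last:
        # With a limit, never ask for more than what is still needed.
        limit = None if per_call_limit is None else min(per_call_limit, last - pos + 1)
        batch = shard.get_records(pos, limit)
        if not batch:
            break
        fetched += len(batch)
        # Anything past `last` was fetched for nothing and is discarded.
        kept.extend(r for r in batch if r <= last)
        pos += len(batch)
    return kept, fetched


shard = FakeShard(records=list(range(50_000)))
_, fetched_unlimited = recover_range(shard, 100, 199, per_call_limit=None)
_, fetched_limited = recover_range(shard, 100, 199, per_call_limit=100)
print(fetched_unlimited, fetched_limited)
```

In this sketch both calls keep the same 100 records, but the unlimited call fetches a full default batch to get them, while the limited call fetches only what the range needs.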
