Is checkpointing necessary in Spark Streaming?

Submitted by 冷暖自知 on 2019-12-11 03:04:57

Question


I have noticed that Spark Streaming examples also include code for checkpointing. My question is: how important is checkpointing? If it's there for fault tolerance, how often do faults happen in such streaming applications?


Answer 1:


It all depends on your use case. Suppose you are running a streaming job that just reads data from Kafka and counts the number of records. What would you do if your application crashed after a year or so?

  • If you don't have a backup/checkpoint, you will have to recompute the entire previous year's worth of data before you can resume counting.
  • If you have a backup/checkpoint, you can simply read the checkpoint data and resume almost immediately.
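The difference between these two cases can be sketched in plain Python, without Spark. This is only an illustration of the idea, not how Spark implements it: the list `records` stands in for the Kafka topic, and `count_records`, `crash_after`, and the JSON checkpoint file are all hypothetical names made up for the example.

```python
import json
import os
import tempfile


def count_records(records, checkpoint_path, crash_after=None):
    """Count records, saving progress so that a restart can resume.

    `crash_after` is a hypothetical knob: it simulates a crash once that
    many records have been processed in the current run.
    """
    # Resume from the last checkpoint if one exists, otherwise start from zero.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"offset": 0, "count": 0}

    processed = 0
    for _ in records[state["offset"]:]:
        if crash_after is not None and processed >= crash_after:
            raise RuntimeError("simulated crash")
        state["count"] += 1
        state["offset"] += 1
        processed += 1
        # Write to a temporary file and rename it, so a crash in the middle
        # of a write never leaves a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state["count"]


def demo():
    records = list(range(10))
    path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
    try:
        count_records(records, path, crash_after=6)  # first run dies part-way
    except RuntimeError:
        pass
    # The second run picks up at the saved offset instead of recounting everything.
    return count_records(records, path)
```

Saving after every record keeps the sketch short; a real job would normally checkpoint at an interval, trading a little recomputation after a crash for far fewer writes.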

Alternatively, if your streaming application just does Reads-Messages-From-Kafka >>> Transform >>> Insert-to-a-Database, you need not worry about it crashing. Even if it crashes, you can simply restart it without losing data.

Note: Checkpointing is the process of storing the current state of a Spark application.

As for how often faults happen, you can almost never predict an outage. In companies, for example:

  • There might be a power outage
  • The cluster undergoes regular maintenance or upgrades

Hope this helps.




Answer 2:


There are two cases:

  1. You are doing stateful operations, such as updateStateByKey. Then you must use checkpointing, because every state is saved. If no checkpoint directory is set, an exception is thrown.
  2. You are doing only windowed operations. Then yes, you can disable checkpointing (unless the window uses an inverse reduce function, as reduceByKeyAndWindow can, which also needs a checkpoint directory). However, I strongly recommend setting a checkpoint directory anyway.

If the driver is killed, you lose all your data and progress information. Checkpointing helps you recover the application from such situations.

Is a failure a normal situation? Of course! Imagine that you have a large cluster with many machines and many components in each machine. If one of these components fails, your application may fail too. When the connection to the driver is lost, your application fails. With checkpointing, you can just run the application again and it will recover its state.
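Both points, the checkpoint directory that updateStateByKey requires and recovery after a driver restart, can be sketched with the PySpark DStream API roughly as follows. Treat it as a sketch under assumptions rather than a reference setup: the socket source, host, port, application name, batch interval, and the helper names `update_count`, `create_context`, and `run` are all made up for illustration.

```python
def update_count(new_values, running_count):
    """Add this batch's values for a key to its running total.

    Spark passes None as `running_count` the first time a key is seen.
    """
    return sum(new_values) + (running_count or 0)


def create_context(checkpoint_dir, host="localhost", port=9999):
    # Imported lazily so update_count can be tried without a Spark installation.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="StatefulWordCount")  # hypothetical app name
    ssc = StreamingContext(sc, 10)   # 10-second batch interval, chosen arbitrarily
    ssc.checkpoint(checkpoint_dir)   # updateStateByKey fails without a checkpoint directory

    words = ssc.socketTextStream(host, port).flatMap(lambda line: line.split())
    counts = words.map(lambda word: (word, 1)).updateStateByKey(update_count)
    counts.pprint()
    return ssc


def run(checkpoint_dir):
    from pyspark.streaming import StreamingContext

    # Rebuild the context from checkpoint data if any exists;
    # otherwise call create_context to build a fresh one.
    ssc = StreamingContext.getOrCreate(
        checkpoint_dir, lambda: create_context(checkpoint_dir)
    )
    ssc.start()
    ssc.awaitTermination()
```

Calling `run` with a directory on fault-tolerant storage (an HDFS path, for instance) starts the job. After a crash, calling it again with the same directory lets `StreamingContext.getOrCreate` restore the saved state instead of starting from empty counts.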



Source: https://stackoverflow.com/questions/39599863/is-checkpointing-necessary-in-spark-streaming
