Question
We get one or two checkpoint failures every day while processing data. The data volume is low, under 10k, and our checkpoint interval is set to 2 minutes. (Processing is very slow because the job sinks the data to an external API endpoint at the end of the Flink job, and that endpoint takes some time to respond, so the total time is streaming the data plus sinking to the external API endpoint.)
The root issue is: checkpoints time out after 10 minutes. This happens because processing the data takes longer than 10 minutes, so the checkpoint times out. We could increase the parallelism to speed up processing, but if the data grows we would have to increase the parallelism again, so we don't want to go that way.
Suggested solution: I saw someone suggest setting a pause between the old and the new checkpoint, but my question is: if I set a pause, will the new checkpoint miss the state from the pause period?
Aim: How can we avoid this issue and record the correct state without missing any data?
[Screenshots not preserved: a failed checkpoint, a completed checkpoint, and a subtask that didn't respond.]
Thanks
Answer 1:
There are several related configuration variables you can set -- such as the checkpoint interval, the pause between checkpoints, and the number of concurrent checkpoints. No combination of these settings will result in data being skipped for checkpointing.
Setting a minimum pause between checkpoints means that Flink won't initiate a new checkpoint until some time has passed since the completion (or failure) of the previous checkpoint -- but this has no effect on the timeout.
Sounds like you should extend the timeout, which you can do like this:
env.getCheckpointConfig().setCheckpointTimeout(n);
where n is measured in milliseconds. See the section of the Flink docs on enabling and configuring checkpointing for more details.
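To put the settings mentioned above side by side, here is a minimal sketch of how they might be configured together. It assumes the Flink streaming Java API is on the classpath. The class name, the 30 minute timeout, the 1 minute pause, and the placeholder pipeline are all illustrative assumptions, not values taken from the question; pick a timeout that comfortably exceeds your slowest sink call.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical class name, for illustration only.
public class CheckpointTimeoutSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint interval of 2 minutes, as described in the question.
        env.enableCheckpointing(2 * 60 * 1000L);

        // Assumed value: allow each checkpoint 30 minutes before it times out,
        // instead of the 10 minutes seen in the question.
        env.getCheckpointConfig().setCheckpointTimeout(30 * 60 * 1000L);

        // Assumed value: wait at least 1 minute after a checkpoint completes
        // (or fails) before starting the next one. This does not change the timeout.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(60 * 1000L);

        // Only one checkpoint in flight at a time.
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);

        // Placeholder pipeline so the job has something to run;
        // replace with the real source and the sink to the external API.
        env.fromElements(1, 2, 3).print();

        env.execute("checkpoint-timeout-sketch");
    }
}
```

All four values are in milliseconds. The pause only delays when the next checkpoint starts, so raising the timeout is the setting that addresses the failures described in the question.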
Source: https://stackoverflow.com/questions/55857289/flink-checkpoint-failure-checkpoints-time-out-after-10-mins