Flink Kinesis Consumer not storing last successfully processed sequence nos

问题

We are using Flink Kinesis Consumer to consume data from Kinesis stream into our Flink application.

KCL library uses a DynamoDB table to store last successfully processed Kinesis stream sequence nos. so that the next time application starts, it resumes from where it left off.

But, it seems that Flink Kinesis Consumer does not maintain any such sequence nos. in any persistent store. As a result, we need to rely upon ShardIteratortype (trim_horizen, latest, etc) to decide where to resume Flink application processing upon application restart.

A possible solution to this could be to rely on Flink checkpointing mechanism, but that only works when application resumes upon failure, and not when the application has been deliberately cancelled and is needed to be restarted from the last successfully consumed Kinesis stream sequence no.

Do we need to store these last successfully consumed sequence nos ourselves ?

回答1:

Best practice with Flink is to use checkpoints and savepoints, as these create consistent snapshots that contain offsets into your message queues (in this case, Kinesis stream sequence numbers) together with all of the state throughout the rest of the job graph that resulted from having consumed the data up to those offsets. This makes it possible to recover or restart without any loss or duplication of data.

Flink's checkpoints are snapshots taken automatically by Flink itself for the purpose of recovery from failures, and are in a format optimized for rapid restoration. Savepoints use the same underlying snapshot mechanism, but are triggered manually, and their format is more concerned about operational flexibility than performance.

Savepoints are what you are looking for. In particular, cancel with savepoint and resume from savepoint are very useful.

Another option is to use retained checkpoints with ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION.

回答2:

To add to David's response, I'd like to explain the reasoning behind not storing sequence numbers.

Any kind of offsets committing into the source system would limit the checkpointing/savepointing feature only to fault tolerance. That is, only the latest checkpoint/savepoint would be able to recover.

However, Flink actually supports to jump back to a previous checkpoint/savepoint. Consider an application upgrade. You make a savepoint before, upgrade and let it run for a couple of minutes where it creates a few checkpoints. Then, you discover a critical bug. You would like to rollback to the savepoint that you have taken and discard all checkpoints.

Now if Flink commits the source offsets only to the source systems, we would not be able to replay the data between now and the restored savepoint. So, Flink needs to store the offsets in the savepoint itself as David pointed out. At this point, additionally committing to source system does not yield any benefit and is confusing while restoring to a previous savepoint/checkpoint.

Do you see any benefit in storing the offsets additionally?

来源：https://stackoverflow.com/questions/54825364/flink-kinesis-consumer-not-storing-last-successfully-processed-sequence-nos

标签

apache-flink

flink-streaming

amazon-kinesis