Kinesis Streams and Flink

浪尽此生 提交于 2020-03-05 00:25:26

问题


I have a question regarding sharding data in a Kinesis stream. I would like to use a random partition key when sending user data to my kinesis stream so that the data in the shards is evenly distributed. For the sake of making this question simpler, I would then like to aggregate the user data by keying off of a userId in my Flink application.

My question is this: if the shards are randomly partitioned so that data for one userId is spread across multiple Kinesis shards, can Flink handle reading off of multiple shards and then redistributing the data so that all of the data for a single userId is streamed to the same aggregator task? Or, do I need to shard the kinesis stream by user id before it is consumed by Flink?


回答1:


... can Flink handle reading off of multiple shards and then redistributing the data so that all of the data for a single userId is streamed to the same aggregator task?

The effect of a keyBy(e -> e.userId), if you use Flink's DataStream API, is to redistribute all of the events so that all events for any particular userId will be streamed to the same downstream aggregator task.

Would each host read in data from a subset of the shards in the stream and would Flink then use the keyBy operator to pass messages of the same key to the host that will perform the actual aggregation?

Yes, that's right.

If, for example, you have 8 physical hosts, each providing 8 slots for running the job, then there will be 64 instances of the aggregator task, each of which will be responsible for a disjoint subset of the key space.

Assuming there are more than 64 shards available to read from, then each in each of the 64 tasks, the source will read from one or more shards, and then distribute the events it reads according to their userIds. Assuming the userIds are evenly spread across the shards, then each source instance will find that a few of the events it reads are for userIds it is been assigned to handle, and the local aggregator should be used. The rest of the events will each need to be sent out to one of the other 63 aggregators, depending on which worker is responsible for each userId.



来源:https://stackoverflow.com/questions/60233840/kinesis-streams-and-flink

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!