GroupIntoBatches for non-KV elements

Asked by 伪装坚强ぢ on 2021-01-16 05:38

According to the Apache Beam 2.0.0 SDK documentation, GroupIntoBatches works only with KV collections.

My dataset contains only values, and there is no need for keys. Is there a way to use GroupIntoBatches with a plain PCollection of values?

1 answer
  • Answered 2021-01-16 06:02

    It is required to provide KV inputs to GroupIntoBatches because the transform is implemented using state and timers, which are per key-and-window.

    For each key+window pair, state and timers necessarily execute serially (or observably so). You have to manually express the available parallelism by providing keys (and windows, though no runner that I know of parallelizes over windows today). The two most common approaches are:

    1. Use some natural key like a user ID
    2. Choose some fixed number of shards and key randomly. This can be harder to tune: you need enough shards to get enough parallelism, but each shard must receive enough data for GroupIntoBatches to actually be useful.
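    A minimal stand-alone sketch of the second approach, assuming illustrative parameter names (`num_shards`, `batch_size` are not Beam APIs). In a real pipeline this keying step would be a `beam.Map` attaching a random key, followed by `beam.GroupIntoBatches`; here the same logic is simulated in plain Python to show how shard count bounds parallelism:

    ```python
    import random

    def shard_and_batch(values, num_shards=4, batch_size=3, seed=42):
        """Simulate: beam.Map(lambda v: (random.randrange(num_shards), v))
        followed by GroupIntoBatches(batch_size), which batches per key."""
        rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
        # Step 1: key each value by a random shard. In Beam, the number of
        # distinct keys is the upper bound on parallelism for the stateful
        # GroupIntoBatches that follows.
        per_shard = {shard: [] for shard in range(num_shards)}
        for v in values:
            per_shard[rng.randrange(num_shards)].append(v)
        # Step 2: within each shard, emit batches of at most batch_size
        # elements, as GroupIntoBatches does per key (and window).
        batches = []
        for shard, shard_values in per_shard.items():
            for i in range(0, len(shard_values), batch_size):
                batches.append((shard, shard_values[i:i + batch_size]))
        return batches
    ```

    Note that calling this with `num_shards=1` reproduces the single-dummy-key pitfall: every element lands on one key, so all batching for the collection runs serially.
    
    
    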

    Adding one dummy key to all elements, as in your snippet, will prevent the transform from executing in parallel at all. This is similar to the discussion in "Stateful indexing causes ParDo to be run single-threaded on Dataflow Runner".
