Multiple CoGroupByKey with the same key in Apache Beam

I have a situation where I need to join the main data stream (1.5TB) in my pipeline to 2 different datasets (4.92GB and 17.35GB). The key that I use to do the CoGroupByKey for

1 answer

遇见更好的自我

2021-01-25 18:44

Have you considered accessing the smaller datasets as View.asMap() or View.asMultimap() side inputs when processing the main input? The Dataflow runner has an optimized implementation of map and multimap side inputs which performs key lookups efficiently without loading the whole data into memory.
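A minimal sketch of that suggestion with the Beam Java SDK might look like the following. The class name `SideInputJoin`, the `merge` helper, the `"none"` placeholder, the output format, and the sample data are all hypothetical and chosen only for illustration; it also assumes both smaller datasets are already keyed as `KV<String, String>` and that the first one has exactly one value per key (which `View.asMap()` requires), while the second may have several values per key.

```java
import java.util.Collections;
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

public class SideInputJoin {

  // Hypothetical merge logic: look up the key in both side-input maps.
  // Kept static and runner-independent so it can be checked in isolation.
  static String merge(String key, String mainValue,
      Map<String, String> unique, Map<String, Iterable<String>> multi) {
    String a = unique.get(key);
    if (a == null) {
      a = "none"; // placeholder for a missing match (an assumption of this sketch)
    }
    Iterable<String> b = multi.get(key);
    if (b == null) {
      b = Collections.<String>emptyList();
    }
    return key + "|" + mainValue + "|" + a + "|" + String.join(",", b);
  }

  // Joins the main input against both smaller datasets without CoGroupByKey.
  static PCollection<String> enrich(
      PCollection<KV<String, String>> main,
      PCollection<KV<String, String>> smallUnique,
      PCollection<KV<String, String>> smallMulti) {
    // One value per key: materialize as a map side input.
    final PCollectionView<Map<String, String>> uniqueView =
        smallUnique.apply("AsMap", View.<String, String>asMap());
    // Possibly several values per key: materialize as a multimap side input.
    final PCollectionView<Map<String, Iterable<String>>> multiView =
        smallMulti.apply("AsMultimap", View.<String, String>asMultimap());

    return main.apply("JoinViaSideInputs",
        ParDo.of(new DoFn<KV<String, String>, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            KV<String, String> e = c.element();
            c.output(merge(e.getKey(), e.getValue(),
                c.sideInput(uniqueView), c.sideInput(multiView)));
          }
        }).withSideInputs(uniqueView, multiView));
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // Tiny in-memory stand-ins for the real datasets.
    PCollection<KV<String, String>> main =
        p.apply("Main", Create.of(KV.of("k1", "m1"), KV.of("k2", "m2")));
    PCollection<KV<String, String>> smallUnique =
        p.apply("SmallUnique", Create.of(KV.of("k1", "a")));
    PCollection<KV<String, String>> smallMulti =
        p.apply("SmallMulti", Create.of(KV.of("k1", "x"), KV.of("k1", "y")));
    enrich(main, smallUnique, smallMulti);
    p.run().waitUntilFinish();
  }
}
```

Because both lookups happen inside a single `ParDo`, the 1.5TB main input is never shuffled for the join; only the two smaller datasets are materialized as views. Whether this is faster than two `CoGroupByKey` steps depends on the runner and the data, so it is worth measuring on a sample first.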
