Multiple CoGroupByKey with same key in Apache Beam

谎友^ 2021-01-25 18:11

I have a situation where I need to join the main data stream (1.5TB) in my pipeline to 2 different datasets (4.92GB and 17.35GB). The key that I use to do the CoGroupByKey for

1 answer
  • 2021-01-25 18:44

    Have you considered accessing the smaller datasets as View.asMap() or View.asMultimap() side inputs when processing the main input? The Dataflow runner has an optimized implementation of map and multimap side inputs which performs key lookups efficiently without loading the whole data into memory.
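
    A minimal sketch of that suggestion with the Beam Java SDK, assuming string keys and values and that each key appears at most once in each small dataset (use View.asMultimap() otherwise). The class name SideInputJoin, the enrich helper, and the in-memory Create.of sample data are hypothetical stand-ins for the real sources; running it locally also assumes the direct runner is on the classpath.

```java
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

public class SideInputJoin {

  // Hypothetical helper: joins one main record with the values found for its key
  // in the two lookup maps (empty string when a key is missing).
  static String enrich(String key, String value, Map<String, String> a, Map<String, String> b) {
    return key + "," + value + "," + a.getOrDefault(key, "") + "," + b.getOrDefault(key, "");
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Stand-in for the large main stream.
    PCollection<KV<String, String>> main =
        p.apply("Main", Create.of(KV.of("k1", "m1"), KV.of("k2", "m2")));

    // Stand-ins for the two smaller datasets, exposed as map side inputs.
    PCollectionView<Map<String, String>> viewA =
        p.apply("SmallA", Create.of(KV.of("k1", "a1")))
            .apply("AsMapA", View.<String, String>asMap());
    PCollectionView<Map<String, String>> viewB =
        p.apply("SmallB", Create.of(KV.of("k2", "b2")))
            .apply("AsMapB", View.<String, String>asMap());

    // Look up both side inputs while processing each main element,
    // instead of running two CoGroupByKey steps.
    main.apply(
        "JoinWithSideInputs",
        ParDo.of(
                new DoFn<KV<String, String>, String>() {
                  @ProcessElement
                  public void processElement(ProcessContext c) {
                    KV<String, String> e = c.element();
                    c.output(
                        enrich(e.getKey(), e.getValue(), c.sideInput(viewA), c.sideInput(viewB)));
                  }
                })
            .withSideInputs(viewA, viewB));

    p.run().waitUntilFinish();
  }
}
```

    With this shape the main stream is never shuffled for the join; only the two smaller datasets are materialized as views.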
