How to create groups of N elements from a PCollection in Apache Beam (Python)

忘了有多久 2021-02-04 20:57

I am trying to accomplish something like this: Batch PCollection in Beam/Dataflow

The answer in the above link is in Java, whereas the language I'm working with is Python.

1 answer
  •  离开以前
    2021-02-04 21:13

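    Assuming the grouping order is not important, you can simply buffer the elements inside a DoFn. The buffering happens per bundle, so the last group emitted from each bundle may contain fewer than N elements.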

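    import apache_beam as beam
    from apache_beam.io import ReadFromText
    from apache_beam.transforms.window import GlobalWindow
    from apache_beam.utils.windowed_value import WindowedValue

    class Group(beam.DoFn):
      def __init__(self, n):
        self._n = n
        self._buffer = []

      def start_bundle(self):
        # Start every bundle with an empty buffer.
        self._buffer = []

      def process(self, element):
        self._buffer.append(element)
        if len(self._buffer) == self._n:
          yield self._buffer
          self._buffer = []

      def finish_bundle(self):
        # Flush the leftover elements. finish_bundle must emit WindowedValue
        # objects; this assumes the default global window.
        if self._buffer:
          yield WindowedValue(
              self._buffer, GlobalWindow().max_timestamp(), [GlobalWindow()])
          self._buffer = []

    # p and known_args come from the surrounding pipeline setup.
    lines = (p
             | 'File reading' >> ReadFromText(known_args.input)
             | 'Group' >> beam.ParDo(Group(known_args.N))
             # ...
            )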
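
To see what the DoFn does to the elements of a single bundle, the buffering logic can be sketched in plain Python without Beam. This is only an illustration: `batch_bundle` is a hypothetical helper name, not part of the Beam API, and it assumes all elements arrive in one bundle, which a real runner does not guarantee.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batch_bundle(elements: Iterable[T], n: int) -> Iterator[List[T]]:
    """Hypothetical helper mirroring the DoFn's buffering for one bundle."""
    if n < 1:
        raise ValueError("n must be at least 1")
    buffer: List[T] = []
    for element in elements:
        # Same step as process(): collect, then emit once n elements are held.
        buffer.append(element)
        if len(buffer) == n:
            yield buffer
            buffer = []
    if buffer:
        # Same step as finish_bundle(): flush the incomplete final group.
        yield buffer


batches = list(batch_bundle(range(7), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Because a runner may split the input into several bundles, a real pipeline can produce more than one short group, one per bundle at most.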
    
