I am trying to accomplish something like this: Batch PCollection in Beam/Dataflow
The answer in the above link is in Java, whereas the language I'm working with is Python.
Assuming the grouping order is not important, you can just group inside a DoFn.
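import apache_beam as beam
from apache_beam.transforms.window import GlobalWindow
from apache_beam.utils.windowed_value import WindowedValue

class Group(beam.DoFn):
    """Batches elements into lists of up to n items per bundle."""
    def __init__(self, n):
        self._n = n
        self._buffer = []

    def process(self, element):
        self._buffer.append(element)
        if len(self._buffer) == self._n:
            yield list(self._buffer)
            self._buffer = []

    def finish_bundle(self):
        # Flush the partial batch left at the end of the bundle.
        # finish_bundle must emit WindowedValue objects; this assumes
        # the elements are in the global window.
        if self._buffer:
            window = GlobalWindow()
            yield WindowedValue(list(self._buffer), window.max_timestamp(), [window])
            self._buffer = []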
lines = (
    p
    | 'File reading' >> ReadFromText(known_args.input)
    | 'Group' >> beam.ParDo(Group(known_args.N))
    # ...
)
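One consequence worth spelling out: the buffer is flushed at the end of every bundle, so batches never span bundles and the last batch of each bundle may hold fewer than N elements. Below is a minimal plain-Python sketch of that behaviour, without Beam. The helper name `batch_bundle` and the two sample bundles are made up for illustration; in a real pipeline the runner decides where bundle boundaries fall.

```python
def batch_bundle(bundle, n):
    """Hypothetical helper: mimics Group.process plus finish_bundle for one bundle."""
    buffer = []
    for element in bundle:
        buffer.append(element)
        if len(buffer) == n:
            # Full batch, same as the yield in process().
            yield list(buffer)
            buffer = []
    if buffer:
        # Leftover partial batch, same as the flush in finish_bundle().
        yield list(buffer)


# Two made-up bundles; a real runner chooses the boundaries.
bundles = [[1, 2, 3, 4, 5], [6, 7, 8]]
batches = [batch for bundle in bundles for batch in batch_bundle(bundle, 2)]
print(batches)  # [[1, 2], [3, 4], [5], [6, 7], [8]]
```

So downstream code should treat N as an upper bound on the batch size, not as a guarantee.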