Question
I have a Python Apache Beam batch pipeline running on Dataflow (Runner v2) that reads in rows from a CSV file, where each row is a simple key,value pair. I want to group these elements by key into batches of 10 values each, and then feed each batch into the following ParDo transform to be wrapped in another key for partitioning. This (should) effectively give me precise control over the distribution of elements into each partition.
import apache_beam as beam


class ParseExamplesDoFn(beam.DoFn):
    def process(self, row):
        # Split a "key,value" CSV row into a (key, value) tuple.
        components = row.split(',')
        yield components[0], components[1]


class WrapKeysDoFn(beam.DoFn):
    def process(self, batch):
        # batch is a (key, values) pair produced by GroupIntoBatches.
        # Route the first value to partition 1, the second to partition 2,
        # and everything else to partition 0.
        for i, value in enumerate(batch[1]):
            if i == 0:
                yield 1, (batch[0], value)
            elif i == 1:
                yield 2, (batch[0], value)
            else:
                yield 0, (batch[0], value)
part0, part1, part2 = (p
    | 'Read file' >> beam.io.textio.ReadFromText('gs://foo/bar.csv')
    | 'Parse rows' >> beam.ParDo(ParseExamplesDoFn())
    | 'Create batches' >> beam.GroupIntoBatches(10)
    | 'Wrap keys' >> beam.ParDo(WrapKeysDoFn())
    | 'Partition' >> beam.Partition(lambda x, n: x[0], 3))
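For what it's worth, the same shape can be run locally with in-memory data to count what lands in each partition. This is just a sketch reusing the two DoFns above; the 13 single-key rows are made up for illustration, and the DirectRunner may not reproduce Dataflow's runtime bundling:

import apache_beam as beam

# Made-up stand-in for the CSV rows: one key, 13 values.
rows = ['k,%d' % i for i in range(13)]

with beam.Pipeline() as p:  # DirectRunner by default
    part0, part1, part2 = (p
        | 'Create rows' >> beam.Create(rows)
        | 'Parse rows' >> beam.ParDo(ParseExamplesDoFn())
        | 'Create batches' >> beam.GroupIntoBatches(10)
        | 'Wrap keys' >> beam.ParDo(WrapKeysDoFn())
        | 'Partition' >> beam.Partition(lambda x, n: x[0], 3))
    # Count the elements landing in each partition.
    for name, part in [('part0', part0), ('part1', part1), ('part2', part2)]:
        (part
            | 'Count %s' % name >> beam.combiners.Count.Globally()
            | 'Print %s' % name >> beam.Map(lambda c, n=name: print(n, c)))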
The idea is that the first value in each batch goes to part1, the second to part2, and the next eight to part0, repeating until the input is exhausted. If I had a key with 13 values, I would expect 9 of them to go to part0, 2 to part1, and 2 to part2.
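To make that expectation concrete, here is the same arithmetic as plain Python (no Beam involved), applying the WrapKeysDoFn routing to the two batches a 13-value key should produce, one of 10 and one of 3:

from collections import Counter

# A key with 13 values should arrive as one batch of 10 plus one batch of 3.
batches = [('k', list(range(10))), ('k', list(range(3)))]

counts = Counter()
for key, values in batches:
    for i, _ in enumerate(values):
        # Same routing as WrapKeysDoFn: first -> part1, second -> part2, rest -> part0.
        counts[1 if i == 0 else 2 if i == 1 else 0] += 1

print(counts)  # Counter({0: 9, 1: 2, 2: 2}), i.e. the expected 9-2-2 split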
This is not what actually happens. I may have a misunderstanding of how PCollections are fed into PTransforms, but my expectation was that each batch created by GroupIntoBatches would be maintained when fed to WrapKeysDoFn. Instead of the split being 9-2-2, it varies between 5-4-4, 7-3-3, 6-4-3, and 8-3-2. This suggests to me that the batches are being further split on input to WrapKeysDoFn, and that the splitting is determined at runtime in an effort to maximize throughput through parallelization. Each batch of 10 may thus be broken up into variable-sized mini-batches, which disrupts the intended logic of WrapKeysDoFn.
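Running the same loop over hypothetical runtime-chosen mini-batches shows how the observed splits would fall out under that theory; the sizes 7, 3, and 3 below are picked purely for illustration:

from collections import Counter

# If the runner re-bundled the 13 values into mini-batches of 7, 3 and 3,
# each mini-batch would route its first value to part1 and its second to part2.
mini_batches = [('k', list(range(7))), ('k', list(range(3))), ('k', list(range(3)))]

counts = Counter()
for key, values in mini_batches:
    for i, _ in enumerate(values):
        counts[1 if i == 0 else 2 if i == 1 else 0] += 1

print(counts)  # Counter({0: 7, 1: 3, 2: 3}), matching the observed 7-3-3 split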
I checked that GroupIntoBatches was actually batching correctly by writing its output to a text file. The results were exactly what I expected: every batch held a single key and was 10 elements long, except for the last batch per key, which held whatever elements remained. So the issue is not with the GroupIntoBatches transform itself, but rather with what happens next.
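That check was just a tap on the intermediate PCollection, along these lines (assuming batches names the output of the 'Create batches' step; the output path here is made up):

# Hypothetical debug tap on the GroupIntoBatches output.
(batches
    | 'Format batches' >> beam.Map(lambda kv: '%s: %s' % (kv[0], list(kv[1])))
    | 'Write batches' >> beam.io.WriteToText('gs://foo/batches'))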
Out of curiosity, I went and replaced the GroupIntoBatches(10) transform with GroupByKey. This caused the pipeline to behave as expected, where the splits looked like 8719-1-1, 1327-1-1, 2357-1-1, etc.
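Concretely, that experiment was just the one-line swap below. Since GroupByKey hands WrapKeysDoFn each key's full value list in a single process() call, exactly one value per key lands in part1 and one in part2, with the remainder in part0:

part0, part1, part2 = (p
    | 'Read file' >> beam.io.textio.ReadFromText('gs://foo/bar.csv')
    | 'Parse rows' >> beam.ParDo(ParseExamplesDoFn())
    | 'Group by key' >> beam.GroupByKey()  # was: beam.GroupIntoBatches(10)
    | 'Wrap keys' >> beam.ParDo(WrapKeysDoFn())
    | 'Partition' >> beam.Partition(lambda x, n: x[0], 3))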
Why is it that the list values of GroupByKey are maintained, while those from GroupIntoBatches are not? Am I missing something?
Source: https://stackoverflow.com/questions/65526982/why-does-groupintobatches-output-get-subdivided-when-input-to-next-transform