Question
I have a Python Apache Beam batch pipeline running on Dataflow (Runner v2) that reads in rows from a CSV file, where each row is a simple key,value pair. I want to group these elements by key into batches of 10 values each, and then feed each batch into the following ParDo transform to be wrapped in another key for partitioning. This (should) effectively give me precise control over the distribution of elements into each partition.
import apache_beam as beam


class ParseExamplesDoFn(beam.DoFn):
    def process(self, row):
        # Split a "key,value" CSV row into a (key, value) tuple.
        components = row.split(',')
        yield components[0], components[1]


class WrapKeysDoFn(beam.DoFn):
    def process(self, batch):
        # batch is a (key, values) pair produced by GroupIntoBatches.
        # Route the first value to partition 1, the second to partition 2,
        # and everything else to partition 0.
        for i, value in enumerate(batch[1]):
            if i == 0:
                yield 1, (batch[0], value)
            elif i == 1:
                yield 2, (batch[0], value)
            else:
                yield 0, (batch[0], value)
part0, part1, part2 = (p
    | 'Read file' >> beam.io.textio.ReadFromText('gs://foo/bar.csv')
    | 'Parse rows' >> beam.ParDo(ParseExamplesDoFn())
    | 'Create batches' >> beam.GroupIntoBatches(10)
    | 'Wrap keys' >> beam.ParDo(WrapKeysDoFn())
    | 'Partition' >> beam.Partition(lambda x, n: x[0], 3))
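For what it's worth, the same shape can be run locally with in-memory data to count what lands in each partition. This is just a sketch reusing the two DoFns above; the 13 single-key rows are made up for illustration, and the DirectRunner may not reproduce Dataflow's runtime bundling:

import apache_beam as beam

# Made-up stand-in for the CSV rows: one key, 13 values.
rows = ['k,%d' % i for i in range(13)]

with beam.Pipeline() as p:  # DirectRunner by default
    part0, part1, part2 = (p
        | 'Create rows' >> beam.Create(rows)
        | 'Parse rows' >> beam.ParDo(ParseExamplesDoFn())
        | 'Create batches' >> beam.GroupIntoBatches(10)
        | 'Wrap keys' >> beam.ParDo(WrapKeysDoFn())
        | 'Partition' >> beam.Partition(lambda x, n: x[0], 3))
    # Count the elements landing in each partition.
    for name, part in [('part0', part0), ('part1', part1), ('part2', part2)]:
        (part
            | 'Count %s' % name >> beam.combiners.Count.Globally()
            | 'Print %s' % name >> beam.Map(lambda c, n=name: print(n, c)))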
The idea is that the first value in each batch goes to part1, the second to part2, and the next eight to part0, repeating until the input is exhausted. If I had a key with 13 values, I would expect 9 of them to go to part0, 2 to part1, and 2 to part2.
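To make that expectation concrete, here is the same arithmetic as plain Python (no Beam involved), applying the WrapKeysDoFn routing to the two batches a 13-value key should produce, one of 10 and one of 3:

from collections import Counter

# A key with 13 values should arrive as one batch of 10 plus one batch of 3.
batches = [('k', list(range(10))), ('k', list(range(3)))]

counts = Counter()
for key, values in batches:
    for i, _ in enumerate(values):
        # Same routing as WrapKeysDoFn: first -> part1, second -> part2, rest -> part0.
        counts[1 if i == 0 else 2 if i == 1 else 0] += 1

print(counts)  # Counter({0: 9, 1: 2, 2: 2}), i.e. the expected 9-2-2 split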
This is not what actually happens. I may have a misunderstanding of how PCollections are fed into PTransforms, but my expectation was that each batch created by GroupIntoBatches would be maintained when fed to WrapKeysDoFn. Instead of the split being 9-2-2, it varies between 5-4-4, 7-3-3, 6-4-3, and 8-3-2. This suggests to me that the batches are being further split on input to WrapKeysDoFn, and that the splitting is determined at runtime in an effort to maximize throughput through parallelization. Each batch of 10 may thus be broken up into variable-sized mini-batches, which disrupts the intended logic of WrapKeysDoFn.
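Running the same loop over hypothetical runtime-chosen mini-batches shows how the observed splits would fall out under that theory; the sizes 7, 3, and 3 below are picked purely for illustration:

from collections import Counter

# If the runner re-bundled the 13 values into mini-batches of 7, 3 and 3,
# each mini-batch would route its first value to part1 and its second to part2.
mini_batches = [('k', list(range(7))), ('k', list(range(3))), ('k', list(range(3)))]

counts = Counter()
for key, values in mini_batches:
    for i, _ in enumerate(values):
        counts[1 if i == 0 else 2 if i == 1 else 0] += 1

print(counts)  # Counter({0: 7, 1: 3, 2: 3}), matching the observed 7-3-3 split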
I checked that GroupIntoBatches was actually batching correctly by writing its output to a text file. The results were exactly what I expected: every batch held a single key and was 10 elements long, except for the last batch per key, which held whatever elements remained. So the issue is not with the GroupIntoBatches transform itself, but rather with what happens next.
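That check was just a tap on the intermediate PCollection, along these lines (assuming batches names the output of the 'Create batches' step; the output path here is made up):

# Hypothetical debug tap on the GroupIntoBatches output.
(batches
    | 'Format batches' >> beam.Map(lambda kv: '%s: %s' % (kv[0], list(kv[1])))
    | 'Write batches' >> beam.io.WriteToText('gs://foo/batches'))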
Out of curiosity, I went and replaced the GroupIntoBatches(10) transform with GroupByKey. This caused the pipeline to behave as expected, where the splits looked like 8719-1-1, 1327-1-1, 2357-1-1, etc.
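Concretely, that experiment was just the one-line swap below. Since GroupByKey hands WrapKeysDoFn each key's full value list in a single process() call, exactly one value per key lands in part1 and one in part2, with the remainder in part0:

part0, part1, part2 = (p
    | 'Read file' >> beam.io.textio.ReadFromText('gs://foo/bar.csv')
    | 'Parse rows' >> beam.ParDo(ParseExamplesDoFn())
    | 'Group by key' >> beam.GroupByKey()  # was: beam.GroupIntoBatches(10)
    | 'Wrap keys' >> beam.ParDo(WrapKeysDoFn())
    | 'Partition' >> beam.Partition(lambda x, n: x[0], 3))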
Why is it that the list values of GroupByKey are maintained, while those from GroupIntoBatches are not? Am I missing something?
Source: https://stackoverflow.com/questions/65526982/why-does-groupintobatches-output-get-subdivided-when-input-to-next-transform