Question
I'm trying to understand the Combine transform in an Apache Beam pipeline.
Consider the following example pipeline:
import datetime
import logging

import apache_beam as beam
from apache_beam import WindowInto
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows


def test_combine(data):
    # Log what the combine function receives on each call.
    logging.info('test combine')
    logging.info(type(data))
    logging.info(data)
    return [1, 2, 3]


def run():
    logging.info('start pipeline')
    pipeline_options = PipelineOptions(
        None, streaming=True, save_main_session=True,
    )
    p = beam.Pipeline(options=pipeline_options)
    data = p | beam.Create([
        {'id': '1', 'ts': datetime.datetime.utcnow()},
        {'id': '2', 'ts': datetime.datetime.utcnow()},
        {'id': '3', 'ts': datetime.datetime.utcnow()}
    ])
    purchase_paths = (
        data
        | WindowInto(FixedWindows(10))
        | beam.CombineGlobally(test_combine).without_defaults()
    )
    result = p.run()
    result.wait_until_finish()
    logging.info('end pipeline')


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
It generates the following logging output:
INFO:root:test combine
INFO:root:<class 'list'>
INFO:root:[{'id': '1', 'ts': datetime.datetime(2020, 8, 3, 19, 22, 53, 193363)}, {'id': '2', 'ts': datetime.datetime(2020, 8, 3, 19, 22, 53, 193366)}, {'id': '3', 'ts': datetime.datetime(2020, 8, 3, 19, 22, 53, 193367)}]
INFO:root:test combine
INFO:root:<class 'apache_beam.transforms.core._ReiterableChain'>
INFO:root:<apache_beam.transforms.core._ReiterableChain object at 0x1210faf50>
INFO:root:test combine
INFO:root:<class 'list'>
INFO:root:[[1, 2, 3]]
INFO:root:end pipeline
Why is the combine function called three times, with a different input each time? In the last call it seems to receive its own return value as input.
Update
I had misunderstood how the combiner works. The documentation says:
The combining function should be commutative and associative, as the function is not necessarily invoked exactly once on all values with a given key
Indeed, the output of the combiner can be fed back into the combiner as input, to be aggregated with further items of the PCollection. The output of the combiner therefore needs to have the same format as the elements it receives as input.
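To make the point about matching formats concrete, here is a minimal pure-Python sketch (no Beam involved) that applies a combine function in two stages, the way a runner may do. The helper `combine_purchases` and the sample values are hypothetical; the function only mirrors the shape used in the updated example: it takes an iterable of dicts and returns a single dict of the same shape.

```python
def combine_purchases(items):
    """Combine an iterable of {'id', 'ts'} dicts into one dict of the same shape."""
    items = list(items)  # materialize once; the input may be any iterable
    return {
        'id': '+'.join(d['id'] for d in items),
        'ts': max(d['ts'] for d in items),
    }


# Hypothetical sample elements.
elements = [
    {'id': '1', 'ts': 100},
    {'id': '2', 'ts': 105},
    {'id': '3', 'ts': 103},
]

# One call over all elements.
direct = combine_purchases(elements)

# Two stages: combine the first two elements, then feed that partial
# result back in together with the remaining element.
partial = combine_purchases(elements[:2])
staged = combine_purchases([partial, elements[2]])

print(direct)
print(staged)
```

Because the partial result is itself a valid input element, both paths agree here. Joining ids is order-sensitive, though, so combining in a different order could yield a string such as '3+1+2'; a strictly commutative function would not depend on the order.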
Also, as Inigo pointed out, I needed to set the timestamp of each element in the PCollection so that the windowing works properly.
This is the updated example:
import logging
import time

import apache_beam as beam
from apache_beam import WindowInto
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

# Counts how often the combine function is invoked.
combine_count = 0


def test_combine(data):
    global combine_count
    combine_count += 1
    logging.info(f'test combine: {combine_count}')
    logging.info(f'input: {list(data)}')
    # Return a dict of the same shape as the input elements, so that the
    # result can be fed back into this function.
    combined_id = '+'.join([d['id'] for d in data])
    combined_ts = max([d['ts'] for d in data])
    combined = {'id': combined_id, 'ts': combined_ts}
    logging.info(f'output: {combined}')
    return combined


def run():
    logging.info('start pipeline')
    pipeline_options = PipelineOptions(
        None, streaming=True, save_main_session=True,
    )
    p = beam.Pipeline(options=pipeline_options)
    ts = int(time.time())
    data = p | beam.Create([
        {'id': '1', 'ts': ts},
        {'id': '2', 'ts': ts + 5},
        {'id': '3', 'ts': ts + 12}
    ])
    purchase_paths = (
        data
        # Attach each element's 'ts' field as its event timestamp.
        | 'With timestamps' >> beam.Map(lambda x: beam.window.TimestampedValue(x, x['ts']))
        | WindowInto(FixedWindows(10))
        | beam.CombineGlobally(test_combine).without_defaults()
    )
    result = p.run()
    result.wait_until_finish()
    logging.info('end pipeline')


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
The output of this example looks like this:
INFO:root:test combine: 1
INFO:root:input: [{'id': '2', 'ts': 1596791192}, {'id': '3', 'ts': 1596791199}]
INFO:root:output: {'id': '2+3', 'ts': 1596791199}
INFO:apache_beam.runners.portability.fn_api_runner.fn_runner:Running (((CombineGlobally(test_combine)/CombinePerKey/Group/Read)+(CombineGlobally(test_combine)/CombinePerKey/Merge))+(CombineGlobally(test_combine)/CombinePerKey/ExtractOutputs))+(ref_AppliedPTransform_CombineGlobally(test_combine)/UnKey_28)
INFO:root:test combine: 2
INFO:root:input: [{'id': '1', 'ts': 1596791187}]
INFO:root:output: {'id': '1', 'ts': 1596791187}
INFO:root:test combine: 3
INFO:root:input: [{'id': '1', 'ts': 1596791187}]
INFO:root:output: {'id': '1', 'ts': 1596791187}
INFO:root:test combine: 4
INFO:root:input: [{'id': '2+3', 'ts': 1596791199}]
INFO:root:output: {'id': '2+3', 'ts': 1596791199}
INFO:root:test combine: 5
INFO:root:input: [{'id': '2+3', 'ts': 1596791199}]
INFO:root:output: {'id': '2+3', 'ts': 1596791199}
INFO:root:end pipeline
I still don't fully understand why the combiner is called that many times, but according to the documentation this can happen.
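Part of the grouping in this output can be checked by hand. A fixed window of size 10 with no offset covers the range [k*10, (k+1)*10), so the start of an element's window is its timestamp rounded down to a multiple of 10. The sketch below applies that arithmetic to the three timestamps from the log; `fixed_window_start` is a hypothetical helper written for illustration, not a Beam function.

```python
from collections import defaultdict


def fixed_window_start(timestamp, size):
    """Start of the fixed window (offset 0) that contains `timestamp`."""
    return timestamp - timestamp % size


# Timestamps taken from the logged output above.
elements = [
    {'id': '1', 'ts': 1596791187},
    {'id': '2', 'ts': 1596791192},
    {'id': '3', 'ts': 1596791199},
]

# Group the element ids by the window their timestamp falls into.
ids_per_window = defaultdict(list)
for element in elements:
    ids_per_window[fixed_window_start(element['ts'], 10)].append(element['id'])

for start, ids in sorted(ids_per_window.items()):
    print(f'[{start}, {start + 10}): {ids}')
```

This is consistent with the log: id 1 sits alone in its window, while ids 2 and 3 share a window and are combined into '2+3'. It does not by itself explain the repeated calls within each window.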
Answer 1:
It looks like this happens because of the MapReduce-style structure. With combiners, the output of one combiner call can be used as an input to another.
As an example, imagine summing three numbers (1, 2, 3). The combiner MAY first sum 1 and 2 (giving 3) and then use that result as input together with 3 (3 + 3 = 6). In your case, [1, 2, 3] seems to be used as an input to the next combiner call.
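The arithmetic in that example can be checked directly. The sketch below is plain Python, not Beam: it combines 1, 2 and 3 in two stages, and for contrast does the same with a mean, which is not associative in this sense. The two-stage split is only an assumption about how a runner might group the inputs.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


numbers = [1, 2, 3]

# Sum: combining a partial result with the rest gives the same total.
partial_sum = sum(numbers[:2])                   # 1 + 2 = 3
staged_sum = sum([partial_sum, numbers[2]])      # 3 + 3 = 6
print(staged_sum, sum(numbers))

# Mean: re-combining a partial mean gives a different answer.
partial_mean = mean(numbers[:2])                 # (1 + 2) / 2 = 1.5
staged_mean = mean([partial_mean, numbers[2]])   # (1.5 + 3) / 2 = 2.25
print(staged_mean, mean(numbers))
```

The staged sum matches the direct sum, while the staged mean (2.25) differs from the direct mean (2.0). A function like the mean therefore cannot be passed as a plain callable in this way; it needs an intermediate representation such as a (sum, count) pair.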
An example that really helped me understand this:
import apache_beam as beam

p = beam.Pipeline()


def make_list(elements):
    # Print whatever the combiner is called with, then return it unchanged.
    print(elements)
    return elements


(p
 | beam.Create(range(30))
 | beam.CombineGlobally(make_list))
p.run()
Notice that the element [1,..,10] is used as an input in the next combiner call.
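The nesting can be reproduced without Beam. The sketch below is a simplified model, under the assumption that inputs are buffered and the function is called whenever the buffer grows past a certain size, with the buffer then replaced by the single returned value. The helper `simulate_buffered_combine` and the buffer size of 10 are illustrative assumptions, not Beam's actual implementation.

```python
def make_list(elements):
    # Same idea as the function above: hand the input back as a list.
    return list(elements)


def simulate_buffered_combine(fn, inputs, buffer_size=10):
    """Hypothetical model: call `fn` each time the buffer exceeds `buffer_size`."""
    calls = []   # records what `fn` was called with, in order
    buffer = []
    for value in inputs:
        buffer.append(value)
        if len(buffer) > buffer_size:
            calls.append(list(buffer))
            # The output becomes a single element of the next input.
            buffer = [fn(buffer)]
    # Final call on whatever is left in the buffer.
    calls.append(list(buffer))
    return fn(buffer), calls


result, calls = simulate_buffered_combine(make_list, range(30))
for call in calls:
    print(call)
```

In this model each later call receives the previous return value as its first element, which is why the printed lists nest inside each other.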
Source: https://stackoverflow.com/questions/63235815/why-is-the-combine-function-called-three-times