问题
After passing the Pcollection to the next transform, the return/yield of the Transform is being multiplied, when I only need a single KV pair for a given street and accident count.
My understanding is that generators can assist in this, by holding values, but that only solves part of my problem. I've attempted to determine size prior to sending to next transform, but I haven't found any methods that give me true size of the Pcollection elements being passed.
class CountAccidents(beam.DoFn):
acci_dict = {}
def process(self, element):
if self.acci_dict.__contains__(element[0]['STREET_NAME']):
self.acci_dict[element[0]['STREET_NAME']] += 1
else:
self.acci_dict.update({element[0]['STREET_NAME']: 1})
if self.acci_dict != {}:
yield self.acci_dict
def run():
with beam.Pipeline() as pl:
test = (pl | 'Read' >> beam.io.ReadFromText('/modified_Excel_Crashes_Chicago.csv')
| 'Map Accident' >> beam.ParDo(AccidentstoDict())
| 'Count Accidents' >> beam.ParDo(CountAccidents())
| 'Print to Text' >> beam.io.WriteToText('/letstestthis', file_name_suffix='.txt'))```
Input Pcollection:
[{'CRASH_DATE': '3/25/19 0:25', 'WEATHER_CONDITION': 'CLEAR', 'STREET_NAME': 'KOSTNER AVE', 'CRASH_HOUR': '0'}]
[{'CRASH_DATE': '3/24/19 23:40', 'WEATHER_CONDITION': 'CLEAR', 'STREET_NAME': 'ARCHER AVE', 'CRASH_HOUR': '23'}]
[{'CRASH_DATE': '3/24/19 23:30', 'WEATHER_CONDITION': 'UNKNOWN', 'STREET_NAME': 'VAN BUREN ST', 'CRASH_HOUR': '23'}]
I expect to get this:
{'KILPATRICK AVE': 1, 'MILWAUKEE AVE': 1, 'CENTRAL AVE': 2, 'WESTERN AVE': 6, 'DANTE AVE': 1}
What I get is this(a slow build-up till complete):
{'KOSTNER AVE': 1}
{'KOSTNER AVE': 1, 'ARCHER AVE': 1}
{'KOSTNER AVE': 2, 'ARCHER AVE': 2, 'VAN BUREN ST': 1}
回答1:
You will need to do a combine per key, for Count you can make use of the one here:
https://beam.apache.org/releases/pydoc/2.9.0/apache_beam.transforms.combiners.html
After your read operation, output a KeyValue which is {STREET,1} followed by a Count per key transform which will give you the global count for the street.
From there it would be easy to also add Windowing functions if you wanted the output per week for example. You will just need to add the timestamp and the window into call. Example of how to do that is here:
In a batch pipeline how do I assign timestamps to data from the batch sources for example csv files in a Beam pipeline
来源:https://stackoverflow.com/questions/55445851/how-can-i-stop-the-extra-repetition-in-the-return-yield-while-still-keeping-the