How can I stop the extra repetition in the return/yield, while still keeping the running totals for a given key: value pair?

Submitted by 安稳与你 on 2019-12-11 17:38:32

Question


After I pass the PCollection to the next transform, the transform's return/yield output is multiplied, when I only need a single key-value pair per street with its accident count.

My understanding is that generators can help here by holding values, but that solves only part of my problem. I have tried to determine the size before sending to the next transform, but I haven't found any method that gives me the true size of the PCollection elements being passed.

import apache_beam as beam


class CountAccidents(beam.DoFn):
    acci_dict = {}

    def process(self, element):
        if self.acci_dict.__contains__(element[0]['STREET_NAME']):
            self.acci_dict[element[0]['STREET_NAME']] += 1
        else:
            self.acci_dict.update({element[0]['STREET_NAME']: 1})
        if self.acci_dict != {}:
            yield self.acci_dict


def run():
    with beam.Pipeline() as pl:
        test = (pl | 'Read' >> beam.io.ReadFromText('/modified_Excel_Crashes_Chicago.csv')
                   | 'Map Accident' >> beam.ParDo(AccidentstoDict())
                   | 'Count Accidents' >> beam.ParDo(CountAccidents())
                   | 'Print to Text' >> beam.io.WriteToText('/letstestthis', file_name_suffix='.txt'))

Input PCollection:
[{'CRASH_DATE': '3/25/19 0:25', 'WEATHER_CONDITION': 'CLEAR', 'STREET_NAME': 'KOSTNER AVE', 'CRASH_HOUR': '0'}]
[{'CRASH_DATE': '3/24/19 23:40', 'WEATHER_CONDITION': 'CLEAR', 'STREET_NAME': 'ARCHER AVE', 'CRASH_HOUR': '23'}]
[{'CRASH_DATE': '3/24/19 23:30', 'WEATHER_CONDITION': 'UNKNOWN', 'STREET_NAME': 'VAN BUREN ST', 'CRASH_HOUR': '23'}]

I expect to get this: 
{'KILPATRICK AVE': 1, 'MILWAUKEE AVE': 1, 'CENTRAL AVE': 2, 'WESTERN AVE': 6, 'DANTE AVE': 1}

What I get is this (a slow build-up until complete):
{'KOSTNER AVE': 1}
{'KOSTNER AVE': 1, 'ARCHER AVE': 1}
{'KOSTNER AVE': 2, 'ARCHER AVE': 2, 'VAN BUREN ST': 1}

Answer 1:


You will need to do a combine per key. For counting, you can use one of the combiners documented here:

https://beam.apache.org/releases/pydoc/2.9.0/apache_beam.transforms.combiners.html

After your read operation, output a key-value pair of the form (STREET, 1) for each record, followed by a count-per-key transform, which will give you the global count for each street.
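A minimal sketch of that approach, assuming the element shape shown in the question (a one-item list wrapping a dict). The names `to_street_pair` and `count_by_street` are hypothetical, and `beam.Create` stands in for the question's `Read` and `AccidentstoDict` steps, which are not shown in full:

```python
def to_street_pair(element):
    """Map one parsed record to a (street, 1) pair.

    Assumes the element shape from the question: a one-item list
    wrapping a dict that has a 'STREET_NAME' key.
    """
    return (element[0]['STREET_NAME'], 1)


def count_by_street(records, output_prefix):
    """Hypothetical helper: write one (street, count) line per street."""
    # Imported here so to_street_pair can be tried without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as pl:
        (pl
         # Stand-in for the question's Read + AccidentstoDict steps.
         | 'Create' >> beam.Create(records)
         | 'Pair With One' >> beam.Map(to_street_pair)
         # Groups by key and emits a single (street, count) per street.
         | 'Count Per Street' >> beam.combiners.Count.PerKey()
         | 'Print to Text' >> beam.io.WriteToText(
             output_prefix, file_name_suffix='.txt'))
```

Because the grouping is done by the runner rather than by state held on the DoFn, each street should appear once in the output instead of once per intermediate total. `beam.CombinePerKey(sum)` would likely work as well here, since every value is 1.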

From there it would also be easy to add windowing if you wanted, for example, the output per week. You would just need to attach a timestamp to each element and add a window step to the pipeline. An example of how to do that is here:

In a batch pipeline how do I assign timestamps to data from the batch sources for example csv files in a Beam pipeline
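As a rough sketch of the windowing idea, assuming the `CRASH_DATE` format shown in the question's input and treating those dates as UTC (the question does not say which time zone applies). The names `crash_timestamp`, `count_by_street_per_week`, and `WEEK_SECONDS` are hypothetical:

```python
from datetime import datetime, timezone

WEEK_SECONDS = 7 * 24 * 60 * 60


def crash_timestamp(element):
    """Parse CRASH_DATE (e.g. '3/25/19 0:25') into Unix seconds.

    Assumption: dates are interpreted as UTC.
    """
    parsed = datetime.strptime(element[0]['CRASH_DATE'], '%m/%d/%y %H:%M')
    return parsed.replace(tzinfo=timezone.utc).timestamp()


def count_by_street_per_week(records, output_prefix):
    """Hypothetical helper: count accidents per street within weekly windows."""
    # Imported here so crash_timestamp can be tried without Beam installed.
    import apache_beam as beam
    from apache_beam import window

    with beam.Pipeline() as pl:
        (pl
         | 'Create' >> beam.Create(records)
         # Attach the crash time as the element's event timestamp.
         | 'Add Timestamp' >> beam.Map(
             lambda e: window.TimestampedValue(
                 (e[0]['STREET_NAME'], 1), crash_timestamp(e)))
         | 'Weekly Windows' >> beam.WindowInto(
             window.FixedWindows(WEEK_SECONDS))
         # Counting now happens per key and per window.
         | 'Count Per Street' >> beam.combiners.Count.PerKey()
         | 'Print to Text' >> beam.io.WriteToText(
             output_prefix, file_name_suffix='.txt'))
```

Two caveats on this sketch: fixed windows are aligned to the Unix epoch, so seven-day windows start on Thursdays unless an offset is supplied, and the written lines do not say which week a count belongs to, so a further step would be needed to include the window in the output.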



Source: https://stackoverflow.com/questions/55445851/how-can-i-stop-the-extra-repetition-in-the-return-yield-while-still-keeping-the
