Pandas groupby apply performing slowly


I am working on a program that involves large amounts of data. I am using the python pandas module to look for errors in my data. This usually works very fast. However, it currently runs very slowly.

2 Answers

    The problem, I believe, is that your data has 5300 distinct groups. Due to this, anything slow within your function will be magnified. You could probably use a vectorized operation rather than a for loop in your function to save time, but a much easier way to shave off a few seconds is to return 0 rather than return group. When you return group, pandas will actually create a new data object combining your sorted groups, which you don't appear to use. When you return 0, pandas will combine 5300 zeros instead, which is much faster.

    For example:

    cols = ['ID_number','TimeOfDay','TypeOfCargo','TrackStart']
    groups = df.groupby(cols)
    print(len(groups))
    # 5353
    
    %timeit df.groupby(cols).apply(lambda group: group)
    # 1 loops, best of 3: 2.41 s per loop
    
    %timeit df.groupby(cols).apply(lambda group: 0)
    # 10 loops, best of 3: 64.3 ms per loop
    

    Just combining the results you don't use takes about 2.4 seconds; the rest of the time is the actual computation in your loop, which you should try to vectorize.
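
    (The question's DataFrame isn't shown here; as a purely hypothetical sketch, the snippets in this answer assume a column layout in which positions 4 and 5 are SectionStart and SectionStop:)

    import pandas as pd

    # Hypothetical example data -- the positional indices used below
    # (values[:, 4] and values[:, 5]) assume exactly this column order.
    df = pd.DataFrame({
        'ID_number':    [1, 1, 1, 2],
        'TimeOfDay':    ['day', 'day', 'day', 'night'],
        'TypeOfCargo':  ['A', 'A', 'A', 'B'],
        'TrackStart':   [0, 0, 0, 100],
        'SectionStart': [0, 10, 25, 100],   # third section starts at 25,
        'SectionStop':  [10, 20, 30, 150],  # but the previous one stopped at 20
    })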


    Edit:

    With a quick additional vectorized check before the for loop, and by returning 0 instead of group, I got the time down to about 2 seconds, which is basically the cost of sorting each group. Try this function:

    import numpy as np

    incomplete_coverage = []  # collects the error messages across groups

    def Full_coverage(group):
        if len(group) > 1:
            # DataFrame.sort was removed in later pandas; use sort_values
            group = group.sort_values('SectionStart', ascending=True)

            # this vectorized check is sufficient to find whether the loop
            # below will add anything to the list (columns 4 and 5 hold
            # SectionStart and SectionStop)
            if np.any(group.values[1:, 4] != group.values[:-1, 5]):
                start_km = group.iloc[0, 4]
                end_km = group.iloc[0, 5]
                end_km_index = group.index[0]

                for index, (i, j) in group.iloc[1:, [4, 5]].iterrows():
                    if i != end_km:
                        incomplete_coverage.append(
                            ('Expected startpoint: ' + str(end_km) + ' (row ' + str(end_km_index) + ')',
                             'Found startpoint: ' + str(i) + ' (row ' + str(index) + ')'))
                    start_km = i
                    end_km = j
                    end_km_index = index

        return 0

    cols = ['ID_number', 'TimeOfDay', 'TypeOfCargo', 'TrackStart']
    %timeit df.groupby(cols).apply(Full_coverage)
    # 1 loops, best of 3: 1.74 s per loop
    

    Edit 2: here's an example that incorporates my suggestions to move the sort outside the groupby and to remove the unnecessary loop. Removing the loop is not much faster for the given example, but will help when there are many incomplete sections:

    def Full_coverage_new(group):
        if len(group) > 1:
            # True wherever a row's SectionStart (col 4) differs from the
            # previous row's SectionStop (col 5)
            mask = group.values[1:, 4] != group.values[:-1, 5]
            if np.any(mask):
                err = ('Expected startpoint: {0} (row {1}) '
                       'Found startpoint: {2} (row {3})')
                incomplete_coverage.extend([err.format(group.iloc[i, 5],
                                                       group.index[i],
                                                       group.iloc[i + 1, 4],
                                                       group.index[i + 1])
                                            for i in np.where(mask)[0]])
        return 0

    incomplete_coverage = []
    cols = ['ID_number', 'TimeOfDay', 'TypeOfCargo', 'TrackStart']
    df_s = df.sort_values(['SectionStart', 'SectionStop'])
    df_s.groupby(cols).apply(Full_coverage_new)
    
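    For completeness, the same adjacency check can also be done without apply at all: after a global sort, compare each row to the previous one with shift. A rough sketch using the same column names (not benchmarked here):

    import pandas as pd

    cols = ['ID_number', 'TimeOfDay', 'TypeOfCargo', 'TrackStart']

    # Sort so consecutive rows of each group sit next to each other
    df_s = df.sort_values(cols + ['SectionStart', 'SectionStop'])

    # True where the previous row belongs to the same group
    same_group = (df_s[cols] == df_s[cols].shift()).all(axis=1)

    # Within a group, each SectionStart should equal the previous SectionStop
    gaps = same_group & (df_s['SectionStart'] != df_s['SectionStop'].shift())

    incomplete_rows = df_s.index[gaps]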
