Pandas groupby apply performing slow

青春惊慌失措 2021-02-05 03:42

I am working on a program that involves large amounts of data. I am using the Python pandas module to look for errors in my data. This usually works very fast. However this curr

2 Answers
  •  独厮守ぢ
    2021-02-05 04:23

    I found that the pandas indexers (.loc and .iloc) were also slowing things down. By moving the sort out of the loop and converting the data to NumPy arrays at the start of the function, I got an even faster result. The data is no longer a DataFrame inside the function, but the index labels recorded in the list can still be used to look up the rows in the original df.
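To illustrate the idea, here is a minimal sketch on a made-up frame. The column names `start` and `stop`, the index labels, and all values are hypothetical and not taken from the real data. It compares adjacent rows on the raw NumPy array and then uses the saved index labels to get back to the DataFrame rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names and values are illustrative only.
df = pd.DataFrame(
    {"start": [0, 10, 25, 30], "stop": [10, 20, 30, 40]},
    index=[101, 102, 103, 104],
)

labels = df.index.values   # row labels, kept alongside the raw values
values = df.values         # plain 2-D NumPy array, no .loc/.iloc lookups

# Compare each row's start (column 0) with the previous row's stop (column 1).
gap_mask = values[1:, 0] != values[:-1, 1]

# gap_mask describes rows 1..n-1, so slice the labels the same way.
gap_labels = labels[1:][gap_mask]

# The labels still address rows of the original DataFrame.
gap_rows = df.loc[gap_labels]
print(gap_rows)
```

The lookup with `.loc` happens once at the end, on the few rows of interest, rather than once per row inside the loop.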

    If there is any way to speed up the process even further I would appreciate the help. What I have so far:

    import numpy as np
    import pandas as pd

    def Full_coverage(group):
        # A single-row group cannot contain a gap.
        if len(group) > 1:
            group_index = group.index.values
            group = group.values  # work on the raw array; avoids .loc/.iloc overhead

            # Columns 4 and 5 are assumed to hold the section start and stop.
            # this condition is sufficient to find when the loop will add to the list
            if np.any(group[1:, 4] != group[:-1, 5]):
                end_km = group[0, 5]
                end_km_index = group_index[0]

                # Pair every row from the second onward with its own index label.
                for index, (i, j) in zip(group_index[1:], group[1:, [4, 5]]):

                    if i != end_km:
                        incomplete_coverage.append(
                            ('Expected startpoint: ' + str(end_km) + ' (row ' + str(end_km_index) + ')',
                             'Found startpoint: ' + str(i) + ' (row ' + str(index) + ')'))
                    end_km = j
                    end_km_index = index

        return 0

    incomplete_coverage = []
    # Sort once, up front, so every group arrives already ordered.
    df.sort_values(['SectionStart', 'SectionStop'], ascending=True, inplace=True)
    cols = ['ID_number', 'TimeOfDay', 'TypeOfCargo', 'TrackStart']
    # %timeit is an IPython magic; run this line in IPython or Jupyter.
    %timeit df.groupby(cols).apply(Full_coverage)
    # 1 loops, best of 3: 272 ms per loop
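One possible direction for a further speed-up is to drop the Python-level loop and `apply` altogether and compare each row with the previous row of its group using `groupby(...).shift()`. This is a different technique from the function above, shown only as a sketch: the data is made up, a single grouping column stands in for the four used in the question, and the `ExpectedStart` column name is hypothetical. I have not timed it against the real data.

```python
import pandas as pd

# Hypothetical data; column names follow the snippet above, values are made up.
df = pd.DataFrame({
    "ID_number":    [1, 1, 1, 2, 2],
    "SectionStart": [0, 10, 25, 0, 5],
    "SectionStop":  [10, 20, 30, 5, 9],
})
cols = ["ID_number"]  # stand-in for the four grouping columns in the question

df = df.sort_values(["SectionStart", "SectionStop"])

# Previous row's stop within each group; NaN for the first row of a group.
prev_stop = df.groupby(cols)["SectionStop"].shift()

# A gap is a non-first row whose start differs from the previous row's stop.
is_gap = prev_stop.notna() & (df["SectionStart"] != prev_stop)
gaps = df[is_gap]

# Attach the start that was expected, for reporting (hypothetical column name).
gaps = gaps.assign(ExpectedStart=prev_stop.loc[gaps.index])
print(gaps)
```

The result keeps the original index labels, so each reported gap can be traced back to its row in df without building message strings inside a loop.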
    
