I am working on a program that involves large amounts of data. I am using the Python pandas module to look for errors in my data. This usually works very fast. However, this current check turned out to be very slow, and I have been looking for ways to speed it up.
I found that the pandas locate commands (.loc and .iloc) were also slowing the process down. By moving the sort out of the loop and converting the data to NumPy arrays at the start of the function, I got an even faster result. I am aware that the data is no longer a DataFrame, but the indices returned in the list can be used to find the data in the original df.
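For example (a minimal sketch, where idx stands for one of the row numbers collected in the list):

row = df.loc[idx]   # look the reported row up in the original df
print(row['SectionStart'], row['SectionStop'])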
If there is any way to speed up the process even further, I would appreciate the help. What I have so far:
import numpy as np

def Full_coverage(group):
    if len(group) > 1:
        group_index = group.index.values
        group = group.values

        # this condition is sufficient to find when the loop will add to the list
        if np.any(group[1:, 4] != group[:-1, 5]):
            start_km = group[0, 4]
            end_km = group[0, 5]
            end_km_index = group_index[0]

            # zip with group_index[1:] so each row is paired with its own index;
            # zipping with the full group_index reports row numbers that lag one behind
            for index, (i, j) in zip(group_index[1:], group[1:, [4, 5]]):
                if i != end_km:
                    incomplete_coverage.append(
                        ('Expected startpoint: ' + str(end_km) + ' (row ' + str(end_km_index) + ')',
                         'Found startpoint: ' + str(i) + ' (row ' + str(index) + ')'))
                start_km = i
                end_km = j
                end_km_index = index
    return 0
incomplete_coverage = []
df.sort_values(['SectionStart', 'SectionStop'], ascending=True, inplace=True)  # sort_values replaces the removed df.sort
cols = ['ID_number','TimeOfDay','TypeOfCargo','TrackStart']
%timeit df.groupby(cols).apply(Full_coverage)
# 1 loops, best of 3: 272 ms per loop
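One further idea I am considering is to drop the inner Python loop entirely: since the mismatch test is already a NumPy comparison, the offending rows can be collected with boolean masks, and Python-level work is only needed to format the messages. A rough sketch of that idea (full_coverage_vectorized is a made-up name; it assumes the same column layout as above, with the start in column 4 and the stop in column 5, and appends to the same incomplete_coverage list):

import numpy as np

def full_coverage_vectorized(group):
    # compare every row's start (col 4) against the previous row's stop (col 5)
    # in a single vectorized step instead of a per-row loop
    if len(group) > 1:
        idx = group.index.values
        vals = group.values
        mismatch = vals[1:, 4] != vals[:-1, 5]   # one boolean per consecutive pair
        # only the mismatching pairs are visited here
        for prev_idx, cur_idx, expected, found in zip(
                idx[:-1][mismatch], idx[1:][mismatch],
                vals[:-1, 5][mismatch], vals[1:, 4][mismatch]):
            incomplete_coverage.append(
                ('Expected startpoint: ' + str(expected) + ' (row ' + str(prev_idx) + ')',
                 'Found startpoint: ' + str(found) + ' (row ' + str(cur_idx) + ')'))
    return 0

It would slot into the same call, i.e. df.groupby(cols).apply(full_coverage_vectorized), and since the remaining loop only runs over the mismatching row pairs, the per-row Python work disappears when the data is mostly consistent.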