问题
I have a data frame with a column of start dates and a column of end dates. I want to check the integrity of the dates by ensuring that the start date is before the end date (i.e. start_date < end_date).I have over 14,000 observations to run through.
I have data in the form of:
Start End
0 2008-10-01 2008-10-31
1 2006-07-01 2006-12-31
2 2000-05-01 2002-12-31
3 1971-08-01 1973-12-31
4 1969-01-01 1969-12-31
I have added a column to write the result to, even though I just want to highlight whether there are incorrect ones so I can delete them:
dates['Correct'] = " "
And have began to check each date pair using the following, where my dataframe is called dates:
for index, row in dates.iterrows():
if dates.Start[index] < dates.End[index]:
dates.Correct[index] = "correct"
elif dates.Start[index] == dates.End[index]:
dates.Correct[index] = "same"
elif dates.Start[index] > dates.End[index]:
dates.Correct[index] = "incorrect"
Which works, it is just taking a really really long-time (about over 15 minutes). I need a more efficiently running code - is there something I am doing wrong or could improve?
回答1:
Why not just do it in a vectorized way:
is_correct = dates['Start'] < dates['End']
is_incorrect = dates['Start'] > dates['End']
is_same = ~is_correct & ~is_incorrect
回答2:
Since the list doesn't need to be compared sequentially, you can gain performance by splitting your dataset and then using multiple processes to perform the comparison simultaneously. Take a look at the multiprocessing module for help.
回答3:
Something like the following may be quicker:
import pandas as pd
import datetime
df = pd.DataFrame({
'start': ["2008-10-01", "2006-07-01", "2000-05-01"],
'end': ["2008-10-31", "2006-12-31", "2002-12-31"],
})
def comparison_check(df):
start = datetime.datetime.strptime(df['start'], "%Y-%m-%d").date()
end = datetime.datetime.strptime(df['end'], "%Y-%m-%d").date()
if start < end:
return "correct"
elif start == end:
return "same"
return "incorrect"
In [23]: df.apply(comparison_check, axis=1)
Out[23]:
0 correct
1 correct
2 correct
dtype: object
Timings
In [26]: %timeit df.apply(comparison_check, axis=1)
1000 loops, best of 3: 447 µs per loop
So by my calculations, 14,000 rows should take (447/3)*14,000 = (149 µs)*14,000 = 2.086s, so a might shorter than 15 minutes :)
来源:https://stackoverflow.com/questions/37498071/compare-each-pair-of-dates-in-two-columns-in-python-efficiently