I am reading two types of csv files that are very similar. They are about the same length, 20 000 lines. Each line represents parameters recorded each second. Thus, the first
pandas.to_datetime is extremely slow (in certain instances) when it needs to parse the dates automatically. Since it seems like you know the formats, you should explicitly pass them to the format parameter, which will greatly improve the speed.
Here's an example:
import pandas as pd
df1 = pd.DataFrame({'Timestamp': ['2018-09-24 15:38:06']*10**5})
df2 = pd.DataFrame({'Timestamp': ['2018-09-24 03:38:06 PM']*10**5})
%timeit pd.to_datetime(df1.Timestamp)
#21 ms ± 50.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.to_datetime(df2.Timestamp)
#14.3 s ± 122 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That's roughly 700x slower (14.3 s vs. 21 ms). Now specify the format explicitly:
%timeit pd.to_datetime(df2.Timestamp, format='%Y-%m-%d %I:%M:%S %p')
#384 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
pandas is still parsing the second date format more slowly (384 ms vs. 21 ms), but it's not nearly as bad as it was before.
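Applied to the two kinds of csv files from the question, this could look roughly like the sketch below. The column names, the sample rows and the read_with_format helper are made up for illustration, and the two format strings are assumptions about what the files actually contain, so adjust them to your data.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for the two csv file types.
csv_24h = io.StringIO(
    "Timestamp,value\n"
    "2018-09-24 15:38:06,1\n"
    "2018-09-24 15:38:07,2\n"
)
csv_12h = io.StringIO(
    "Timestamp,value\n"
    "2018-09-24 03:38:06 PM,1\n"
    "2018-09-24 03:38:07 PM,2\n"
)

# Assumed formats: %H is the 24-hour clock, %I with %p is the 12-hour clock.
FORMAT_24H = "%Y-%m-%d %H:%M:%S"
FORMAT_12H = "%Y-%m-%d %I:%M:%S %p"


def read_with_format(source, fmt):
    """Read a csv and parse its Timestamp column with an explicit format."""
    df = pd.read_csv(source)
    df["Timestamp"] = pd.to_datetime(df["Timestamp"], format=fmt)
    return df


df_a = read_with_format(csv_24h, FORMAT_24H)
df_b = read_with_format(csv_12h, FORMAT_12H)

# Both files should now hold identical datetime values.
print(df_a["Timestamp"].equals(df_b["Timestamp"]))
```

Keeping one format string per file type means pandas never has to guess, and a row that does not match the expected format raises an error instead of being silently misparsed.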
Edit: as of pd.__version__ == '1.0.5', the automatic parsing seems to have gotten much better for formats that used to be parsed extremely slowly, likely due to the implementation of a performance improvement in pd.__version__ == '0.25.0':
import pandas as pd
df1 = pd.DataFrame({'Timestamp': ['2018-09-24 15:38:06']*10**5})
df2 = pd.DataFrame({'Timestamp': ['2018-09-24 03:38:06 PM']*10**5})
%timeit pd.to_datetime(df1.Timestamp)
#9.01 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.to_datetime(df2.Timestamp)
#9.1 ms ± 267 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
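The %timeit lines above only work in IPython or Jupyter. If you want to check how your own pandas version behaves from a plain script, a rough equivalent using the standard timeit module might look like this. The smaller row count and the number of repetitions are arbitrary choices to keep the run short, and the numbers you get will differ from the ones quoted above.

```python
import timeit

import pandas as pd

# Smaller than the 10**5 rows used above so the check stays quick
# even on pandas versions where automatic parsing is slow.
n_rows = 10**4
df2 = pd.DataFrame({"Timestamp": ["2018-09-24 03:38:06 PM"] * n_rows})
fmt = "%Y-%m-%d %I:%M:%S %p"

# Total seconds for 3 calls each: automatic parsing vs. explicit format.
auto_seconds = timeit.timeit(lambda: pd.to_datetime(df2["Timestamp"]), number=3)
explicit_seconds = timeit.timeit(
    lambda: pd.to_datetime(df2["Timestamp"], format=fmt), number=3
)

print(f"pandas {pd.__version__}")
print(f"automatic: {auto_seconds / 3:.4f} s per call")
print(f"explicit:  {explicit_seconds / 3:.4f} s per call")

# Sanity check: both approaches should produce the same datetimes.
same_result = pd.to_datetime(df2["Timestamp"]).equals(
    pd.to_datetime(df2["Timestamp"], format=fmt)
)
print(same_result)
```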