Question
I am reading two types of CSV files that are very similar. They are about the same length, 20,000 lines. Each line represents parameters recorded each second, so the first column is the timestamp.
- In the first file, the pattern is the following: 2018-09-24 15:38
- In the second file, the pattern is the following: 2018-09-24 03:38:06 PM
In both cases, the command is the same:
data = pd.read_csv(file)
data['Timestamp'] = pd.to_datetime(data['Timestamp'])
I checked the execution time of both lines:
- pd.read_csv takes about the same time in both cases
- pd.to_datetime (the second line of the code) takes ~3 to 4 seconds longer on the second file
The only difference is the date pattern. I would not have suspected that. Do you know why? Do you know how to fix this?
Answer 1:
pandas.to_datetime is extremely slow (in certain instances) when it needs to parse the dates automatically. Since it seems like you know the formats, you should explicitly pass them to the format parameter, which will greatly improve the speed.
Here's an example:
import pandas as pd
df1 = pd.DataFrame({'Timestamp': ['2018-09-24 15:38:06']*10**5})
df2 = pd.DataFrame({'Timestamp': ['2018-09-24 03:38:06 PM']*10**5})
%timeit pd.to_datetime(df1.Timestamp)
#21 ms ± 50.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.to_datetime(df2.Timestamp)
#14.3 s ± 122 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That's roughly 700x slower. Now specify the format explicitly:
%timeit pd.to_datetime(df2.Timestamp, format='%Y-%m-%d %I:%M:%S %p')
#384 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
pandas is still parsing the second date format more slowly, but it's not nearly as bad as it was before.
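Applied to the two-file situation from the question, the same idea might look like the sketch below. The in-memory CSV text, the value column, and the read_with_format helper are hypothetical stand-ins for illustration. The 24-hour format string assumes the first file includes seconds, as in the benchmark above; adjust it if the real data does not.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for the two CSV files in the question.
csv_24h = io.StringIO(
    "Timestamp,value\n"
    "2018-09-24 15:38:06,1\n"
    "2018-09-24 15:38:07,2\n"
)
csv_12h = io.StringIO(
    "Timestamp,value\n"
    "2018-09-24 03:38:06 PM,1\n"
    "2018-09-24 03:38:07 PM,2\n"
)


def read_with_format(source, fmt):
    """Read a CSV and parse its Timestamp column with an explicit format."""
    data = pd.read_csv(source)
    # Passing format= means pandas does not have to infer the layout itself.
    data["Timestamp"] = pd.to_datetime(data["Timestamp"], format=fmt)
    return data


# %H is the 24-hour clock; %I together with %p is the 12-hour clock with AM/PM.
data_24h = read_with_format(csv_24h, "%Y-%m-%d %H:%M:%S")
data_12h = read_with_format(csv_12h, "%Y-%m-%d %I:%M:%S %p")
```

Both frames should end up with the same datetime values, so code further down does not need to know which file layout the data came from.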
Source: https://stackoverflow.com/questions/52480839/slow-pd-to-datetime