问题
From an online API I gather a series of data points, each with a value and an ISO timestamp. Unfortunately I need to loop over them, so I store them in a temporary dict
and then create a pandas dataframe from that and set the index to the timestamp column (simplified example):
from datetime import datetime
import pandas
input_data = [
'2019-09-16T06:44:01+02:00',
'2019-11-11T09:13:01+01:00',
]
data = []
for timestamp in input_data:
_date = datetime.fromisoformat(timestamp)
data.append({'time': _date})
pd_data = pandas.DataFrame(data).set_index('time')
As long as all timestamps are in the same timezone and DST/non-DST everything works fine, and, I get a Dataframe with a DatetimeIndex
which I can work on later.
However, once two different time-offsets appear in one dataset (above example), I only get an Index
, in my dataframe, which does not support any time-based methods.
Is there any way to make pandas accept timezone-aware, differing date as index?
回答1:
- A pandas
datetime
column also requires the offset to be the same. A column with different offsets, will not be converted to adatetime
dtype. - I suggest, do not convert the data to a datetime until it's in pandas.
- Separate the time offset, and treat it as a
timedelta
to_timedelta
requires a format of'hh:mm:ss'
so add':00'
to the end of the offset- See Pandas: Time deltas for all the available timedelta operations
- pandas.Series.dt.tz_convert
- pandas.Series.tz_localize
- Convert to a specific TZ with:
- If a datetime is not
datetime64[ns, UTC]
dtype, then first use.dt.tz_localize('UTC')
before.dt.tz_convert('US/Pacific')
- Otherwise
df.datetime_utc.dt.tz_convert('US/Pacific')
- If a datetime is not
import pandas as pd
# sample data
input_data = ['2019-09-16T06:44:01+02:00', '2019-11-11T09:13:01+01:00']
# dataframe
df = pd.DataFrame(input_data, columns=['datetime'])
# separate the offset from the datetime and convert it to a timedelta
df['offset'] = pd.to_timedelta(df.datetime.str[-6:] + ':00')
# if desired, create a str with the separated datetime
# converting this to a datetime will lead to AmbiguousTimeError because of overlapping datetimes at 2AM, per the OP
df['datetime_str'] = df.datetime.str[:-6]
# convert the datetime column to a datetime format without the offset
df['datetime_utc'] = pd.to_datetime(df.datetime, utc=True)
# display(df)
datetime offset datetime_str datetime_utc
0 2019-09-16T06:44:01+02:00 0 days 02:00:00 2019-09-16 06:44:01 2019-09-16 04:44:01+00:00
1 2019-11-11T09:13:01+01:00 0 days 01:00:00 2019-11-11 09:13:01 2019-11-11 08:13:01+00:00
print(df.info())
[out]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 2 non-null object
1 offset 2 non-null timedelta64[ns]
2 datetime_str 2 non-null object
3 datetime_utc 2 non-null datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1), object(2), timedelta64[ns](1)
memory usage: 192.0+ bytes
# convert to local timezone
df.datetime_utc.dt.tz_convert('US/Pacific')
[out]:
0 2019-09-15 21:44:01-07:00
1 2019-11-11 00:13:01-08:00
Name: datetime_utc, dtype: datetime64[ns, US/Pacific]
Other Resources
- Calculate Pandas DataFrame Time Difference Between Two Columns in Hours and Minutes.
- Talk Python to Me: Episode #271: Unlock the mysteries of time, Python's datetime that is!
- Real Python: Using Python datetime to Work With Dates and Times
- The dateutil module provides powerful extensions to the standard datetime module.
回答2:
A minor correction of the question's wording (which I think is important). What you have are UTC offsets - DST/no-DST would require more information than that, i.e. a time zone. Here, this matters since you can parse timestamps with UTC offsets (even different ones) to UTC easily:
import pandas as pd
input_data = [
'2019-09-16T06:44:01+02:00',
'2019-11-11T09:13:01+01:00',
]
dti = pd.to_datetime(input_data, utc=True)
# dti
# DatetimeIndex(['2019-09-16 04:44:01+00:00', '2019-11-11 08:13:01+00:00'], dtype='datetime64[ns, UTC]', freq=None)
I always prefer to work with UTC so I'd be fine with that. If however you really need datetime in a certain time zone, you can convert e.g. as
dti = dti.tz_convert('Europe/Berlin')
# dti
# DatetimeIndex(['2019-09-16 06:44:01+02:00', '2019-11-11 09:13:01+01:00'], dtype='datetime64[ns, Europe/Berlin]', freq=None)
回答3:
I'm unaware of a way to use timezone aware datetimes as the index and get a datetime index in pandas. I do have a suggestion that might help depending on what is required out of your data though.
Would it be acceptable to convert the datetime objects to the same timezone, or is the timezone information something that must be retained? If you do require the timezone but not necessarily with the index, While looping through the dates you can store a new column with the old timezone or have a duplicate of the original time from the timezone in a new column so it can still be accessed.
来源:https://stackoverflow.com/questions/63495502/creating-pandas-datetimeindex-in-dataframe-from-dst-aware-datetime-objects