Flag Daylight Saving Time (DST) Hours in Pandas Date-Time Column

执笔经年 2021-01-20 04:50

I created an hourly dates dataframe, and now I would like to create a column that flags whether each row (hour) is in Daylight Saving Time or not. For example, in summer hou

4 Answers
  •  孤街浪徒
    2021-01-20 05:07

    There's a nice link in the comments that at least lets you do this manually. AFAIK, there isn't a built-in vectorized way to do this.

    import pandas as pd
    from pytz import timezone, utc
    
    # Generate data (a column, as opposed to an index)
    date_range = pd.date_range('1/1/2018', '1/1/2019', freq='h', tz='America/Denver')
    
    # Localized dates dataframe
    df = pd.DataFrame({'date_time': date_range})
    
    # Map transition times to year for some efficiency gain
    tz = timezone('America/Denver')
    # The stored transition times are naive UTC datetimes. Mark them as UTC
    # before converting; astimezone() on a naive datetime would interpret it
    # in the machine's local time zone and shift the result.
    transition_times = [utc.localize(t).astimezone(tz) for t in tz._utc_transition_times[1:]]
    transition_times_by_year = {}
    for start_time, stop_time in zip(transition_times[::2], transition_times[1::2]):
        transition_times_by_year[start_time.year] = [start_time, stop_time]
    
    # If the date is in DST, mark true, else false
    def mark_dst(dates):
        for date in dates:
            start_dst, stop_dst = transition_times_by_year[date.year]
            # stop_dst is the first instant of standard time, so exclude it
            yield start_dst <= date < stop_dst
    df['dst_flag'] = list(mark_dst(df['date_time']))
    
    # Do a quick sanity check to make sure we did this correctly for year 2018
    # (use .iloc for positions; the filtered rows keep their original labels)
    dst_times = df.loc[df['dst_flag'], 'date_time']
    dst_start = dst_times.iloc[0]  # First DST hour of 2018
    dst_end = dst_times.iloc[-1]  # Last DST hour of 2018
    print(dst_start)
    print(dst_end)
    

    This outputs:

    2018-03-11 03:00:00-06:00
    2018-11-04 01:00:00-06:00
    

    The dates agree with the 2018 US DST schedule (March 11 through November 4), which you can verify with a quick Google search. I'd still check the transition hours against a known reference for your timezone before relying on them.
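One way to cross-check the flagged hours without touching pytz internals is to compare each row's UTC offset with the smallest offset in the column. This is only a sketch: it assumes the column spans at least one stretch of standard time and that DST is the only reason the zone's offset changes over that span. The column name `dst_by_offset` is made up for illustration.

```python
import pandas as pd

# Same hourly, timezone-aware column as in the answer's example
df = pd.DataFrame({
    "date_time": pd.date_range("2018-01-01", "2019-01-01", freq="h", tz="America/Denver")
})

# Local wall-clock time with the timezone stripped
local_wall = df["date_time"].dt.tz_localize(None)
# The same instants expressed in UTC, also with the timezone stripped
utc_wall = df["date_time"].dt.tz_convert("UTC").dt.tz_localize(None)

# UTC offset per row: -7h in standard time, -6h in DST for America/Denver
utc_offset = local_wall - utc_wall

# Assumption: the smallest offset in the column is the standard-time offset,
# so any row with a larger offset is in DST
df["dst_by_offset"] = utc_offset > utc_offset.min()

dst_rows = df.loc[df["dst_by_offset"], "date_time"]
first_dst, last_dst = dst_rows.iloc[0], dst_rows.iloc[-1]
print(first_dst)
print(last_dst)
```

For the 2018 America/Denver data, `first_dst` and `last_dst` should land on the spring-forward and fall-back hours. The comparison operates on whole columns rather than row by row.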

    Some gotchas:

    1. pd.date_range generates an index, not data. I changed your original code slightly so the dates are a column rather than the index. I assume you have the data already.

    2. tz._utc_transition_times has an irregular structure. It holds alternating start/stop DST transition times in UTC, but the early entries do not pair up cleanly. It should be good from 1965 onward though. If you are working with earlier dates, change tz._utc_transition_times[1:] to tz._utc_transition_times. Note that not all years before 1965 are present.

    3. tz._utc_transition_times is "Python private". It is liable to change without warning or notice, and may or may not work for future or past versions of pytz. I'm using pytz version 2017.3. I recommend you run this code to make sure the output matches, and if not, make sure to use version 2017.3.
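Given gotcha 3, a possible alternative that stays on public API is `Timestamp.dst()`, which pandas timestamps inherit from `datetime` and which reports the DST offset in effect at that instant. A minimal sketch, assuming the column is already timezone-aware; it still goes row by row, so it is not vectorized:

```python
import datetime as dt

import pandas as pd

# Same hourly, timezone-aware column as in the answer's example
df = pd.DataFrame({
    "date_time": pd.date_range("2018-01-01", "2019-01-01", freq="h", tz="America/Denver")
})

# dst() returns a timedelta: zero in standard time, non-zero while DST is in effect
df["dst_flag"] = df["date_time"].map(lambda ts: ts.dst() != dt.timedelta(0))

# Positional access, because the filtered rows keep their original index labels
dst_rows = df.loc[df["dst_flag"], "date_time"]
first_dst, last_dst = dst_rows.iloc[0], dst_rows.iloc[-1]
print(first_dst)
print(last_dst)
```

This avoids the per-year lookup table entirely, at the cost of one Python-level call per row.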

    HTH, good luck with your research/regression problem!
