Add missing dates to pandas dataframe

春和景丽 2020-11-22 09:47

My data can have multiple events on a given date or NO events on a date. I take these events, get a count by date and plot them. However, when I plot them, my two series do

5 Answers
  • 2020-11-22 10:12

    You could use Series.reindex:

    import pandas as pd
    
    idx = pd.date_range('09-01-2013', '09-30-2013')
    
    s = pd.Series({'09-02-2013': 2,
                   '09-03-2013': 10,
                   '09-06-2013': 5,
                   '09-07-2013': 1})
    s.index = pd.DatetimeIndex(s.index)
    
    s = s.reindex(idx, fill_value=0)
    print(s)
    

    yields

    2013-09-01     0
    2013-09-02     2
    2013-09-03    10
    2013-09-04     0
    2013-09-05     0
    2013-09-06     5
    2013-09-07     1
    2013-09-08     0
    ...
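
To connect this to the question's workflow (count events per date, then plot), here is a minimal sketch that goes from raw event rows to a gap-free daily count. The `events` frame and its `date` column are made-up illustration data, not the asker's actual data.

```python
import pandas as pd

# Hypothetical raw events: two on Sept 2, one on Sept 3, one on Sept 6
events = pd.DataFrame({
    "date": pd.to_datetime(
        ["2013-09-02", "2013-09-02", "2013-09-03", "2013-09-06"]
    )
})

# Count events per date; dates without events are simply absent here
counts = events.groupby("date").size()

# Reindex onto the full daily range so absent dates become 0
idx = pd.date_range("2013-09-01", "2013-09-07")
counts = counts.reindex(idx, fill_value=0)
print(counts)
```

Reindexing every series onto the same `idx` gives them identical x-values, which is usually what is needed to plot them together.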
    
  • 2020-11-22 10:19

    Here's a nice method to fill in missing dates into a dataframe, with your choice of fill_value, days_back to fill in, and sort order (date_order) by which to sort the dataframe:

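    from datetime import datetime, timedelta

    import pandas as pd

    def fill_in_missing_dates(df, date_col_name='date', date_order='asc', fill_value=0, days_back=30):
        # Index a copy by date so the caller's dataframe is not modified in place
        df = df.set_index(date_col_name, drop=True)
        df.index = pd.DatetimeIndex(df.index)
        # Build a daily range covering the last `days_back` days up to today
        d = datetime.now().date()
        d2 = d - timedelta(days=days_back)
        idx = pd.date_range(d2, d, freq='D')
        df = df.reindex(idx, fill_value=fill_value)
        df[date_col_name] = pd.DatetimeIndex(df.index)
        # Apply the requested sort order ('asc' or 'desc')
        return df.sort_index(ascending=(date_order == 'asc'))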
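
Because the function above depends on today's date, its output changes from day to day. As a rough sketch of the same idea with a reproducible range, the variant below takes explicit bounds. The name `fill_date_range` and the `start`/`end` parameters are hypothetical and not part of the answer above.

```python
import pandas as pd

def fill_date_range(df, start, end, date_col="date", fill_value=0):
    """Return a copy of df with one row per day between start and end.

    Hypothetical helper: start/end replace the days_back window.
    """
    out = df.copy()  # leave the caller's dataframe untouched
    out[date_col] = pd.to_datetime(out[date_col])
    idx = pd.date_range(start, end, freq="D")
    out = out.set_index(date_col).reindex(idx, fill_value=fill_value)
    # Restore the date column from the new index
    return out.rename_axis(date_col).reset_index()

df = pd.DataFrame({"date": ["2020-01-02", "2020-01-04"], "count": [3, 7]})
filled = fill_date_range(df, "2020-01-01", "2020-01-05")
print(filled)
```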
    
  • 2020-11-22 10:22

    One issue is that reindex will fail if the index contains duplicate values. Say we're working with timestamped data, which we want to index by date:

    df = pd.DataFrame({
        'timestamps': pd.to_datetime(
            ['2016-11-15 1:00','2016-11-16 2:00','2016-11-16 3:00','2016-11-18 4:00']),
        'values':['a','b','c','d']})
    df.index = pd.DatetimeIndex(df['timestamps']).floor('D')
    df
    

    yields

                timestamps             values
    2016-11-15  "2016-11-15 01:00:00"  a
    2016-11-16  "2016-11-16 02:00:00"  b
    2016-11-16  "2016-11-16 03:00:00"  c
    2016-11-18  "2016-11-18 04:00:00"  d
    

    Due to the duplicate 2016-11-16 date, an attempt to reindex:

    all_days = pd.date_range(df.index.min(), df.index.max(), freq='D')
    df.reindex(all_days)
    

    fails with:

    ...
    ValueError: cannot reindex from a duplicate axis
    

    (The message means that the index contains duplicate labels, not that the index itself is a duplicate of something else.)
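
A quick way to check for this situation before reindexing is to inspect the index directly. The sketch below rebuilds the example frame and uses `Index.is_unique` and `Index.duplicated`; the variable names `has_duplicates` and `duplicated_days` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamps": pd.to_datetime(
        ["2016-11-15 01:00", "2016-11-16 02:00",
         "2016-11-16 03:00", "2016-11-18 04:00"]
    ),
    "values": ["a", "b", "c", "d"],
})
df.index = pd.DatetimeIndex(df["timestamps"]).floor("D")

# True when at least one date appears more than once in the index
has_duplicates = not df.index.is_unique

# The distinct dates that occur more than once
duplicated_days = df.index[df.index.duplicated(keep=False)].unique()

print(has_duplicates)
print(list(duplicated_days))
```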

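    Instead, we can left-join the frame onto the full date range, which keeps the duplicate rows and adds a NaN row for each missing date. (Older pandas versions allowed df.loc[all_days] for this, but since pandas 1.0 .loc raises a KeyError when any requested label is missing.)

    pd.DataFrame(index=all_days).join(df)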
    

    yields

                timestamps             values
    2016-11-15  "2016-11-15 01:00:00"  a
    2016-11-16  "2016-11-16 02:00:00"  b
    2016-11-16  "2016-11-16 03:00:00"  c
    2016-11-17  NaN                    NaN
    2016-11-18  "2016-11-18 04:00:00"  d
    

    fillna can be used on the column series to fill blanks if needed.
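
As a sketch of that last step, the snippet below expands the example frame to every day in range and then fills one column. It builds the expanded frame with a left join so that it runs on current pandas; the placeholder string "missing" and the name `expanded` are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamps": pd.to_datetime(
        ["2016-11-15 01:00", "2016-11-16 02:00",
         "2016-11-16 03:00", "2016-11-18 04:00"]
    ),
    "values": ["a", "b", "c", "d"],
})
# Index by day; the index is left unnamed to keep the example simple
df.index = pd.DatetimeIndex(df["timestamps"]).floor("D").rename(None)

all_days = pd.date_range(df.index.min(), df.index.max(), freq="D")

# Left join: one row per existing entry, plus a NaN row per missing day
expanded = pd.DataFrame(index=all_days).join(df)

# Fill blanks in a single column; other columns keep their NaN/NaT
expanded["values"] = expanded["values"].fillna("missing")
print(expanded)
```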

  • 2020-11-22 10:34

    An alternative approach is resample, which can handle duplicate dates in addition to missing dates. For example:

    df.resample('D').mean()
    

    resample is a deferred operation like groupby so you need to follow it with another operation. In this case mean works well, but you can also use many other pandas methods like max, sum, etc.

    Here is the original data, but with an extra entry for '2013-09-03':

                 val
    date           
    2013-09-02     2
    2013-09-03    10
    2013-09-03    20    <- duplicate date added to OP's data
    2013-09-06     5
    2013-09-07     1
    

    And here are the results:

                 val
    date            
    2013-09-02   2.0
    2013-09-03  15.0    <- mean of original values for 2013-09-03
    2013-09-04   NaN    <- NaN b/c date not present in orig
    2013-09-05   NaN    <- NaN b/c date not present in orig
    2013-09-06   5.0
    2013-09-07   1.0
    

    I left the missing dates as NaNs to make it clear how this works, but you can add fillna(0) to replace NaNs with zeroes as requested by the OP or alternatively use something like interpolate() to fill with non-zero values based on the neighboring rows.
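
Putting those pieces together, here is a small sketch on the data shown above. It also shows `sum()` as the aggregation, which may be the more natural choice when the values are event counts, since days with no rows then come out as 0 without a separate `fillna`. The names `daily_mean` and `daily_sum` are just for illustration.

```python
import pandas as pd

# The example data, including the duplicate 2013-09-03 entry
df = pd.DataFrame(
    {"val": [2, 10, 20, 5, 1]},
    index=pd.to_datetime(
        ["2013-09-02", "2013-09-03", "2013-09-03",
         "2013-09-06", "2013-09-07"]
    ),
)
df.index.name = "date"

# Mean per day, with missing days replaced by 0
daily_mean = df.resample("D").mean().fillna(0)

# Sum per day; days with no rows sum to 0
daily_sum = df.resample("D").sum()

print(daily_mean)
print(daily_sum)
```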

  • 2020-11-22 10:36

    A quicker workaround is to use .asfreq(). This avoids having to create a new index to pass to .reindex().

    # "broken" (staggered) dates
    dates = pd.Index([pd.Timestamp('2012-05-01'), 
                      pd.Timestamp('2012-05-04'), 
                      pd.Timestamp('2012-05-06')])
    s = pd.Series([1, 2, 3], dates)
    
    print(s.asfreq('D'))
    2012-05-01    1.0
    2012-05-02    NaN
    2012-05-03    NaN
    2012-05-04    2.0
    2012-05-05    NaN
    2012-05-06    3.0
    Freq: D, dtype: float64
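
If zeros (or carried-forward values) are wanted instead of NaN, `asfreq` also accepts `fill_value` and `method` arguments. A minimal sketch on the same three dates follows; the names `zero_filled` and `forward_filled` are illustrative.

```python
import pandas as pd

dates = pd.to_datetime(["2012-05-01", "2012-05-04", "2012-05-06"])
s = pd.Series([1, 2, 3], index=dates)

# Insert the missing days and set their value to 0
zero_filled = s.asfreq("D", fill_value=0)

# Alternatively, carry the last known value forward into the gaps
forward_filled = s.asfreq("D", method="ffill")

print(zero_filled)
print(forward_filled)
```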
    