Pandas: adding missing dates to a DataFrame while keeping column/index values?

谎友^ 2021-01-22 16:23

I have a pandas DataFrame containing dates, customers, products, and the dollar amount of each purchase.

   date     customer   product   amt  
 1/1/2017   tim         


        
4 answers
  • 2021-01-22 16:36

    Perhaps it is my SQL mindset, but consider a left-join merge against an expanded helper DataFrame:

    helper_df_list = [pd.DataFrame({'date': pd.date_range(df['date'].min(), df['date'].max()), 
                                    'customer': c, 'product': p }) 
                        for c in df['customer'].unique() 
                                for p in df['product'].unique()]
    
    helper_df = pd.concat(helper_df_list, ignore_index=True)
    
    final_df = pd.merge(helper_df, df, on=['date', 'customer', 'product'], how='left')\
                        .fillna(0).sort_values(['date', 'customer']).reset_index(drop=True)
    

    Output

    print(final_df)
    #    customer       date product  amt
    # 0       jim 2017-01-01   apple  0.0
    # 1       jim 2017-01-01   melon  2.0
    # 2       jim 2017-01-01  orange  0.0
    # 3       tim 2017-01-01   apple  3.0
    # 4       tim 2017-01-01   melon  0.0
    # 5       tim 2017-01-01  orange  0.0
    # 6       tom 2017-01-01   apple  5.0
    # 7       tom 2017-01-01   melon  4.0
    # 8       tom 2017-01-01  orange  0.0
    # 9       jim 2017-01-02   apple  0.0
    # 10      jim 2017-01-02   melon  0.0
    # 11      jim 2017-01-02  orange  0.0
    # 12      tim 2017-01-02   apple  0.0
    # 13      tim 2017-01-02   melon  0.0
    # 14      tim 2017-01-02  orange  0.0
    # 15      tom 2017-01-02   apple  0.0
    # 16      tom 2017-01-02   melon  0.0
    # 17      tom 2017-01-02  orange  0.0
    # 18      jim 2017-01-03   apple  0.0
    # 19      jim 2017-01-03   melon  0.0
    # 20      jim 2017-01-03  orange  0.0
    # 21      tim 2017-01-03   apple  0.0
    # 22      tim 2017-01-03   melon  0.0
    # 23      tim 2017-01-03  orange  0.0
    # 24      tom 2017-01-03   apple  0.0
    # 25      tom 2017-01-03   melon  0.0
    # 26      tom 2017-01-03  orange  0.0
    # 27      jim 2017-01-04   apple  2.0
    # 28      jim 2017-01-04   melon  0.0
    # 29      jim 2017-01-04  orange  0.0
    # 30      tim 2017-01-04   apple  0.0
    # 31      tim 2017-01-04   melon  3.0
    # 32      tim 2017-01-04  orange  0.0
    # 33      tom 2017-01-04   apple  0.0
    # 34      tom 2017-01-04   melon  1.0
    # 35      tom 2017-01-04  orange  4.0
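
The `amt` column comes back as float in the output above because the left join introduces NaN before `fillna(0)` runs. A minimal, self-contained sketch of the same helper-merge idea that restores the integer dtype follows. The sample `df` is an assumption reconstructed from the non-zero rows of the output, not the asker's original data, and `merged` is an illustrative name.

```python
import pandas as pd

# Sample data inferred from the non-zero rows in the output above (an assumption).
df = pd.DataFrame({
    "date": ["1/1/2017", "1/1/2017", "1/1/2017", "1/1/2017",
             "1/4/2017", "1/4/2017", "1/4/2017", "1/4/2017"],
    "customer": ["tim", "jim", "tom", "tom", "jim", "tim", "tom", "tom"],
    "product": ["apple", "melon", "apple", "melon",
                "apple", "melon", "melon", "orange"],
    "amt": [3, 2, 5, 4, 2, 3, 1, 4],
})
# The date column must be datetime for date_range and the merge keys to line up.
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")

# One helper frame per (customer, product) pair, each covering every day in range.
helper_df = pd.concat(
    [pd.DataFrame({"date": pd.date_range(df["date"].min(), df["date"].max()),
                   "customer": c, "product": p})
     for c in df["customer"].unique()
     for p in df["product"].unique()],
    ignore_index=True,
)

merged = helper_df.merge(df, on=["date", "customer", "product"], how="left")
# Fill only the amount column, then cast back to int (NaN forced it to float).
merged["amt"] = merged["amt"].fillna(0).astype(int)
final_df = (merged.sort_values(["date", "customer", "product"])
                  .reset_index(drop=True))
```

Filling only `amt` also avoids writing zeros into other columns if the real frame has more of them.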
    
  • 2021-01-22 16:47

    Note that this uses stack and unstack several times:

    df.set_index(['date','customer','product']).amt.unstack(-3).\
      reindex(columns=pd.date_range(df['date'].min(), 
        df['date'].max()),fill_value=0).\
          stack(dropna=False).unstack().stack(dropna=False).\
            unstack('customer').stack(dropna=False).reset_index().\
              fillna(0).sort_values(['level_1','customer','product'])
    Out[314]: 
       product    level_1 customer    0
    0    apple 2017-01-01      jim  0.0
    12   melon 2017-01-01      jim  2.0
    24  orange 2017-01-01      jim  0.0
    1    apple 2017-01-01      tim  3.0
    13   melon 2017-01-01      tim  0.0
    25  orange 2017-01-01      tim  0.0
    2    apple 2017-01-01      tom  5.0
    14   melon 2017-01-01      tom  4.0
    26  orange 2017-01-01      tom  0.0
    3    apple 2017-01-02      jim  0.0
    15   melon 2017-01-02      jim  0.0
    27  orange 2017-01-02      jim  0.0
    4    apple 2017-01-02      tim  0.0
    16   melon 2017-01-02      tim  0.0
    28  orange 2017-01-02      tim  0.0
    5    apple 2017-01-02      tom  0.0
    17   melon 2017-01-02      tom  0.0
    29  orange 2017-01-02      tom  0.0
    6    apple 2017-01-03      jim  0.0
    18   melon 2017-01-03      jim  0.0
    30  orange 2017-01-03      jim  0.0
    7    apple 2017-01-03      tim  0.0
    19   melon 2017-01-03      tim  0.0
    31  orange 2017-01-03      tim  0.0
    8    apple 2017-01-03      tom  0.0
    20   melon 2017-01-03      tom  0.0
    32  orange 2017-01-03      tom  0.0
    9    apple 2017-01-04      jim  2.0
    21   melon 2017-01-04      jim  0.0
    33  orange 2017-01-04      jim  0.0
    10   apple 2017-01-04      tim  0.0
    22   melon 2017-01-04      tim  3.0
    34  orange 2017-01-04      tim  0.0
    11   apple 2017-01-04      tom  0.0
    23   melon 2017-01-04      tom  1.0
    35  orange 2017-01-04      tom  4.0
    
  • 2021-01-22 16:50

    Let's use product from itertools, pd.date_range, and merge:

    from itertools import product
    
    daterange = pd.date_range(df['date'].min(), df['date'].max(), freq='D')
    d1 = pd.DataFrame(list(product(daterange, 
                                   df['customer'].unique(),
                                   df['product'].unique())), 
                      columns=['date', 'customer', 'product'])
    d1.merge(df, on=['date', 'customer', 'product'], how='left').fillna(0)
    

    Output:

             date customer product  amt
    0  2017-01-01      tim   apple  3.0
    1  2017-01-01      tim   melon  0.0
    2  2017-01-01      tim  orange  0.0
    3  2017-01-01      jim   apple  0.0
    4  2017-01-01      jim   melon  2.0
    5  2017-01-01      jim  orange  0.0
    6  2017-01-01      tom   apple  5.0
    7  2017-01-01      tom   melon  4.0
    8  2017-01-01      tom  orange  0.0
    9  2017-01-02      tim   apple  0.0
    10 2017-01-02      tim   melon  0.0
    11 2017-01-02      tim  orange  0.0
    12 2017-01-02      jim   apple  0.0
    13 2017-01-02      jim   melon  0.0
    14 2017-01-02      jim  orange  0.0
    15 2017-01-02      tom   apple  0.0
    16 2017-01-02      tom   melon  0.0
    17 2017-01-02      tom  orange  0.0
    18 2017-01-03      tim   apple  0.0
    19 2017-01-03      tim   melon  0.0
    20 2017-01-03      tim  orange  0.0
    21 2017-01-03      jim   apple  0.0
    22 2017-01-03      jim   melon  0.0
    23 2017-01-03      jim  orange  0.0
    24 2017-01-03      tom   apple  0.0
    25 2017-01-03      tom   melon  0.0
    26 2017-01-03      tom  orange  0.0
    27 2017-01-04      tim   apple  0.0
    28 2017-01-04      tim   melon  3.0
    29 2017-01-04      tim  orange  0.0
    30 2017-01-04      jim   apple  2.0
    31 2017-01-04      jim   melon  0.0
    32 2017-01-04      jim  orange  0.0
    33 2017-01-04      tom   apple  0.0
    34 2017-01-04      tom   melon  1.0
    35 2017-01-04      tom  orange  4.0
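
The same grid of date/customer/product combinations can probably also be built without `itertools`, using cross joins. This is only a sketch: it assumes pandas 1.2 or newer (where `how="cross"` is available), and the sample `df` is an assumption reconstructed from the non-zero rows of the output above.

```python
import pandas as pd

# Sample data inferred from the non-zero rows in the output above (an assumption).
df = pd.DataFrame({
    "date": ["1/1/2017", "1/1/2017", "1/1/2017", "1/1/2017",
             "1/4/2017", "1/4/2017", "1/4/2017", "1/4/2017"],
    "customer": ["tim", "jim", "tom", "tom", "jim", "tim", "tom", "tom"],
    "product": ["apple", "melon", "apple", "melon",
                "apple", "melon", "melon", "orange"],
    "amt": [3, 2, 5, 4, 2, 3, 1, 4],
})
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")

# One single-column frame per key, holding its distinct values.
dates = pd.DataFrame(
    {"date": pd.date_range(df["date"].min(), df["date"].max(), freq="D")}
)
customers = df[["customer"]].drop_duplicates()
products = df[["product"]].drop_duplicates()

# Cross joins produce every (date, customer, product) combination.
grid = dates.merge(customers, how="cross").merge(products, how="cross")

# Left join the real purchases onto the grid; missing amounts become 0.
result = (grid.merge(df, on=["date", "customer", "product"], how="left")
              .fillna({"amt": 0}))
```

Passing a dict to `fillna` limits the fill to `amt`, which matters if the frame carries other nullable columns.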
    
  • 2021-01-22 16:57

    IIUC you can do it this way:

    In [63]: dates = pd.date_range(df['date'].min(), df['date'].max())
    
    In [64]: idx = pd.MultiIndex.from_product((dates,
                                               df['customer'].unique(), 
                                               df['product'].unique()))
    
    In [72]: (df.set_index(['date','customer','product'])
                .reindex(idx, fill_value=0)
                .reset_index()
                .set_axis(df.columns, axis=1))
    Out[72]:
             date customer product  amt
    0  2017-01-01      tim   apple    3
    1  2017-01-01      tim   melon    0
    2  2017-01-01      tim  orange    0
    3  2017-01-01      jim   apple    0
    4  2017-01-01      jim   melon    2
    5  2017-01-01      jim  orange    0
    6  2017-01-01      tom   apple    5
    7  2017-01-01      tom   melon    4
    8  2017-01-01      tom  orange    0
    9  2017-01-02      tim   apple    0
    ..        ...      ...     ...  ...
    26 2017-01-03      tom  orange    0
    27 2017-01-04      tim   apple    0
    28 2017-01-04      tim   melon    3
    29 2017-01-04      tim  orange    0
    30 2017-01-04      jim   apple    2
    31 2017-01-04      jim   melon    0
    32 2017-01-04      jim  orange    0
    33 2017-01-04      tom   apple    0
    34 2017-01-04      tom   melon    1
    35 2017-01-04      tom  orange    4
    
    [36 rows x 4 columns]
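
A variation on the reindex approach, sketched under a few assumptions: naming the levels of the `MultiIndex` lets `reset_index` restore the column names without `set_axis`, and summing per key first keeps the index unique, which `reindex` requires. The sample `df` is reconstructed from the non-zero rows of the output above, and `keys`, `totals`, and `filled` are illustrative names.

```python
import pandas as pd

# Sample data inferred from the non-zero rows in the output above (an assumption).
df = pd.DataFrame({
    "date": ["1/1/2017", "1/1/2017", "1/1/2017", "1/1/2017",
             "1/4/2017", "1/4/2017", "1/4/2017", "1/4/2017"],
    "customer": ["tim", "jim", "tom", "tom", "jim", "tim", "tom", "tom"],
    "product": ["apple", "melon", "apple", "melon",
                "apple", "melon", "melon", "orange"],
    "amt": [3, 2, 5, 4, 2, 3, 1, 4],
})
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")

keys = ["date", "customer", "product"]

# Full grid of keys; the level names carry over to the columns on reset_index.
idx = pd.MultiIndex.from_product(
    [pd.date_range(df["date"].min(), df["date"].max()),
     df["customer"].unique(),
     df["product"].unique()],
    names=keys,
)

# Aggregate repeated purchases so each key appears once before reindexing.
totals = df.groupby(keys)["amt"].sum()

# Missing combinations get 0 directly, so amt stays an integer column.
filled = totals.reindex(idx, fill_value=0).reset_index()
```

Because no NaN is ever introduced, this route avoids the float upcast seen in the merge-based answers.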
    