How can I Group By Month from a Date field using Python/Pandas

孤独总比滥情好 2021-02-02 10:57

I have a DataFrame df that looks like this:

| date      | Revenue |
|-----------|---------|
| 6/2/2017  | 100     |
| 5/23/2017 | 200     |
| 5/20/2017 | 300     |


        
5 Answers
  • 2021-02-02 11:04

    Try this:

    In [6]: df['date'] = pd.to_datetime(df['date'])
    
    In [7]: df
    Out[7]: 
            date  Revenue
    0 2017-06-02      100
    1 2017-05-23      200
    2 2017-05-20      300
    3 2017-06-22      400
    4 2017-06-21      500
    
    
    
    In [59]: df.groupby(df['date'].dt.strftime('%B'))['Revenue'].sum().sort_values()
    Out[59]: 
    date
    May      500
    June    1000
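
Grouping on the month name alone drops the year and sorts the labels alphabetically unless they are re-sorted. As a hedged alternative (not part of the answer above), grouping on a monthly period via dt.to_period('M') keeps year and month together and sorts chronologically. A minimal sketch, assuming the five rows shown in the output above:

```python
import pandas as pd

# Sample rows taken from the thread's example data.
df = pd.DataFrame({
    "date": ["6/2/2017", "5/23/2017", "5/20/2017", "6/22/2017", "6/21/2017"],
    "Revenue": [100, 200, 300, 400, 500],
})
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")

# Each date becomes a monthly Period such as 2017-05, which includes the year.
monthly = df.groupby(df["date"].dt.to_period("M"))["Revenue"].sum()
print(monthly)
```

The resulting index is a PeriodIndex, so it can still be formatted with monthly.index.strftime('%B') if only the month names are wanted.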
    
  • 2021-02-02 11:17

    Try this:

    1. Convert the date column to datetime format.

      ---> df['date'] = pd.to_datetime(df['date'])

    2. Add a new column to the DataFrame holding the month name, e.g. 'May', 'June'.

      ---> df['months'] = df['date'].apply(lambda x: x.strftime('%B'))

      ---> Here x is each date taken from the date column of the DataFrame.

    3. Now aggregate the data on the months column and sum the revenue.

      ---> response_data_frame = df.groupby('months')['Revenue'].sum()

      ---> print(response_data_frame)

    Output (groupby sorts the month names alphabetically):

    | months | Revenue |
    |--------|---------|
    | June   | 1000    |
    | May    | 500     |
    
  • 2021-02-02 11:20

    Try a groupby using a pandas Grouper:

    import pandas as pd

    df = pd.DataFrame({'date': ['6/2/2017', '5/23/2017', '5/20/2017', '6/22/2017', '6/21/2017'],
                       'Revenue': [100, 200, 300, 400, 500]})
    df['date'] = pd.to_datetime(df['date'])
    # Group into one-month bins (pandas 2.2+ deprecates the 'M' alias in favor of 'ME')
    dg = df.groupby(pd.Grouper(key='date', freq='1M')).sum()
    dg.index = dg.index.strftime('%B')  # replace month-end timestamps with month names
    
          Revenue
    May       500
    June     1000
    
  • 2021-02-02 11:22

    This will work better.

    Try this:

    # Explicitly convert the column to datetime
    df['date'] = pd.to_datetime(df['date'])
    # Set the date column as the index
    df.set_index('date', inplace=True)
    
    # Use 'M' for monthly totals; change the alias for other frequencies
    # (pandas 2.2+ prefers 'ME' for month end).
    df['Revenue'].resample('M').sum()
    

    This code gives the same totals as @shivsn's answer in the first post.

    However, resample supports many more operations, and its labels keep the year, so the same month in different years stays separate. I recommend using it:

    >>> df['Date'] = pd.to_datetime(df['Date'])
    >>> df.set_index('Date',inplace=True)
    >>> df['withdrawal'].resample('M').sum().sort_values()
    Date
    2019-10-31     28710.00
    2019-04-30     31437.00
    2019-07-31     39728.00
    2019-11-30     40121.00
    2019-05-31     46495.00
    2020-02-29     57751.10
    2019-12-31     72469.13
    2020-01-31     76115.78
    2019-06-30     76947.00
    2019-09-30     79847.04
    2020-03-31     97920.18
    2019-08-31    205279.45
    Name: withdrawal, dtype: float64
    

    whereas @shivsn's code produces the same totals, labelled by month name:

    >>> df.groupby(df['Date'].dt.strftime('%B'))['withdrawal'].sum().sort_values()
    Date
    October       28710.00
    April         31437.00
    July          39728.00
    November      40121.00
    May           46495.00
    February      57751.10
    December      72469.13
    January       76115.78
    June          76947.00
    September     79847.04
    March         97920.18
    August       205279.45
    Name: withdrawal, dtype: float64
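
The two transcripts above agree because that data covers exactly twelve distinct months. Once the same month appears in two different years, the approaches diverge. The sketch below illustrates this with made-up values (the three rows are hypothetical, not the answerer's data). It also tries the 'ME' alias first, since pandas 2.2 deprecated 'M' for month end, and falls back to 'M' on older versions, assuming those raise ValueError for the unknown alias:

```python
import pandas as pd

# Hypothetical data: May appears in two different years.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-05-10", "2019-05-20", "2020-05-15"]),
    "withdrawal": [10.0, 20.0, 40.0],
})

# Month names carry no year, so both Mays fall into a single group.
by_name = df.groupby(df["Date"].dt.strftime("%B"))["withdrawal"].sum()

# resample needs a DatetimeIndex. Prefer the newer 'ME' alias and fall back
# to 'M' where 'ME' is not recognised (assumed to raise ValueError).
series = df.set_index("Date")["withdrawal"]
try:
    by_month_end = series.resample("ME").sum()
except ValueError:
    by_month_end = series.resample("M").sum()

print(by_name)
# resample also emits the empty months in between (sum 0.0); hide them here.
print(by_month_end[by_month_end > 0])
```

Here by_name has a single 'May' row, while by_month_end keeps 2019-05-31 and 2020-05-31 as separate labels.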
    
  • 2021-02-02 11:26

    For a DataFrame with many rows, strftime takes considerably longer. If the date column already has dtype datetime64[ns] (use pd.to_datetime() to convert, or specify parse_dates when importing a CSV, etc.), the datetime properties can be accessed directly to get the groupby labels (Method 3). The speedup is substantial.

    import numpy as np
    import pandas as pd
    
    T = pd.date_range(pd.Timestamp(0), pd.Timestamp.now()).to_frame(index=False)
    T = pd.concat([T for i in range(1,10)])
    T['revenue'] = pd.Series(np.random.randint(1000, size=T.shape[0]))
    T.columns = ['date', 'revenue']  # give the date column a proper name
    
    print(T.shape)   # (159336, 2)
    print(T.dtypes)  # date: datetime64[ns], revenue: int32
    

    Method 1: strftime

    %timeit -n 10 -r 7 T.groupby(T['date'].dt.strftime('%B'))['revenue'].sum()
    

    1.47 s ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    Method 2: Grouper

    %timeit -n 10 -r 7 T.groupby(pd.Grouper(key='date', freq='1M')).sum()
    # NOTE: the result is indexed by month-end timestamps, one group per month of each year; format the index with strftime('%B') if names are needed
    

    56.9 ms ± 2.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    Method 3: datetime properties

    %timeit -n 10 -r 7 T.groupby(T['date'].dt.month)['revenue'].sum()
    # NOTE: the group keys are the integers 1..12; map them to month names manually
    

    34 ms ± 3.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
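
The note under Method 3 leaves the integer-to-name mapping open. One way to do it, sketched here with the standard library's calendar.month_name (a choice made for this example, not something the answer prescribes), is to relabel the small aggregated index after grouping, so the per-row cost of strftime is avoided. The sample rows are the ones from the question:

```python
import calendar

import pandas as pd

# Sample rows taken from the thread's example data.
df = pd.DataFrame({
    "date": ["6/2/2017", "5/23/2017", "5/20/2017", "6/22/2017", "6/21/2017"],
    "Revenue": [100, 200, 300, 400, 500],
})
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")

# Group on the integer month (fast), which also sorts in calendar order.
by_month = df.groupby(df["date"].dt.month)["Revenue"].sum()

# Relabel the at most 12 keys with month names; names follow the current locale.
by_month.index = by_month.index.map(lambda m: calendar.month_name[m])
print(by_month)
```

Because the keys are sorted as integers before relabelling, the rows stay in calendar order rather than alphabetical order.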
