Pandas groupby month and year

闹比i 2020-11-27 11:18

I have the following dataframe:

Date        abc    xyz
01-Jun-13   100    200
03-Jun-13   -20    50
15-Aug-13   40     -5
20-Jan-14   25     15
21-Feb-14   60     80


        
4 Answers
  • 2020-11-27 11:36

    There are different ways to do that.

    • I created the data frame to showcase the different techniques for filtering your data.
    import pandas as pd

    df = pd.DataFrame({'Date': ['01-Jun-13', '03-Jun-13', '15-Aug-13', '20-Jan-14', '21-Feb-14'],
                       'abc': [100, -20, 40, 25, 60],
                       'xyz': [200, 50, -5, 15, 80]})

    • I split each date into day, month and year, and also built a combined month-year value, as you described.
    def getMonth(s):
      return s.split("-")[1]
    
    def getDay(s):
      return s.split("-")[0]
    
    def getYear(s):
      return s.split("-")[2]
    
    def getYearMonth(s):
      return s.split("-")[1]+"-"+s.split("-")[2]
    
    • I created new columns: year, month, day and 'YearMonth'. You only need one of the two approaches: group by the two columns 'year' and 'month', or by the single column 'YearMonth'.
    df['year'] = df['Date'].apply(getYear)
    df['month'] = df['Date'].apply(getMonth)
    df['day'] = df['Date'].apply(getDay)
    df['YearMonth'] = df['Date'].apply(getYearMonth)
    

    Output:

            Date  abc  xyz year month day YearMonth
    0  01-Jun-13  100  200   13   Jun  01    Jun-13
    1  03-Jun-13  -20   50   13   Jun  03    Jun-13
    2  15-Aug-13   40   -5   13   Aug  15    Aug-13
    3  20-Jan-14   25   15   14   Jan  20    Jan-14
    4  21-Feb-14   60   80   14   Feb  21    Feb-14
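    The same columns can also be derived without manual string splitting. The following is a minimal sketch, assuming the dates always follow the "01-Jun-13" pattern and that month abbreviations are in English (the %b directive depends on the locale); the names parsed and monthly are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["01-Jun-13", "03-Jun-13", "15-Aug-13", "20-Jan-14", "21-Feb-14"],
    "abc": [100, -20, 40, 25, 60],
    "xyz": [200, 50, -5, 15, 80],
})

# %d = day, %b = abbreviated month name, %y = two-digit year
parsed = pd.to_datetime(df["Date"], format="%d-%b-%y")

df["year"] = parsed.dt.year
df["month"] = parsed.dt.month
df["day"] = parsed.dt.day
# Zero-padded "YYYY-MM" strings sort chronologically, unlike "Jun-13"
df["YearMonth"] = parsed.dt.strftime("%Y-%m")

# Sum abc and xyz per calendar month
monthly = df.groupby("YearMonth")[["abc", "xyz"]].sum()
print(monthly)
```

    Because year and month are now integers and YearMonth is zero-padded, the default sorted group order is also the chronological order.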
    
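    • You can iterate over the groups that groupby(..) produces.

    In this case, we are grouping by two columns:

    # sort=False keeps the groups in order of first appearance; by default they are sorted by key
    for key, g in df.groupby(['year', 'month'], sort=False):
        print(key, g)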
    

    Output:

    ('13', 'Jun')         Date  abc  xyz year month day YearMonth
    0  01-Jun-13  100  200   13   Jun  01    Jun-13
    1  03-Jun-13  -20   50   13   Jun  03    Jun-13
    ('13', 'Aug')         Date  abc  xyz year month day YearMonth
    2  15-Aug-13   40   -5   13   Aug  15    Aug-13
    ('14', 'Jan')         Date  abc  xyz year month day YearMonth
    3  20-Jan-14   25   15   14   Jan  20    Jan-14
    ('14', 'Feb')         Date  abc  xyz year month day YearMonth
    4  21-Feb-14   60   80   14   Feb  21    Feb-14
    

    In this case, we are grouping by one column:

    # a plain column name (not a one-element list) gives scalar keys such as 'Jun-13'
    for key, g in df.groupby('YearMonth', sort=False):
        print(key, g)
    

    Output:

    Jun-13         Date  abc  xyz year month day YearMonth
    0  01-Jun-13  100  200   13   Jun  01    Jun-13
    1  03-Jun-13  -20   50   13   Jun  03    Jun-13
    Aug-13         Date  abc  xyz year month day YearMonth
    2  15-Aug-13   40   -5   13   Aug  15    Aug-13
    Jan-14         Date  abc  xyz year month day YearMonth
    3  20-Jan-14   25   15   14   Jan  20    Jan-14
    Feb-14         Date  abc  xyz year month day YearMonth
    4  21-Feb-14   60   80   14   Feb  21    Feb-14
    
    • If you want to access a specific group, you can use get_group:

    print(df.groupby('YearMonth').get_group('Jun-13'))

    Output:

            Date  abc  xyz year month day YearMonth
    0  01-Jun-13  100  200   13   Jun  01    Jun-13
    1  03-Jun-13  -20   50   13   Jun  03    Jun-13
    
    • A boolean mask is an alternative to get_group: it filters the rows directly and gives the same result.

    print(df[df['YearMonth'] == 'Jun-13'])
    

    Output:

            Date  abc  xyz year month day YearMonth
    0  01-Jun-13  100  200   13   Jun  01    Jun-13
    1  03-Jun-13  -20   50   13   Jun  03    Jun-13
    

    You can select the abc or xyz values for Jun-13 as arrays:

    print(df[df['YearMonth'] == 'Jun-13'].abc.values)
    print(df[df['YearMonth'] == 'Jun-13'].xyz.values)
    

    Output:

    [100 -20]  #abc values
    [200  50]  #xyz values
    

    You can use this to loop over every "year-month" value and apply criteria to get the related data (a set is unordered, so the months come back in no particular order).

    for x in set(df.YearMonth):
        print(df[df['YearMonth'] == x].abc.values)
        print(df[df['YearMonth'] == x].xyz.values)
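    The loop above filters the frame once per month. As a sketch of an alternative, a single groupby pass can collect the same values; the dictionary name values_by_month is just an illustration, and the frame below repeats only the columns needed here.

```python
import pandas as pd

df = pd.DataFrame({
    "YearMonth": ["Jun-13", "Jun-13", "Aug-13", "Jan-14", "Feb-14"],
    "abc": [100, -20, 40, 25, 60],
    "xyz": [200, 50, -5, 15, 80],
})

# One pass over the groups; sort=False keeps the order of first appearance
values_by_month = {
    key: {"abc": g["abc"].tolist(), "xyz": g["xyz"].tolist()}
    for key, g in df.groupby("YearMonth", sort=False)
}

for key, vals in values_by_month.items():
    print(key, vals["abc"], vals["xyz"])
```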
    

    I also recommend checking this answer.

  • 2020-11-27 11:49

    Why not keep it simple?!

    # DF must have a DatetimeIndex for .year and .month to be available
    GB = DF.groupby([DF.index.year, DF.index.month]).sum()
    

    giving you,

    print(GB)
            abc  xyz
    2013 6   80  250
         8   40   -5
    2014 1   25   15
         2   60   80
    

    and then you can plot it, as asked, using:

    GB.plot('abc','xyz',kind='scatter')
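    The one-liner above relies on DF having a DatetimeIndex. A minimal end-to-end sketch, assuming the dates arrive as "01-Jun-13" strings as in the question (the level names year and month are added only for readability):

```python
import pandas as pd

DF = pd.DataFrame(
    {"abc": [100, -20, 40, 25, 60], "xyz": [200, 50, -5, 15, 80]},
    index=pd.to_datetime(
        ["01-Jun-13", "03-Jun-13", "15-Aug-13", "20-Jan-14", "21-Feb-14"],
        format="%d-%b-%y",
    ),
)

# Group on the integer year and month taken from the DatetimeIndex
GB = DF.groupby([DF.index.year, DF.index.month]).sum()
GB.index.names = ["year", "month"]
print(GB)
```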
    
  • 2020-11-27 11:50

    You can also do it by creating a string column with the year and month as follows:

    df['date'] = df.index
    df['year-month'] = df['date'].apply(lambda x: str(x.year) + ' ' + str(x.month))
    grouped = df.groupby('year-month')
    

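    However, this doesn't preserve chronological order when you loop over the groups, because the keys are strings and are sorted lexicographically ("2008 10" comes before "2008 2"), e.g.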

    for name, group in grouped:
        print(name)
    

    Will give:

    2007 11
    2007 12
    2008 1
    2008 10
    2008 11
    2008 12
    2008 2
    2008 3
    2008 4
    2008 5
    2008 6
    2008 7
    2008 8
    2008 9
    2009 1
    2009 10
    

    So if you want to preserve the order, group on the integer year and month instead, as suggested by @Q-man above:

    grouped = df.groupby([df.index.year, df.index.month])
    

    This will preserve the order in the above loop:

    (2007, 11)
    (2007, 12)
    (2008, 1)
    (2008, 2)
    (2008, 3)
    (2008, 4)
    (2008, 5)
    (2008, 6)
    (2008, 7)
    (2008, 8)
    (2008, 9)
    (2008, 10)
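    If a single string key is still preferred, zero-padding the month is one way to make the lexicographic order match the chronological one. This is a small sketch on made-up dates, not data from the question:

```python
import pandas as pd

# Hypothetical sample: one row in each of four months of 2008
df = pd.DataFrame(
    {"value": [1, 2, 3, 4]},
    index=pd.to_datetime(["2008-01-15", "2008-02-15", "2008-10-15", "2008-11-15"]),
)

# Unpadded keys such as "2008 2" sort after "2008 10"
unpadded = df.index.map(lambda ts: f"{ts.year} {ts.month}")
# Zero-padded keys such as "2008-02" sort chronologically
padded = df.index.strftime("%Y-%m")

unpadded_order = [name for name, _ in df.groupby(unpadded)]
padded_order = [name for name, _ in df.groupby(padded)]

print(unpadded_order)
print(padded_order)
```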
    
  • 2020-11-27 12:01

    You can use either resample or Grouper (which resamples under the hood).

    First make sure the date column actually holds datetimes (convert it with pd.to_datetime). It's easier if it's a DatetimeIndex:

    In [11]: df1
    Out[11]:
                abc  xyz
    Date
    2013-06-01  100  200
    2013-06-03  -20   50
    2013-08-15   40   -5
    2014-01-20   25   15
    2014-02-21   60   80
    
    In [12]: g = df1.groupby(pd.Grouper(freq="M"))  # DataFrameGroupBy (grouped by Month)
    
    In [13]: g.sum()
    Out[13]:
                abc  xyz
    Date
    2013-06-30   80  250
    2013-07-31  NaN  NaN
    2013-08-31   40   -5
    2013-09-30  NaN  NaN
    2013-10-31  NaN  NaN
    2013-11-30  NaN  NaN
    2013-12-31  NaN  NaN
    2014-01-31   25   15
    2014-02-28   60   80
    
    In [14]: df1.resample("M").sum()  # the same
    Out[14]:
                abc  xyz
    Date
    2013-06-30   80  250
    2013-07-31  NaN  NaN
    2013-08-31   40   -5
    2013-09-30  NaN  NaN
    2013-10-31  NaN  NaN
    2013-11-30  NaN  NaN
    2013-12-31  NaN  NaN
    2014-01-31   25   15
    2014-02-28   60   80
    

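    Note: Previously pd.Grouper(freq="M") was written as pd.TimeGrouper("M"); the latter has been deprecated since 0.21. In pandas 2.2 and later the month-end alias is spelled "ME" ("M" is deprecated for offsets), and since 0.22 sum() returns 0 rather than NaN for the empty months unless you pass min_count=1.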
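    Since the spelling of the month-end alias differs between pandas versions, the sketch below passes an offset object instead of a string. That choice, the variable names, and the use of min_count=1 (so that empty months show NaN as in the output above) are assumptions for illustration, not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"abc": [100, -20, 40, 25, 60], "xyz": [200, 50, -5, 15, 80]},
    index=pd.DatetimeIndex(
        ["2013-06-01", "2013-06-03", "2013-08-15", "2014-01-20", "2014-02-21"],
        name="Date",
    ),
)

# An offset object avoids depending on the "M" / "ME" alias spelling
month_end = pd.offsets.MonthEnd()

# min_count=1 makes months with no rows come out as NaN instead of 0
by_grouper = df1.groupby(pd.Grouper(freq=month_end)).sum(min_count=1)
by_resample = df1.resample(month_end).sum(min_count=1)

print(by_grouper)
print(by_resample)
```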


    I had thought the following would work, but it doesn't (due to as_index not being respected? I'm not sure.). I'm including this for interest's sake.

    If it's a column (it has to be a datetime64 column; as I say, convert it with to_datetime), you can use a PeriodIndex:

    In [21]: df
    Out[21]:
            Date  abc  xyz
    0 2013-06-01  100  200
    1 2013-06-03  -20   50
    2 2013-08-15   40   -5
    3 2014-01-20   25   15
    4 2014-02-21   60   80
    
    In [22]: pd.DatetimeIndex(df.Date).to_period("M")  # old way
    Out[22]:
    <class 'pandas.tseries.period.PeriodIndex'>
    [2013-06, ..., 2014-02]
    Length: 5, Freq: M
    
    In [23]: per = df.Date.dt.to_period("M")  # new way to get the same
    
    In [24]: g = df.groupby(per)
    
    In [25]: g[["abc", "xyz"]].sum()  # dang not quite what we want (doesn't fill in the gaps)
    Out[25]:
             abc  xyz
    2013-06   80  250
    2013-08   40   -5
    2014-01   25   15
    2014-02   60   80
    

    To get the desired result we have to reindex...
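    One way the reindexing step might look is sketched below. It assumes the goal is one row per month between the first and last observed month, with NaN for the gaps; the names totals, full_range and filled are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2013-06-01", "2013-06-03", "2013-08-15", "2014-01-20", "2014-02-21"]
    ),
    "abc": [100, -20, 40, 25, 60],
    "xyz": [200, 50, -5, 15, 80],
})

# Monthly period for each row, then per-month totals of the numeric columns
per = df["Date"].dt.to_period("M")
totals = df.groupby(per)[["abc", "xyz"]].sum()

# Every month from the first to the last observed period
full_range = pd.period_range(per.min(), per.max(), freq="M")

# Months missing from totals become NaN rows
filled = totals.reindex(full_range)
print(filled)
```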
