Pandas groupby month and year

后端未结

关注

 4  1386

I have the following dataframe:

Date        abc    xyz
01-Jun-13   100    200
03-Jun-13   -20    50
15-Aug-13   40     -5
20-Jan-14   25     15
21-Feb-14   6


                      
              相关标签:


      
      
        
          4条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  逝去的感伤        
                
              
                            
                2020-11-27 11:36
              
            
            
                                                                       
There are different ways to do that. 


I created the data frame to showcase the different techniques to filter your data. 



df = pd.DataFrame({'Date':['01-Jun-13','03-Jun-13', '15-Aug-13', '20-Jan-14', '21-Feb-14'],

  
  'abc':[100,-20,40,25,60],'xyz':[200,50,-5,15,80] })



I separated months/year/day and seperated month-year as you explained.



def getMonth(s):
  return s.split("-")[1]

def getDay(s):
  return s.split("-")[0]

def getYear(s):
  return s.split("-")[2]

def getYearMonth(s):
  return s.split("-")[1]+"-"+s.split("-")[2]




I created new columns: year, month, day and 'yearMonth'. In your case, you need one of both. You can group using two columns 'year','month' or using one column yearMonth



df['year']= df['Date'].apply(lambda x: getYear(x))
df['month']= df['Date'].apply(lambda x: getMonth(x))
df['day']= df['Date'].apply(lambda x: getDay(x))
df['YearMonth']= df['Date'].apply(lambda x: getYearMonth(x))



Output: 

        Date  abc  xyz year month day YearMonth
0  01-Jun-13  100  200   13   Jun  01    Jun-13
1  03-Jun-13  -20   50   13   Jun  03    Jun-13
2  15-Aug-13   40   -5   13   Aug  15    Aug-13
3  20-Jan-14   25   15   14   Jan  20    Jan-14
4  21-Feb-14   60   80   14   Feb  21    Feb-14



You can go through the different groups in groupby(..) items. 


In this case, we are grouping by two columns: 


for key,g in df.groupby(['year','month']):
    print key,g



Output: 

('13', 'Jun')         Date  abc  xyz year month day YearMonth
0  01-Jun-13  100  200   13   Jun  01    Jun-13
1  03-Jun-13  -20   50   13   Jun  03    Jun-13
('13', 'Aug')         Date  abc  xyz year month day YearMonth
2  15-Aug-13   40   -5   13   Aug  15    Aug-13
('14', 'Jan')         Date  abc  xyz year month day YearMonth
3  20-Jan-14   25   15   14   Jan  20    Jan-14
('14', 'Feb')         Date  abc  xyz year month day YearMonth


In this case, we are grouping by one column: 


for key,g in df.groupby(['YearMonth']):
    print key,g



Output: 

Jun-13         Date  abc  xyz year month day YearMonth
0  01-Jun-13  100  200   13   Jun  01    Jun-13
1  03-Jun-13  -20   50   13   Jun  03    Jun-13
Aug-13         Date  abc  xyz year month day YearMonth
2  15-Aug-13   40   -5   13   Aug  15    Aug-13
Jan-14         Date  abc  xyz year month day YearMonth
3  20-Jan-14   25   15   14   Jan  20    Jan-14
Feb-14         Date  abc  xyz year month day YearMonth
4  21-Feb-14   60   80   14   Feb  21    Feb-14



In case you wanna access to specific item, you can use get_group



  print df.groupby(['YearMonth']).get_group('Jun-13')


Output: 

        Date  abc  xyz year month day YearMonth
0  01-Jun-13  100  200   13   Jun  01    Jun-13
1  03-Jun-13  -20   50   13   Jun  03    Jun-13



Similar to get_group. This hack would help to filter values and get the grouped values.


This also would give the same result.

print df[df['YearMonth']=='Jun-13'] 


Output:

        Date  abc  xyz year month day YearMonth
0  01-Jun-13  100  200   13   Jun  01    Jun-13
1  03-Jun-13  -20   50   13   Jun  03    Jun-13


You can select list of abc or xyz values during Jun-13 

print df[df['YearMonth']=='Jun-13'].abc.values
print df[df['YearMonth']=='Jun-13'].xyz.values


Output: 

[100 -20]  #abc values
[200  50]  #xyz values


You can use this to go through the dates that you have classified as "year-month" and apply cretiria on it to get related data. 

for x in set(df.YearMonth): 
    print df[df['YearMonth']==x].abc.values
    print df[df['YearMonth']==x].xyz.values


I recommend also to check this answer as well.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  青春惊慌失措        
                
              
                            
                2020-11-27 11:49
              
            
            
                                                                       
Why not keep it simple?!

GB=DF.groupby([(DF.index.year),(DF.index.month)]).sum()


giving you,

print(GB)
        abc  xyz
2013 6   80  250
     8   40   -5
2014 1   25   15
     2   60   80


and then you can plot like asked using,

GB.plot('abc','xyz',kind='scatter')

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  慢半拍i        
                
              
                            
                2020-11-27 11:50
              
            
            
                                                                       
You can also do it by creating a string column with the year and month as follows:

df['date'] = df.index
df['year-month'] = df['date'].apply(lambda x: str(x.year) + ' ' + str(x.month))
grouped = df.groupby('year-month')


However this doesn't preserve the order when you loop over the groups, e.g. 

for name, group in grouped:
    print(name)


Will give:

2007 11
2007 12
2008 1
2008 10
2008 11
2008 12
2008 2
2008 3
2008 4
2008 5
2008 6
2008 7
2008 8
2008 9
2009 1
2009 10


So then, if you want to preserve the order, you must do as suggested by @Q-man above:

grouped = df.groupby([df.index.year, df.index.month])


This will preserve the order in the above loop:

(2007, 11)
(2007, 12)
(2008, 1)
(2008, 2)
(2008, 3)
(2008, 4)
(2008, 5)
(2008, 6)
(2008, 7)
(2008, 8)
(2008, 9)
(2008, 10)

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  予麋鹿        
                
              
                            
                2020-11-27 12:01
              
            
            
                                                                       
You can use either resample or Grouper (which resamples under the hood).

First make sure that the datetime column is actually of datetimes (hit it with pd.to_datetime). It's easier if it's a DatetimeIndex:

In [11]: df1
Out[11]:
            abc  xyz
Date
2013-06-01  100  200
2013-06-03  -20   50
2013-08-15   40   -5
2014-01-20   25   15
2014-02-21   60   80

In [12]: g = df1.groupby(pd.Grouper(freq="M"))  # DataFrameGroupBy (grouped by Month)

In [13]: g.sum()
Out[13]:
            abc  xyz
Date
2013-06-30   80  250
2013-07-31  NaN  NaN
2013-08-31   40   -5
2013-09-30  NaN  NaN
2013-10-31  NaN  NaN
2013-11-30  NaN  NaN
2013-12-31  NaN  NaN
2014-01-31   25   15
2014-02-28   60   80

In [14]: df1.resample("M", how='sum')  # the same
Out[14]:
            abc  xyz
Date
2013-06-30   40  125
2013-07-31  NaN  NaN
2013-08-31   40   -5
2013-09-30  NaN  NaN
2013-10-31  NaN  NaN
2013-11-30  NaN  NaN
2013-12-31  NaN  NaN
2014-01-31   25   15
2014-02-28   60   80


Note: Previously pd.Grouper(freq="M") was written as pd.TimeGrouper("M"). The latter is now deprecated since 0.21.



I had thought the following would work, but it doesn't (due to as_index not being respected? I'm not sure.). I'm including this for interest's sake.

If it's a column (it has to be a datetime64 column! as I say, hit it with to_datetime), you can use the PeriodIndex:

In [21]: df
Out[21]:
        Date  abc  xyz
0 2013-06-01  100  200
1 2013-06-03  -20   50
2 2013-08-15   40   -5
3 2014-01-20   25   15
4 2014-02-21   60   80

In [22]: pd.DatetimeIndex(df.Date).to_period("M")  # old way
Out[22]:
<class 'pandas.tseries.period.PeriodIndex'>
[2013-06, ..., 2014-02]
Length: 5, Freq: M

In [23]: per = df.Date.dt.to_period("M")  # new way to get the same

In [24]: g = df.groupby(per)

In [25]: g.sum()  # dang not quite what we want (doesn't fill in the gaps)
Out[25]:
         abc  xyz
2013-06   80  250
2013-08   40   -5
2014-01   25   15
2014-02   60   80


To get the desired result we have to reindex...
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复