In Pandas, after groupby the grouped column is gone

前端 未结 4 598
萌比男神i
萌比男神i 2021-01-04 09:48

I have the following dataframe named ttm:

    usersidid   clienthostid    eventSumTotal   LoginDaysSum    score
0       12          1               60                


        
相关标签:
4条回答
  • 2021-01-04 10:36

    For return DataFrame after groupby are 2 possible solutions:

    1. parameter as_index=False what works nice with count, sum, mean functions

    2. reset_index for create new column from levels of index, more general solution

    df = ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()
    print (df)
       clienthostid  LoginDaysSum
    0             1             4
    1             3             2
    
    df = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'].count().reset_index()
    print (df)
       clienthostid  LoginDaysSum
    0             1             4
    1             3             2
    

    For second need remove as_index=False and instead add reset_index:

    #output is `Series`
    a = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'] \
             .apply(lambda x: x.iloc[0] / x.iloc[1])
    print (a)
    clienthostid
    1    1.0
    3    1.5
    Name: LoginDaysSum, dtype: float64
    
    print (type(a))
    <class 'pandas.core.series.Series'>
    
    print (a.index)
    Int64Index([1, 3], dtype='int64', name='clienthostid')
    
    
    df1 = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum']
             .apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio')
    print (df1)
       clienthostid  ratio
    0             1    1.0
    1             3    1.5
    

    Why some columns are gone?

    I think there can be problem automatic exclusion of nuisance columns:

    #convert column to str
    ttm.usersidid = ttm.usersidid.astype(str) + 'aa'
    print (ttm)
      usersidid  clienthostid  eventSumTotal  LoginDaysSum  score
    0      12aa             1             60             3   1728
    1      11aa             1            240             3   1331
    3       5aa             1              5             3    125
    4       6aa             1             16             2    216
    2      10aa             3            270             3   1000
    5       8aa             3             18             2    512
    
    #removed str column userid
    a = ttm.groupby(['clienthostid'], sort=False).sum()
    print (a)
                  eventSumTotal  LoginDaysSum  score
    clienthostid                                    
    1                       321            11   3400
    3                       288             5   1512
    

    What is the difference between size and count in pandas?

    0 讨论(0)
  • 2021-01-04 10:39

    count is a built in method for the groupby object and pandas knows what to do with it. There are two other things specified that goes into determining what the out put looks like.

    #                         For a built in method, when
    #                         you don't want the group column
    #                         as the index, pandas keeps it in
    #                         as a column.
    #                             |----||||----|
    ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()
    
       clienthostid  LoginDaysSum
    0             1             4
    1             3             2
    

    #                         For a built in method, when
    #                         you do want the group column
    #                         as the index, then...
    #                             |----||||---|
    ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].count()
    #                                                       |-----||||-----|
    #                                                 the single brackets tells
    #                                                 pandas to operate on a series
    #                                                 in this case, count the series
    
    clienthostid
    1    4
    3    2
    Name: LoginDaysSum, dtype: int64
    

    ttm.groupby(['clienthostid'], as_index=True, sort=False)[['LoginDaysSum']].count()
    #                                                       |------||||------|
    #                                             the double brackets tells pandas
    #                                                to operate on the dataframe
    #                                              specified by these columns and will
    #                                                return a dataframe
    
                  LoginDaysSum
    clienthostid              
    1                        4
    3                        2
    

    When you used apply pandas no longer knows what to do with the group column when you say as_index=False. It has to trust that if you use apply you want returned exactly what you say to return, so it will just throw it away. Also, you have single brackets around your column which says to operate on a series. Instead, use as_index=True to keep the grouping column information in the index. Then follow it up with a reset_index to transfer it from the index back into the dataframe. At this point, it will not have mattered that you used single brackets because after the reset_index you'll have a dataframe again.

    ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1])
    
    0    1.0
    1    1.5
    dtype: float64
    

    ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index()
    
       clienthostid  LoginDaysSum
    0             1           1.0
    1             3           1.5
    
    0 讨论(0)
  • 2021-01-04 10:45

    Reading the groupy documentarion, a found out that automatic exclusion of columns after groupby usually caused by the presence of null values in that columns excluded.

    Try fill the 'null' with some value.

    Like this:

    df.fillna('')
    
    0 讨论(0)
  • 2021-01-04 10:46

    You simply need this instead:

    ttm.groupby(['clienthostid'], as_index=False, sort=False)[['LoginDaysSum']].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index()
    

    The double [[]] will turn the output into a pd.Dataframe instead of a pd.Series.

    0 讨论(0)
提交回复
热议问题