How to sum in pandas by unique index in several columns?

前端 未结 3 1804
耶瑟儿~
耶瑟儿~ 2021-02-04 09:32

I have a pandas DataFrame which details online activities in terms of \"clicks\" during an user session. There are as many as 50,000 unique users, and the dataframe has around 1

相关标签:
3条回答
  • 2021-02-04 10:11

    suppose your dataframe name is df, then do the following

    df.groupby(['User_ID']).sum()[['User_ID','clicks']]
    
    0 讨论(0)
  • 2021-02-04 10:26

    IIUC you can use groupby, sum and reset_index:

    print df
       User_ID Registration    Session  clicks
    0  2349876   2012-02-22 2014-04-24       2
    1  1987293   2011-02-01 2013-05-03       1
    2  2234214   2012-07-22 2014-01-22       7
    3  9874452   2010-12-22 2014-08-22       2
    
    print df.groupby('User_ID')['clicks'].sum().reset_index()
       User_ID  clicks
    0  1987293       1
    1  2234214       7
    2  2349876       2
    3  9874452       2
    

    If first column User_ID is index:

    print df
            Registration    Session  clicks
    User_ID                                
    2349876   2012-02-22 2014-04-24       2
    1987293   2011-02-01 2013-05-03       1
    2234214   2012-07-22 2014-01-22       7
    9874452   2010-12-22 2014-08-22       2
    
    print df.groupby(level=0)['clicks'].sum().reset_index()
       User_ID  clicks
    0  1987293       1
    1  2234214       7
    2  2349876       2
    3  9874452       2
    

    Or:

    print df.groupby(df.index)['clicks'].sum().reset_index()
       User_ID  clicks
    0  1987293       1
    1  2234214       7
    2  2349876       2
    3  9874452       2
    

    EDIT:

    As Alexander pointed, you need filter data before groupby, if Session dates is less as Registration dates per User_ID:

    print df
       User_ID Registration    Session  clicks
    0  2349876   2012-02-22 2014-04-24       2
    1  1987293   2011-02-01 2013-05-03       1
    2  2234214   2012-07-22 2014-01-22       7
    3  9874452   2010-12-22 2014-08-22       2
    
    print df[df.Session >= df.Registration].groupby('User_ID')['clicks'].sum().reset_index()
       User_ID  clicks
    0  1987293       1
    1  2234214       7
    2  2349876       2
    3  9874452       2
    

    I change 3. row of data for better sample:

    print df
            Registration    Session  clicks
    User_ID                                
    2349876   2012-02-22 2014-04-24       2
    1987293   2011-02-01 2013-05-03       1
    2234214   2012-07-22 2012-01-22       7
    9874452   2010-12-22 2014-08-22       2
    
    print df.Session >= df.Registration
    User_ID
    2349876     True
    1987293     True
    2234214    False
    9874452     True
    dtype: bool
    
    print df[df.Session >= df.Registration]
            Registration    Session  clicks
    User_ID                                
    2349876   2012-02-22 2014-04-24       2
    1987293   2011-02-01 2013-05-03       1
    9874452   2010-12-22 2014-08-22       2
    
    df1 = df[df.Session >= df.Registration]
    print df1.groupby(df1.index)['clicks'].sum().reset_index()
       User_ID  clicks
    0  1987293       1
    1  2349876       2
    2  9874452       2
    
    0 讨论(0)
  • 2021-02-04 10:30

    The first thing to do is filter registrations dates that precede the registration date, then group on the User_ID and sum.

    gb = (df[df.Session >= df.Registration]
          .groupby('User_ID')
          .clicks.agg({'Total_Clicks': np.sum}))
    
    >>> gb
             Total_Clicks
    User_ID              
    1987293             1
    2234214             7
    2349876             2
    9874452             2
    

    For the use case you mentioned, I believe this is scalable. It always depends, of course, on your available memory.

    0 讨论(0)
提交回复
热议问题