How to sum in pandas by unique index in several columns?

I have a pandas DataFrame which details online activities in terms of "clicks" during a user session. There are as many as 50,000 unique users, and the dataframe has around 1

3 Answers

迷失自我

2021-02-04 10:11
Assuming your DataFrame is named df, group by User_ID and sum the clicks:
```
# as_index=False keeps User_ID as a regular column in the result
df.groupby('User_ID', as_index=False)['clicks'].sum()
```
礼貌的吻别

2021-02-04 10:26

IIUC you can use groupby, sum and reset_index:

```
print(df)
   User_ID Registration    Session  clicks
0  2349876   2012-02-22 2014-04-24       2
1  1987293   2011-02-01 2013-05-03       1
2  2234214   2012-07-22 2014-01-22       7
3  9874452   2010-12-22 2014-08-22       2

print(df.groupby('User_ID')['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2
```

If User_ID is the index rather than a column:

```
print(df)
        Registration    Session  clicks
User_ID
2349876   2012-02-22 2014-04-24       2
1987293   2011-02-01 2013-05-03       1
2234214   2012-07-22 2014-01-22       7
9874452   2010-12-22 2014-08-22       2

print(df.groupby(level=0)['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2
```

Or:

```
print(df.groupby(df.index)['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2
```
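For anyone who wants to run the variants above side by side, here is a minimal, self-contained sketch. It rebuilds the sample data from this answer and assumes the date columns are parsed as datetimes; the variable names `by_column`, `by_level` and `by_index` are only for illustration.

```python
import pandas as pd

# Sample data copied from the answer above.
df = pd.DataFrame({
    "User_ID": [2349876, 1987293, 2234214, 9874452],
    "Registration": pd.to_datetime(
        ["2012-02-22", "2011-02-01", "2012-07-22", "2010-12-22"]),
    "Session": pd.to_datetime(
        ["2014-04-24", "2013-05-03", "2014-01-22", "2014-08-22"]),
    "clicks": [2, 1, 7, 2],
})

# Variant 1: User_ID is a regular column.
by_column = df.groupby("User_ID")["clicks"].sum().reset_index()

# Variants 2 and 3: User_ID is the index, so group by the index level
# or by the index object itself.
indexed = df.set_index("User_ID")
by_level = indexed.groupby(level=0)["clicks"].sum().reset_index()
by_index = indexed.groupby(indexed.index)["clicks"].sum().reset_index()

# All three should produce the same frame.
print(by_column.equals(by_level) and by_column.equals(by_index))
```

Which variant to use mostly depends on whether User_ID already lives in the index; the result is the same either way.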

EDIT:

As Alexander pointed out, you need to filter the data before the groupby, dropping rows where the Session date is earlier than the Registration date for that User_ID:

```
print(df)
   User_ID Registration    Session  clicks
0  2349876   2012-02-22 2014-04-24       2
1  1987293   2011-02-01 2013-05-03       1
2  2234214   2012-07-22 2014-01-22       7
3  9874452   2010-12-22 2014-08-22       2

print(df[df.Session >= df.Registration].groupby('User_ID')['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2
```

I changed the third row of the data to make a better sample:

```
print(df)
        Registration    Session  clicks
User_ID
2349876   2012-02-22 2014-04-24       2
1987293   2011-02-01 2013-05-03       1
2234214   2012-07-22 2012-01-22       7
9874452   2010-12-22 2014-08-22       2

print(df.Session >= df.Registration)
User_ID
2349876     True
1987293     True
2234214    False
9874452     True
dtype: bool

print(df[df.Session >= df.Registration])
        Registration    Session  clicks
User_ID
2349876   2012-02-22 2014-04-24       2
1987293   2011-02-01 2013-05-03       1
9874452   2010-12-22 2014-08-22       2

df1 = df[df.Session >= df.Registration]
print(df1.groupby(df1.index)['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2349876       2
2  9874452       2
```
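The sample above has one row per user, so the sum never combines anything. The sketch below uses hypothetical rows (not from the question) in which each user has more than one session, to show the filter and the sum working together; `valid` and `totals` are illustrative names.

```python
import pandas as pd

# Hypothetical rows: user 1987293 has two valid sessions; user 2234214 has
# one session dated before registration, which the filter should drop.
df = pd.DataFrame({
    "User_ID": [1987293, 1987293, 2234214, 2234214],
    "Registration": pd.to_datetime(
        ["2011-02-01", "2011-02-01", "2012-07-22", "2012-07-22"]),
    "Session": pd.to_datetime(
        ["2013-05-03", "2013-06-10", "2012-01-22", "2014-01-22"]),
    "clicks": [1, 4, 7, 3],
})

# Keep only sessions on or after the registration date, then sum per user.
valid = df[df["Session"] >= df["Registration"]]
totals = valid.groupby("User_ID")["clicks"].sum().reset_index()
print(totals)
```

Here the 7 clicks from the pre-registration session are excluded, so user 2234214 ends up with 3 and user 1987293 with 1 + 4 = 5.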

挽巷

2021-02-04 10:30
The first thing to do is filter out sessions dated before the registration date, then group on User_ID and sum.
```
# Named aggregation; renaming via agg({'Total_Clicks': np.sum}) fails on recent pandas
gb = (df[df.Session >= df.Registration]
      .groupby('User_ID')
      .agg(Total_Clicks=('clicks', 'sum')))

>>> gb
         Total_Clicks
User_ID              
1987293             1
2234214             7
2349876             2
9874452             2
```
For the use case you mentioned, I believe this is scalable. It always depends, of course, on your available memory.
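If memory does become the limit, one possible workaround (not part of the original answer) is to aggregate in chunks and then combine the partial sums. The sketch below assumes the data sits in a CSV file; an in-memory `io.StringIO` with hypothetical rows stands in for that file, and the chunk size of 2 is chosen only to force several chunks.

```python
import io

import pandas as pd

# Stand-in for a large CSV file on disk (hypothetical data).
csv_data = io.StringIO(
    "User_ID,Registration,Session,clicks\n"
    "1987293,2011-02-01,2013-05-03,1\n"
    "2234214,2012-07-22,2012-01-22,7\n"
    "1987293,2011-02-01,2013-06-10,4\n"
    "2234214,2012-07-22,2014-01-22,3\n"
)

partials = []
reader = pd.read_csv(
    csv_data, parse_dates=["Registration", "Session"], chunksize=2)
for chunk in reader:
    # Same filter as above, applied to one chunk at a time.
    valid = chunk[chunk["Session"] >= chunk["Registration"]]
    partials.append(valid.groupby("User_ID")["clicks"].sum())

# A user can appear in several chunks, so sum the partial sums once more.
total_clicks = pd.concat(partials).groupby(level=0).sum()
print(total_clicks)
```

This works because a sum of per-chunk sums equals the overall sum; the same trick would not carry over directly to aggregates such as a median.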