I have a pandas DataFrame which details online activities in terms of "clicks" during a user session. There are as many as 50,000 unique users, and the dataframe has around 1
The first thing to do is filter out session dates that precede the registration date, then group on User_ID and sum the clicks.
gb = (df[df.Session >= df.Registration]
      .groupby('User_ID')
      .agg(Total_Clicks=('clicks', 'sum')))
>>> gb
         Total_Clicks
User_ID
1987293             1
2234214             7
2349876             2
9874452             2
For the use case you mentioned, this should scale well; it depends, of course, on your available memory.
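A minimal self-contained sketch of the same approach, using made-up data shaped like the question describes (the column names and values here are assumptions for illustration):

```python
import pandas as pd

# Hypothetical click log: one row per click event, with the user's
# registration date and the session timestamp.
df = pd.DataFrame({
    'User_ID': [1987293, 2234214, 2234214, 2349876, 2349876, 9874452],
    'Registration': pd.to_datetime(['2020-01-05'] * 6),
    'Session': pd.to_datetime(['2020-01-06', '2020-01-04', '2020-01-07',
                               '2020-01-08', '2020-01-09', '2020-01-10']),
    'clicks': [1, 3, 7, 1, 1, 2],
})

# Keep only clicks that happened on or after registration,
# then sum clicks per user using named aggregation.
gb = (df[df.Session >= df.Registration]
      .groupby('User_ID')
      .agg(Total_Clicks=('clicks', 'sum')))

print(gb)
```

Note that the second row for user 2234214 (session 2020-01-04, before registration) is dropped by the boolean filter, so only the post-registration clicks are counted.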