Question
Is there a way to shift a DataFrame column depending on a condition involving two other columns? Something like:
df["cumulated_closed_value"] = df.groupby("user")['close_cumsum'].shiftWhile(df['close_time'] > df['open_time'])
I have figured out a way to do this but it's inefficient:
1) Load the data and create the column to shift:
import pandas as pd

df=pd.read_csv('data.csv')
df.sort_values(['user','close_time'],inplace=True)
df['close_cumsum']=df.groupby('user')['value'].cumsum()
df.sort_values(['user','open_time'],inplace=True)
print(df)
output:
user open_time close_time value close_cumsum
0 1 2017-01-01 2017-03-01 5 18
1 1 2017-01-02 2017-02-01 6 6
2 1 2017-02-03 2017-02-05 7 13
3 1 2017-02-07 2017-04-01 3 21
4 1 2017-09-07 2017-09-11 1 22
5 2 2018-01-01 2018-02-01 15 15
6 2 2018-03-01 2018-04-01 3 18
2) Shift the column with a self-join and some filters.
Self-join (this is memory-inefficient):
df2=pd.merge(df[['user','open_time']],df[['user','close_time','close_cumsum']], on='user')
Filter for 'close_time' < 'open_time', then keep the row with the maximum close_time:
df2=df2[df2['close_time']<df2['open_time']]
idx = df2.groupby(['user','open_time'])['close_time'].transform(max) == df2['close_time']
df2=df2[idx]
3) Merge with the original dataset:
df3=pd.merge(df[['user','open_time','close_time','value']],df2[['user','open_time','close_cumsum']],how='left')
print(df3)
output:
user open_time close_time value close_cumsum
0 1 2017-01-01 2017-03-01 5 NaN
1 1 2017-01-02 2017-02-01 6 NaN
2 1 2017-02-03 2017-02-05 7 6.0
3 1 2017-02-07 2017-04-01 3 13.0
4 1 2017-09-07 2017-09-11 1 21.0
5 2 2018-01-01 2018-02-01 15 NaN
6 2 2018-03-01 2018-04-01 3 15.0
Is there a more pandas-idiomatic way to get the same result?
Edit: I have added one data row to make the case clearer. My goal is to get, for each transaction, the sum of all transactions that closed before its opening time. For example, for the row opened on 2017-02-03, only the transaction closed on 2017-02-01 (value 6) counts, hence 6.0.
Answer 1:
I made a modification to your test case that I think you should include. This solution handles your edit.
import pandas as pd
import numpy as np
df = pd.read_csv("cond_shift.csv")
df
input:
user open_time close_time value
0 1 12/30/2016 12/31/2016 1
1 1 1/1/2017 3/1/2017 5
2 1 1/2/2017 2/1/2017 6
3 1 2/3/2017 2/5/2017 7
4 1 2/7/2017 4/1/2017 3
5 1 9/7/2017 9/11/2017 1
6 2 1/1/2018 2/1/2018 15
7 2 3/1/2018 4/1/2018 3
Create the columns to shift:
df["open_time"] = pd.to_datetime(df["open_time"])
df["close_time"] = pd.to_datetime(df["close_time"])
df.sort_values(['user','close_time'],inplace=True)
df['close_cumsum']=df.groupby('user')['value'].cumsum()
df.sort_values(['user','open_time'],inplace=True)
df
user open_time close_time value close_cumsum
0 1 2016-12-30 2016-12-31 1 1
1 1 2017-01-01 2017-03-01 5 19
2 1 2017-01-02 2017-02-01 6 7
3 1 2017-02-03 2017-02-05 7 14
4 1 2017-02-07 2017-04-01 3 22
5 1 2017-09-07 2017-09-11 1 23
6 2 2018-01-01 2018-02-01 15 15
7 2 2018-03-01 2018-04-01 3 18
Shift columns (explanation below):
df["cumulated_closed_value"] = df.groupby("user")["close_cumsum"].transform("shift")
condition = ~(df.groupby("user")['close_time'].transform("shift") < df["open_time"])
df.loc[ condition,"cumulated_closed_value" ] = None
df["cumulated_closed_value"] =df.groupby("user")["cumulated_closed_value"].fillna(method="ffill").fillna(0)
df
user open_time close_time value close_cumsum cumulated_closed_value
0 1 2016-12-30 2016-12-31 1 1 0.0
1 1 2017-01-01 2017-03-01 5 19 1.0
2 1 2017-01-02 2017-02-01 6 7 1.0
3 1 2017-02-03 2017-02-05 7 14 7.0
4 1 2017-02-07 2017-04-01 3 22 14.0
5 1 2017-09-07 2017-09-11 1 23 22.0
6 2 2018-01-01 2018-02-01 15 15 0.0
7 2 2018-03-01 2018-04-01 3 18 15.0
All of this has been written in such a way that it's done across all users. I believe the logic is easier to follow if you only focus on one user at a time:
- Shift the cumulative sum down one row. If no event were still open when the next one started, this alone would be the answer.
- Blank out the shifted value wherever the previous event had not yet closed when the current one opened.
- Fill in the missing values with a forward fill.
I would still thoroughly test this before you use it. Time intervals are weird and there are a lot of edge cases.
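To make these steps concrete, here is a minimal single-user sketch (a hand-built frame using the question's data for user 1; an illustration, not the answer's exact code):
import pandas as pd

# Toy frame for one user, rows sorted by open_time
one = pd.DataFrame({
    "open_time":  pd.to_datetime(["2017-01-01", "2017-01-02", "2017-02-03"]),
    "close_time": pd.to_datetime(["2017-03-01", "2017-02-01", "2017-02-05"]),
    "close_cumsum": [18, 6, 13],
})
# Step 1: shift the cumulative sum down one row
shifted = one["close_cumsum"].shift()
# Step 2: blank out rows whose previous event had not yet closed at this open_time
shifted[~(one["close_time"].shift() < one["open_time"])] = None
# Step 3: forward-fill the blanks, then fill the leading NaN with 0
one["cumulated_closed_value"] = shifted.ffill().fillna(0)
print(one)  # cumulated_closed_value: 0.0, 0.0, 6.0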
Answer 2:
I am using a new column here to record the condition df2['close_time'] < df2['open_time']:
import numpy as np

# True on row i when row i closes before row i+1 opens (the flag is shifted back onto the closing row)
df['New']=((df.open_time-df.close_time.shift()).dt.days>0).shift(-1)
# Per-user running total of the countable values, shifted down one row
s=df.groupby('user').apply(lambda x : (x['value']*x['New']).cumsum().shift()).reset_index(level=0,drop=True)
# Mask positions whose previous row was not countable
s.loc[~(df.New.shift()==True)]=np.nan
df['Cumsum']=s
df
Out[1043]:
user open_time close_time value New Cumsum
0 1 2017-01-01 2017-03-01 5 False NaN
1 1 2017-01-02 2017-02-01 6 True NaN
2 1 2017-02-03 2017-02-05 7 True 6
3 1 2017-02-07 2017-04-01 3 False 13
4 2 2017-01-01 2017-02-01 15 True NaN
5 2 2017-03-01 2017-04-01 3 NaN 15
Update: since the OP updated the question (data from Gabriel A):
# Map each user's full arrays of close times and values onto every row
df['New']=df.user.map(df.groupby('user').close_time.apply(lambda x: np.array(x)))
df['New1']=df.user.map(df.groupby('user').value.apply(lambda x: np.array(x)))
# For each row, flag which of this user's transactions closed before the row's open_time
df['New2']=[[x>m for m in y] for x,y in zip(df['open_time'],df['New'])]
# Sum the flagged values and drop the helper columns
df['Yourtarget']=list(map(sum,df['New2']*df['New1'].values))
df.drop(['New','New1','New2'],axis=1)
Out[1376]:
user open_time close_time value Yourtarget
0 1 2016-12-30 2016-12-31 1 0
1 1 2017-01-01 2017-03-01 5 1
2 1 2017-01-02 2017-02-01 6 1
3 1 2017-02-03 2017-02-05 7 7
4 1 2017-02-07 2017-04-01 3 14
5 1 2017-09-07 2017-09-11 1 22
6 2 2018-01-01 2018-02-01 15 0
7 2 2018-03-01 2018-04-01 3 15
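The list comprehension above is compact but hard to read. The same idea can be sketched as a plain per-user apply (the helper name closed_before is mine, not from the answer; same quadratic cost per user):

# For each row, sum this user's values whose close_time is strictly before the row's open_time
def closed_before(g):
    return g['open_time'].apply(
        lambda t: g.loc[g['close_time'] < t, 'value'].sum())

df['Yourtarget'] = df.groupby('user', group_keys=False).apply(closed_before)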
Answer 3:
(Note: @wen's answer seems fine to me, so I'm not sure if the OP is looking for something more or something different. In any event, here's an alternate approach using merge_asof that should also work well.)
First reshape the dataframes as follows:
lookup = (df[['close_time','value','user']].set_index(['user','close_time'])
          .sort_index().groupby('user').cumsum().reset_index(0))
df = df.set_index('open_time').sort_index()
The idea with "lookup" is simply to sort by "close_time" and then take a (grouped) cumulative sum:
user value
close_time
2017-02-01 1 6
2017-02-05 1 13
2017-03-01 1 18
2017-04-01 1 21
2017-09-11 1 22
2018-02-01 2 15
2018-04-01 2 18
For "df" we just take a subset of the original dataframe:
user close_time value
open_time
2017-01-01 1 2017-03-01 5
2017-01-02 1 2017-02-01 6
2017-02-03 1 2017-02-05 7
2017-02-07 1 2017-04-01 3
2017-09-07 1 2017-09-11 1
2018-01-01 2 2018-02-01 15
2018-03-01 2 2018-04-01 3
From here, you conceptually just want to merge the two datasets on "user" and "open_time"/"close_time", but the complicating factor is that we don't want an exact match on the time, but rather a sort of "nearest" match.
For these sorts of merges you can use merge_asof, which is a great tool for various non-exact matches (including 'nearest', 'backward', and 'forward'). Unfortunately, due to the inclusion of groupby, it's necessary to also loop over the users, but it's still pretty simple code to read (a loop-free variant is sketched after the results):
df_merged = pd.DataFrame()
for u in df['user'].unique():
    df_merged = df_merged.append(pd.merge_asof(df[df.user==u], lookup[lookup.user==u],
                                               left_index=True, right_index=True,
                                               direction='backward'))
df_merged.drop('user_y',axis=1).rename({'value_y':'close_cumsum'},axis=1)
Results:
user_x close_time value_x close_cumsum
open_time
2017-01-01 1 2017-03-01 5 NaN
2017-01-02 1 2017-02-01 6 NaN
2017-02-03 1 2017-02-05 7 6.0
2017-02-07 1 2017-04-01 3 13.0
2017-09-07 1 2017-09-11 1 21.0
2018-01-01 2 2018-02-01 15 NaN
2018-03-01 2 2018-04-01 3 15.0
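As an aside, merge_asof also accepts a by= parameter in recent pandas versions, which does the per-user grouping internally and avoids the explicit loop. A sketch under that assumption (the reset_index/rename shuffling is mine):

# Both frames must be globally sorted by their merge keys
left = df.reset_index()[['user', 'open_time', 'value']].sort_values('open_time')
right = (lookup.reset_index()
               .rename(columns={'value': 'close_cumsum'})
               .sort_values('close_time'))
df_merged = pd.merge_asof(left, right,
                          left_on='open_time', right_on='close_time',
                          by='user', direction='backward')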
Source: https://stackoverflow.com/questions/48646684/pandas-conditional-shift