Question
Is there a way to shift a DataFrame column depending on a condition involving two other columns? Something like:
df["cumulated_closed_value"] = df.groupby("user")['close_cumsum'].shiftWhile(df['close_time'] > df['open_time'])
I have figured out a way to do this but it's inefficient:
1) Load the data and create the column to shift:
import pandas as pd

df=pd.read_csv('data.csv')
df.sort_values(['user','close_time'],inplace=True)
df['close_cumsum']=df.groupby('user')['value'].cumsum()
df.sort_values(['user','open_time'],inplace=True)
print(df)
output:
user open_time close_time value close_cumsum
0 1 2017-01-01 2017-03-01 5 18
1 1 2017-01-02 2017-02-01 6 6
2 1 2017-02-03 2017-02-05 7 13
3 1 2017-02-07 2017-04-01 3 21
4 1 2017-09-07 2017-09-11 1 22
5 2 2018-01-01 2018-02-01 15 15
6 2 2018-03-01 2018-04-01 3 18
2) Shift the column with a self-join and some filters.
Self-join (this is memory-inefficient):
df2=pd.merge(df[['user','open_time']],df[['user','close_time','close_cumsum']], on='user')
Filter for 'close_time' < 'open_time', then keep the row with the maximum close_time:
df2=df2[df2['close_time']<df2['open_time']]
idx = df2.groupby(['user','open_time'])['close_time'].transform(max) == df2['close_time']
df2=df2[idx]
3) Merge with the original dataset:
df3=pd.merge(df[['user','open_time','close_time','value']],df2[['user','open_time','close_cumsum']],how='left')
print(df3)
output:
user open_time close_time value close_cumsum
0 1 2017-01-01 2017-03-01 5 NaN
1 1 2017-01-02 2017-02-01 6 NaN
2 1 2017-02-03 2017-02-05 7 6.0
3 1 2017-02-07 2017-04-01 3 13.0
4 1 2017-09-07 2017-09-11 1 21.0
5 2 2018-01-01 2018-02-01 15 NaN
6 2 2018-03-01 2018-04-01 3 15.0
Is there a more pandas-idiomatic way to get the same result?
Edit: I have added one data row to make the case clearer. My goal is to get, for each transaction, the sum of all transactions that closed before its opening time. For example, for the row opened on 2017-02-03, only the transaction closed on 2017-02-01 (value 6) counts, hence 6.0.
Answer 1:
I made a modification to your test case that I think you should include. This solution handles your edit.
import pandas as pd
import numpy as np
df = pd.read_csv("cond_shift.csv")
df
input:
user open_time close_time value
0 1 12/30/2016 12/31/2016 1
1 1 1/1/2017 3/1/2017 5
2 1 1/2/2017 2/1/2017 6
3 1 2/3/2017 2/5/2017 7
4 1 2/7/2017 4/1/2017 3
5 1 9/7/2017 9/11/2017 1
6 2 1/1/2018 2/1/2018 15
7 2 3/1/2018 4/1/2018 3
Create the columns to shift:
df["open_time"] = pd.to_datetime(df["open_time"])
df["close_time"] = pd.to_datetime(df["close_time"])
df.sort_values(['user','close_time'],inplace=True)
df['close_cumsum']=df.groupby('user')['value'].cumsum()
df.sort_values(['user','open_time'],inplace=True)
df
user open_time close_time value close_cumsum
0 1 2016-12-30 2016-12-31 1 1
1 1 2017-01-01 2017-03-01 5 19
2 1 2017-01-02 2017-02-01 6 7
3 1 2017-02-03 2017-02-05 7 14
4 1 2017-02-07 2017-04-01 3 22
5 1 2017-09-07 2017-09-11 1 23
6 2 2018-01-01 2018-02-01 15 15
7 2 2018-03-01 2018-04-01 3 18
Shift columns (explanation below):
df["cumulated_closed_value"] = df.groupby("user")["close_cumsum"].transform("shift")
condition = ~(df.groupby("user")['close_time'].transform("shift") < df["open_time"])
df.loc[ condition,"cumulated_closed_value" ] = None
df["cumulated_closed_value"] =df.groupby("user")["cumulated_closed_value"].fillna(method="ffill").fillna(0)
df
user open_time close_time value close_cumsum cumulated_closed_value
0 1 2016-12-30 2016-12-31 1 1 0.0
1 1 2017-01-01 2017-03-01 5 19 1.0
2 1 2017-01-02 2017-02-01 6 7 1.0
3 1 2017-02-03 2017-02-05 7 14 7.0
4 1 2017-02-07 2017-04-01 3 22 14.0
5 1 2017-09-07 2017-09-11 1 23 22.0
6 2 2018-01-01 2018-02-01 15 15 0.0
7 2 2018-03-01 2018-04-01 3 18 15.0
All of this has been written in such a way that it's done across all users. I believe the logic is easier to follow if you only focus on one user at a time:
- Shift the cumulative sum down one row. If no event were still open when the next one started, this alone would be the answer.
- Blank out the shifted value wherever the previous event had not yet closed when the current one opened.
- Fill in the missing values with a forward fill.
I would still thoroughly test this before you use it. Time intervals are weird and there are a lot of edge cases.
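To make these steps concrete, here is a minimal single-user sketch (a hand-built frame using the question's data for user 1; an illustration, not the answer's exact code):
import pandas as pd

# Toy frame for one user, rows sorted by open_time
one = pd.DataFrame({
    "open_time":  pd.to_datetime(["2017-01-01", "2017-01-02", "2017-02-03"]),
    "close_time": pd.to_datetime(["2017-03-01", "2017-02-01", "2017-02-05"]),
    "close_cumsum": [18, 6, 13],
})
# Step 1: shift the cumulative sum down one row
shifted = one["close_cumsum"].shift()
# Step 2: blank out rows whose previous event had not yet closed at this open_time
shifted[~(one["close_time"].shift() < one["open_time"])] = None
# Step 3: forward-fill the blanks, then fill the leading NaN with 0
one["cumulated_closed_value"] = shifted.ffill().fillna(0)
print(one)  # cumulated_closed_value: 0.0, 0.0, 6.0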
Answer 2:
I am using a new column here to record the condition df2['close_time'] < df2['open_time']:
import numpy as np

# True on row i when row i closes before row i+1 opens (the flag is shifted back onto the closing row)
df['New']=((df.open_time-df.close_time.shift()).dt.days>0).shift(-1)
# Per-user running total of the countable values, shifted down one row
s=df.groupby('user').apply(lambda x : (x['value']*x['New']).cumsum().shift()).reset_index(level=0,drop=True)
# Mask positions whose previous row was not countable
s.loc[~(df.New.shift()==True)]=np.nan
df['Cumsum']=s
df
Out[1043]:
user open_time close_time value New Cumsum
0 1 2017-01-01 2017-03-01 5 False NaN
1 1 2017-01-02 2017-02-01 6 True NaN
2 1 2017-02-03 2017-02-05 7 True 6
3 1 2017-02-07 2017-04-01 3 False 13
4 2 2017-01-01 2017-02-01 15 True NaN
5 2 2017-03-01 2017-04-01 3 NaN 15
Update: since the OP updated the question (data from Gabriel A):
# Map each user's full arrays of close times and values onto every row
df['New']=df.user.map(df.groupby('user').close_time.apply(lambda x: np.array(x)))
df['New1']=df.user.map(df.groupby('user').value.apply(lambda x: np.array(x)))
# For each row, flag which of this user's transactions closed before the row's open_time
df['New2']=[[x>m for m in y] for x,y in zip(df['open_time'],df['New'])]
# Sum the flagged values and drop the helper columns
df['Yourtarget']=list(map(sum,df['New2']*df['New1'].values))
df.drop(['New','New1','New2'],axis=1)
Out[1376]:
user open_time close_time value Yourtarget
0 1 2016-12-30 2016-12-31 1 0
1 1 2017-01-01 2017-03-01 5 1
2 1 2017-01-02 2017-02-01 6 1
3 1 2017-02-03 2017-02-05 7 7
4 1 2017-02-07 2017-04-01 3 14
5 1 2017-09-07 2017-09-11 1 22
6 2 2018-01-01 2018-02-01 15 0
7 2 2018-03-01 2018-04-01 3 15
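The list comprehension above is compact but hard to read. The same idea can be sketched as a plain per-user apply (the helper name closed_before is mine, not from the answer; same quadratic cost per user):

# For each row, sum this user's values whose close_time is strictly before the row's open_time
def closed_before(g):
    return g['open_time'].apply(
        lambda t: g.loc[g['close_time'] < t, 'value'].sum())

df['Yourtarget'] = df.groupby('user', group_keys=False).apply(closed_before)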
Answer 3:
(Note: @wen's answer seems fine to me, so I'm not sure if the OP is looking for something more or something different. In any event, here's an alternate approach using merge_asof that should also work well.)
First reshape the dataframes as follows:
lookup = (df[['close_time','value','user']].set_index(['user','close_time'])
          .sort_index().groupby('user').cumsum().reset_index(0))
df = df.set_index('open_time').sort_index()
The idea with "lookup" is simply to sort by "close_time" and then take a (grouped) cumulative sum:
user value
close_time
2017-02-01 1 6
2017-02-05 1 13
2017-03-01 1 18
2017-04-01 1 21
2017-09-11 1 22
2018-02-01 2 15
2018-04-01 2 18
For "df" we just take a subset of the original dataframe:
user close_time value
open_time
2017-01-01 1 2017-03-01 5
2017-01-02 1 2017-02-01 6
2017-02-03 1 2017-02-05 7
2017-02-07 1 2017-04-01 3
2017-09-07 1 2017-09-11 1
2018-01-01 2 2018-02-01 15
2018-03-01 2 2018-04-01 3
From here, you conceptually just want to merge the two datasets on "user" and "open_time"/"close_time", but the complicating factor is that we don't want an exact match on the time, but rather a sort of "nearest" match.
For these sorts of merges you can use merge_asof, which is a great tool for various non-exact matches (including 'nearest', 'backward', and 'forward'). Unfortunately, due to the inclusion of groupby, it's necessary to also loop over the users, but it's still pretty simple code to read (a loop-free variant is sketched after the results):
df_merged = pd.DataFrame()
for u in df['user'].unique():
    df_merged = df_merged.append(pd.merge_asof(df[df.user==u], lookup[lookup.user==u],
                                               left_index=True, right_index=True,
                                               direction='backward'))
df_merged.drop('user_y',axis=1).rename({'value_y':'close_cumsum'},axis=1)
Results:
user_x close_time value_x close_cumsum
open_time
2017-01-01 1 2017-03-01 5 NaN
2017-01-02 1 2017-02-01 6 NaN
2017-02-03 1 2017-02-05 7 6.0
2017-02-07 1 2017-04-01 3 13.0
2017-09-07 1 2017-09-11 1 21.0
2018-01-01 2 2018-02-01 15 NaN
2018-03-01 2 2018-04-01 3 15.0
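As an aside, merge_asof also accepts a by= parameter in recent pandas versions, which does the per-user grouping internally and avoids the explicit loop. A sketch under that assumption (the reset_index/rename shuffling is mine):

# Both frames must be globally sorted by their merge keys
left = df.reset_index()[['user', 'open_time', 'value']].sort_values('open_time')
right = (lookup.reset_index()
               .rename(columns={'value': 'close_cumsum'})
               .sort_values('close_time'))
df_merged = pd.merge_asof(left, right,
                          left_on='open_time', right_on='close_time',
                          by='user', direction='backward')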
Source: https://stackoverflow.com/questions/48646684/pandas-conditional-shift