Question
I'm trying to compute a cumulative sum with a reset within a dataframe, based on the sign of each value. The idea is to do the same exercise for each column separately.
For example, let's assume I have the following dataframe:
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, -1, -1, 1, 1, 1, 1, -1, -1, -1],
                   'B': [1, 1, -1, -1, -1, 1, 1, 1, -1, -1, -1, 1]},
                  index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
For each column, I want to compute the cumulative sum until I find a change in sign; in which case, the sum should be reset to 1. For the example above, I am expecting the following result:
df1 = pd.DataFrame({'A_cumcount': [1, 2, 3, 1, 2, 1, 2, 3, 4, 1, 2, 3],
                    'B_cumcount': [1, 2, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1]},
                   index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
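For reference, the reset rule described above can be written as a plain-Python sketch, which may be handy for checking any vectorized version against. The helper name `run_count` is hypothetical and not part of pandas; it is shown here on column A only.

```python
def run_count(values):
    """Number the positions inside each run of equal consecutive values, starting at 1."""
    counts = []
    previous = object()  # sentinel that never equals a real value
    for value in values:
        # continue the current run if the value repeats, otherwise reset to 1
        if counts and value == previous:
            counts.append(counts[-1] + 1)
        else:
            counts.append(1)
        previous = value
    return counts

a = [1, 1, 1, -1, -1, 1, 1, 1, 1, -1, -1, -1]
print(run_count(a))  # [1, 2, 3, 1, 2, 1, 2, 3, 4, 1, 2, 3]
```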
A similar issue has been discussed here: Pandas: conditional rolling count
I have tried the following code:
nb_col = len(df.columns)  # number of columns in the dataframe
for i in range(nb_col):  # loop over the original columns
    col = df.columns[i]  # read the column name
    name = col + '_cumcount'
    # a new run starts wherever the value differs from the previous row
    run_id = (df[col] != df[col].shift(1)).cumsum()
    # assigning to a new label appends the column, so no reindex is needed
    df[name] = df.groupby(run_id).cumcount() + 1
My question is: is there a way to avoid this for loop, so that I don't have to append a new column each time and the computation runs faster? Thank you.
Answers received (all working fine):
From @nixon
df.apply(lambda x: x.groupby(x.diff().ne(0).cumsum()).cumcount()+1).add_suffix('_cumcount')
From @jezrael
df1 = (df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount() + 1).add_suffix('_cumcount'))
From @Scott Boston:
df.apply(lambda x: x.groupby(x.diff().bfill().ne(0).cumsum()).cumcount() + 1)
Answer 1:
I think a loop is needed in pandas, e.g. via apply:
df1 = (df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount() + 1)
.add_suffix('_cumcount'))
print(df1)
    A_cumcount  B_cumcount
0            1           1
1            2           2
2            3           1
3            1           2
4            2           3
5            1           1
6            2           2
7            3           3
8            4           1
9            1           2
10           2           3
11           3           1
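To see why the grouping key works, it may help to look at the intermediate steps for a single column. The following is a minimal sketch on column A from the question; the variable names `is_new_run`, `run_id`, `count` and `steps` are only illustrative.

```python
import pandas as pd

a = pd.Series([1, 1, 1, -1, -1, 1, 1, 1, 1, -1, -1, -1], name='A')

# True wherever the value differs from the previous row
# (the first row is compared with NaN, so it is always True)
is_new_run = a != a.shift()
# a running total of the True flags gives every run its own integer label
run_id = is_new_run.cumsum()
# cumcount numbers the rows inside each run from 0, hence the +1
count = a.groupby(run_id).cumcount() + 1

steps = pd.DataFrame({'A': a, 'is_new_run': is_new_run,
                      'run_id': run_id, 'count': count})
print(steps)
```

Here `run_id` comes out as 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, so each block of equal values is counted on its own.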
Answer 2:
You can try this:
df.apply(lambda x: x.groupby(x.diff().bfill().ne(0).cumsum()).cumcount() + 1)
Output:
    A  B
0   1  1
1   2  2
2   3  1
3   1  2
4   2  3
5   1  1
6   2  2
7   3  3
8   4  1
9   1  2
10  2  3
11  3  1
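Note that a key built on diff() starts a new run on any change in value, which matches a change in sign only because the example data is limited to 1 and -1. If the columns held other magnitudes, one possible variant, sketched here under that assumption, is to build the key on np.sign instead. The names `sign_run_count` and `demo` are hypothetical, and zeros are treated as a sign of their own.

```python
import numpy as np
import pandas as pd

def sign_run_count(df):
    """Per-column running count that resets whenever the sign changes."""
    def one_column(x):
        sign = np.sign(x)                        # -1, 0 or 1 for each value
        run_id = sign.ne(sign.shift()).cumsum()  # new label at every sign change
        return x.groupby(run_id).cumcount() + 1
    return df.apply(one_column).add_suffix('_cumcount')

# made-up data whose magnitudes vary while the sign stays the same
demo = pd.DataFrame({'A': [1, 2, 3, -1, -5, 4]})
print(sign_run_count(demo)['A_cumcount'].tolist())  # [1, 2, 3, 1, 2, 1]
```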
Answer 3:
You can start by grouping on where the values in the sequence change, using x.diff().ne(0).cumsum(), and then use cumcount over the groups:
df.apply(lambda x: x.groupby(x.diff().ne(0).cumsum())
.cumcount()+1).add_suffix('_cumcount')
    A_cumcount  B_cumcount
0            1           1
1            2           2
2            3           1
3            1           2
4            2           3
5            1           1
6            2           2
7            3           3
8            4           1
9            1           2
10           2           3
11           3           1
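Since apply still loops over the columns internally, here is a rough sketch of how all columns could be handled at once with NumPy, which is closer to what the question asks for. It is an alternative to the groupby approach, not the method used in the answers. The function name `run_count_numpy` is hypothetical, no benchmark is claimed, and it assumes the frame contains no NaN (NaN never equals itself, so every NaN would start a new run).

```python
import numpy as np
import pandas as pd

def run_count_numpy(df):
    """Running count per column that resets whenever the value changes."""
    values = df.to_numpy()
    # mark the first row of every run; row 0 always starts one
    starts = np.ones(values.shape, dtype=bool)
    starts[1:] = values[1:] != values[:-1]
    row = np.arange(len(df))[:, None]
    # for each cell, the row index at which its current run began
    last_start = np.maximum.accumulate(np.where(starts, row, 0), axis=0)
    counts = row - last_start + 1
    return pd.DataFrame(counts, index=df.index,
                        columns=df.columns).add_suffix('_cumcount')

df = pd.DataFrame({'A': [1, 1, 1, -1, -1, 1, 1, 1, 1, -1, -1, -1],
                   'B': [1, 1, -1, -1, -1, 1, 1, 1, -1, -1, -1, 1]})
print(run_count_numpy(df))
```

The idea is that a cell's count equals its row position minus the position where its run started, plus one; the running maximum carries each start position down through the rest of its run.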
Source: https://stackoverflow.com/questions/53614476/conditional-count-of-cumulative-sum-dataframe-loop-through-columns