Suppose I have a dataframe like this:
ID 0 1 2 3 4 5 6 7 8 ... 81 82 83 84 85 86 87 88 89 90 total day_90
----------
Your problem is basically a variant of the island-and-gap problem: a non-zero creates a new "island", while a 0 extends the current island. You want to find the first island that is at least a certain size. Before I answer your question, let me walk you through a simplified version of the problem.
Let's say you have a Series:
>>> a = pd.Series([0,0,0,13,0,0,4,12,0,0])
0 0
1 0
2 0
3 13
4 0
5 0
6 4
7 12
8 0
9 0
And you want to find the length of the first sequence of 0s that is at least 3 elements long. Let's first assign the elements to "islands":
# Every time the number is non-zero, a new "island" is created
>>> b = (a != 0).cumsum()
0 0 <-- island 0
1 0
2 0
3 1 <-- island 1
4 1
5 1
6 2 <-- island 2
7 3 <-- island 3
8 3
9 3
For each island, we are only interested in elements that are equal to 0:
>>> c = b[a == 0]
0 0
1 0
2 0
4 1
5 1
8 3
9 3
Now let's determine the size of each island:
>>> d = c.groupby(c).count()
0 3 <-- island 0 is of size 3
1 2 <-- island 1 is of size 2
3 2 <-- island 3 is of size 2
dtype: int64
And filter for islands whose size >= 3:
>>> e = d[d >= 3]
0 3
The answer is the first element of `e` (island 0, size 3) if `e` is not empty. Otherwise, there's no island meeting our criteria.
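Putting the mini-example together (just a sketch; the helper name `first_zero_run_length` is mine, purely illustrative):
def first_zero_run_length(a, n):
    """Length of the first run of 0s in `a` that is at least `n` long, else 0."""
    b = (a != 0).cumsum()      # each non-zero starts a new island
    c = b[a == 0]              # keep only the zero positions
    d = c.groupby(c).count()   # size of each island of 0s
    e = d[d >= n]              # islands that are long enough
    return 0 if len(e) == 0 else e.iloc[0]

>>> first_zero_run_length(a, 3)
3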
And applying it to your problem:
import numpy as np

def count_sequence_length(row, n):
    """Return the length of the first sequence of 0s after the column
    given by `day_90` whose length is >= n.
    """
    if row['day_90'] + n > 90:
        return 0
    # The columns after `day_90`
    idx = np.arange(row['day_90'] + 1, 91)
    a = row[idx]
    b = (a != 0).cumsum()
    c = b[a == 0]
    d = c.groupby(c).count()
    e = d[d >= n]
    return 0 if len(e) == 0 else e.iloc[0]

df['0_sequence'] = df.apply(count_sequence_length, n=7, axis=1)
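Note that this assumes the day columns are labelled with the integers 1 through 90, so that `row[idx]` is a label-based lookup. If your actual columns are strings ('1', '2', ..., '90'), a quick rename along these lines (an assumption about your data) is needed first:
# Assumption: the day columns are named as strings; convert them to integers
df.columns = [int(c) if isinstance(c, str) and c.isdigit() else c for c in df.columns]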
The above version is nice, but slow because it calculates the sizes of all islands. Since you only care about the size of the first island meeting the criteria, a simple `for` loop works much faster:
def count_sequence_length_2(row, n):
    if row['day_90'] + n > 90:
        return 0
    size = 0
    for i in range(row['day_90'] + 1, 91):
        if row[i] == 0:
            # increase the size of the current island
            size += 1
        elif size >= n:
            # found the island we want; search no more
            break
        else:
            # a non-zero creates a new island
            size = 0
    return size if size >= n else 0

df['0_sequence'] = df.apply(count_sequence_length_2, n=7, axis=1)
This achieves a speedup of between 10x and 20x when I benchmark it.
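For reference, here is a minimal sketch of how one might reproduce the benchmark with a synthetic DataFrame (the shape, value range and `day_90` values below are assumptions, not your real data):
import timeit
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# 1000 rows with integer day columns 0..90 (values 0-2), plus a random 'day_90'
df = pd.DataFrame(rng.integers(0, 3, size=(1000, 91)), columns=range(91))
df['day_90'] = rng.integers(0, 30, size=1000)

t1 = timeit.timeit(lambda: df.apply(count_sequence_length, n=7, axis=1), number=3)
t2 = timeit.timeit(lambda: df.apply(count_sequence_length_2, n=7, axis=1), number=3)
print(f'groupby version: {t1:.2f}s, loop version: {t2:.2f}s, ratio: {t1 / t2:.1f}x')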
Here is my solution; see the comments in the code:
import io
import numpy as np
import pandas as pd

# Test data:
text = """ ID   0   1   2   3  4  5   6   7  8  day_90
0  A   2  21   0  18  3  0   0   0  2       4
1  B   0  20  12   2  0  8  14  23  0       5
2  C   0  38  19   3  1  3   3   7  1       1
3  D   3   0   0   1  0  0   0   0  0       0"""
df = pd.read_csv(io.StringIO(text), sep=r"\s+", engine="python")
# ------------------------
# Convert the day column names into integers:
cols = list(range(9))
df.columns = ["ID"] + cols + ["day_90"]
# ----------
istart, istop = df.columns.get_loc(0), df.columns.get_loc(8) + 1
# The required length of the first zero sequence:
lseq = 2

# The aggregating function: this is the main calculation, 'r' is a row of 'df':
def zz(r):
    # The day columns, starting at the day given in 'day_90':
    s = r.iloc[r.day_90 + istart:istop]
    # Manipulate 's' so that 'groupby' can separate the different zero sequences:
    crit = s.eq(0)
    s = pd.Series(np.where(crit, np.nan, np.arange(len(s))), index=s.index)
    if np.isnan(s.iloc[0]):
        # Give a leading zero run its own sentinel label; -1 cannot collide
        # with the position labels produced by np.arange:
        s.iloc[0] = -1
    s = s.ffill()
    s[~crit] = np.nan
    # Get the zero sequences and their sizes:
    ssiz = s.groupby(s).size()
    return ssiz.iloc[0] if len(ssiz) and ssiz.iloc[0] > lseq else np.nan

df["zseq"] = df.agg(zz, axis=1)
  ID   0   1   2   3  4  5   6   7  8  day_90  zseq
0  A   2  21   0  18  3  0   0   0  2       4   3.0
1  B   0  20  12   2  0  8  14  23  0       5   NaN
2  C   0  38  19   3  1  3   3   7  1       1   NaN
3  D   3   0   0   1  0  0   0   0  0       0   NaN
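For the full 0-90 layout from the question, only the bounds and the threshold need to change (a sketch, assuming the real frame has integer day columns 0 through 90, analogous to the reduced 0..8 test frame above):
# Assuming integer day columns 0..90 instead of the reduced 0..8 above:
istart, istop = df.columns.get_loc(0), df.columns.get_loc(90) + 1
lseq = 7  # require the first zero run to be longer than 7 days
df["zseq"] = df.agg(zz, axis=1)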