Length of first sequence of zeros of given size after certain column in pandas dataframe

Suppose I have a dataframe like this:

        ID      0   1   2   3   4   5   6   7   8   ... 81  82  83  84  85  86  87  88  89  90  total  day_90
        
2 Answers

面向向阳花
2020-12-20 00:18

Your problem is basically a variant of the island-and-gap problem: a non-zero value starts a new "island", while a 0 extends the current island. You want to find the first island that reaches a certain size. Before I answer your question, let me walk you through a minified version of the problem.
Let's say you have a Series:
>>> import pandas as pd
>>> a = pd.Series([0,0,0,13,0,0,4,12,0,0])
0     0
1     0
2     0
3    13
4     0
5     0
6     4
7    12
8     0
9     0

And you want to find the length of the first sequence of 0s that is at least 3 elements long. Let's first assign the elements to "islands":
# Every time the number is non-zero, a new "island" is created
>>> b = (a != 0).cumsum()
0    0  <-- island 0
1    0
2    0
3    1  <-- island 1
4    1
5    1
6    2  <-- island 2
7    3  <-- island 3
8    3
9    3

For each island, we are only interested in elements that are equal to 0:
>>> c = b[a == 0]
0    0
1    0
2    0
4    1
5    1
8    3
9    3

Now let's determine the size of each island:
>>> d = c.groupby(c).count()
0    3  <-- island 0 is of size 3
1    2  <-- island 1 is of size 2
3    2  <-- island 3 is of size 2
dtype: int64

And filter for islands whose size >= 3:
>>> e = d[d >= 3]
0    3

The answer is the first element of e (island 0, size 3) if e is not empty. Otherwise, there's no island meeting our criteria.
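
The steps a to e above can be collected into one helper. This is only a sketch of the walk-through: the name `first_island_size` is made up for illustration, and it returns 0 when no island qualifies.

```python
import pandas as pd


def first_island_size(a: pd.Series, n: int) -> int:
    """Size of the first run of zeros in `a` that has at least `n` elements, else 0."""
    b = (a != 0).cumsum()     # every non-zero value starts a new island
    c = b[a == 0]             # keep only the zeros, labelled by island
    d = c.groupby(c).count()  # number of zeros per island, in island order
    e = d[d >= n]             # islands that are long enough
    return 0 if e.empty else int(e.iloc[0])


a = pd.Series([0, 0, 0, 13, 0, 0, 4, 12, 0, 0])
print(first_island_size(a, 3))  # 3: island 0 has three zeros
print(first_island_size(a, 4))  # 0: no island is that long
```

Because the island labels only ever increase, the groupby result is already in order of appearance, so `e.iloc[0]` is the first qualifying island.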

First Attempt
And applying it to your problem:
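import numpy as np

def count_sequence_length(row, n):
    """Return the length of the first sequence of 0s, after the
    column given in `day_90`, whose length is >= n (0 if there is none).
    """
    # Fewer than n columns are left after `day_90`
    if row['day_90'] + n > 90:
        return 0
    
    # The columns after `day_90`
    idx = np.arange(row['day_90']+1, 91)

    a = row.loc[idx]           # select the day columns by label
    b = (a != 0).cumsum()      # island id for every column
    c = b[a == 0]              # keep only the zeros
    d = c.groupby(c).count()   # size of each island
    e = d[d >= n]              # islands that are long enough
    
    return 0 if len(e) == 0 else e.iloc[0]

df['0_sequence'] = df.apply(count_sequence_length, n=7, axis=1)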


Second Attempt
The version above works, but it is slow because it calculates the size of every island. Since you only care about the size of the first island meeting the criteria, a simple for loop is much faster:
def count_sequence_length_2(row, n):
    if row['day_90'] + n > 90:
        return 0
    
    size = 0
    for i in range(row['day_90']+1, 91):
        if row[i] == 0:
            # increase the size of the current island
            size += 1
        elif size >= n:
            # found the island we want. Search no more
            break
        else:
            # create a new island
            size = 0
    return size if size >= n else 0

df['0_sequence'] = df.apply(count_sequence_length_2, n=7, axis=1)

In my benchmark, this was 10 to 20 times faster.
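
One way to check that the loop and the groupby version return the same answer is to run both on random data. The sketch below is a toy under stated assumptions: it uses 10 day columns (0 to 9) instead of 91, has no ID column, and the names `LAST_DAY`, `by_groupby` and `by_loop` are hypothetical, generalised copies of the two functions above.

```python
import numpy as np
import pandas as pd

LAST_DAY = 9  # assumption: day columns are 0..9 rather than 0..90


def by_groupby(row, n, last_day=LAST_DAY):
    """Island/groupby version: size of the first zero run with >= n zeros."""
    if row["day_90"] + n > last_day:
        return 0
    a = row.loc[list(range(row["day_90"] + 1, last_day + 1))]
    b = (a != 0).cumsum()
    c = b[a == 0]
    d = c.groupby(c).count()
    e = d[d >= n]
    return 0 if len(e) == 0 else int(e.iloc[0])


def by_loop(row, n, last_day=LAST_DAY):
    """Early-exit loop version of the same calculation."""
    if row["day_90"] + n > last_day:
        return 0
    size = 0
    for i in range(row["day_90"] + 1, last_day + 1):
        if row.loc[i] == 0:
            size += 1      # extend the current island
        elif size >= n:
            break          # the island that just ended is long enough
        else:
            size = 0       # start over
    return size if size >= n else 0


# Random test frame: values 0..2 so that zeros are common
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 3, size=(200, LAST_DAY + 1)),
                  columns=range(LAST_DAY + 1))
df["day_90"] = rng.integers(0, LAST_DAY + 1, size=len(df))

r1 = df.apply(by_groupby, n=2, axis=1)
r2 = df.apply(by_loop, n=2, axis=1)
print((r1 == r2).all())  # expected: True
```

Agreement on random rows is not a proof, but it catches off-by-one mistakes in the column range quickly.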
                                                                        
                                                        
            
            
              
                
甜味超标
2020-12-20 00:44

Here is my solution; see the comments in the code:

import numpy as np, pandas as pd
import io

# Test data:
text="""  ID  0   1   2   3  4  5   6   7  8  day_90
        0  A  2  21   0  18  3  0   0   0  2       4
        1  B  0  20  12   2  0  8  14  23  0       5
        2  C  0  38  19   3  1  3   3   7  1       1
        3  D  3   0   0   1  0  0   0   0  0       0"""

df= pd.read_csv( io.StringIO(text),sep=r"\s+",engine="python")
#------------------------

# Convert some column names into integer:
cols= list(range(9))
df.columns= ["ID"]+ cols +["day_90"]

#----------
istart,istop= df.columns.get_loc(0), df.columns.get_loc(8)+1
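# The first zero sequence must be strictly longer than this (see the '>' test in 'zz'):
lseq= 2

# The function for aggregating: this is the main calculation, 'r' is a row of 'df':
def zz(r):

     s= r.iloc[r.day_90+istart:istop] # get the day columns, starting at the day given in 'day_90'
     #--- Manipulate 's' to make possible using 'groupby' for getting different sequences:
     crit=s.eq(0)
     # non-zero cells are labelled with their position, zero cells become NaN:
     s= pd.Series(np.where(crit, np.nan, np.arange(len(s))),index=s.index)
     if np.isnan(s.iloc[0]):
       # leading zeros need a label of their own; -1 cannot clash with a position
       # (a label of 1 would merge them with zeros following a non-zero at position 1)
       s.iloc[0]= -1
     s= s.ffill()      # zero cells inherit the label of the preceding non-zero cell
     s[~crit]= np.nan  # drop the non-zero cells themselves
     #---
     # get the sequences and their sizes:
     ssiz= s.groupby(s).size()
     return ssiz.iloc[0] if len(ssiz) and ssiz.iloc[0]>lseq else np.nan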
#---

df["zseq"]= df.agg(zz,axis=1)

  ID  0   1   2   3  4  5   6   7  8  day_90  zseq
0  A  2  21   0  18  3  0   0   0  2       4   3.0
1  B  0  20  12   2  0  8  14  23  0       5   NaN
2  C  0  38  19   3  1  3   3   7  1       1   NaN
3  D  3   0   0   1  0  0   0   0  0       0   NaN
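
Row D shows how this answer reads the question: `zz` inspects only the first run of zeros and reports it when it is longer than `lseq`, so the later run of five zeros is ignored. The sketch below contrasts that with "first run of at least n zeros" on plain lists. It uses `itertools.groupby` from the standard library rather than the pandas approach above, and the function names are made up for illustration.

```python
from itertools import groupby


def zero_runs(values):
    """Lengths of the consecutive runs of zeros, in order of appearance."""
    return [len(list(group))
            for is_zero, group in groupby(values, key=lambda v: v == 0)
            if is_zero]


def first_run_if_longer(values, lseq):
    """Length of the very first zero run if it exceeds `lseq`, else None."""
    runs = zero_runs(values)
    return runs[0] if runs and runs[0] > lseq else None


def first_qualifying_run(values, n):
    """Length of the first zero run with at least `n` zeros, else 0."""
    return next((r for r in zero_runs(values) if r >= n), 0)


row_d = [3, 0, 0, 1, 0, 0, 0, 0, 0]      # day columns 0..8 of row D
print(zero_runs(row_d))                  # [2, 5]
print(first_run_if_longer(row_d, 2))     # None: the first run has only 2 zeros
print(first_qualifying_run(row_d, 3))    # 5: the second run is the first with >= 3
```

Which of the two readings is wanted depends on the question; the title ("first sequence of zeros of given size") suggests the second.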

                                                                        
                                                        
            
            
              
                