Length of first sequence of zeros of given size after certain column in pandas dataframe

前端 未结 2 705
清酒与你
清酒与你 2020-12-19 23:45

Suppose I have a dataframe like this:

        ID      0   1   2   3   4   5   6   7   8   ... 81  82  83  84  85  86  87  88  89  90  total  day_90
----------         


        
相关标签:
2条回答
  • 2020-12-20 00:18

    Your problem is basically a variant of the island-and-gap problem: a non-zero creates a new "island" while a 0 extend the current island. And you want to find the first island that is of a certain size. Before I answer your question, let me walk you through a minified version of the problem.

    Let's say you have a Series:

    >>> a = pd.Series([0,0,0,13,0,0,4,12,0,0])
    0     0
    1     0
    2     0
    3    13
    4     0
    5     0
    6     4
    7    12
    8     0
    9     0
    

    And you want to find the length of the first sequence of 0s that is at least 3-element in length. Let's first assign them into "islands":

    # Every time the number is non-zero, a new "island" is created
    >>> b = (a != 0).cumsum()
    0    0  <-- island 0
    1    0
    2    0
    3    1  <-- island 1
    4    1
    5    1
    6    2  <-- island 2
    7    3  <-- island 3
    8    3
    9    3
    

    For each island, we are only interested in elements that are equal to 0:

    >>> c = b[a == 0]
    0    0
    1    0
    2    0
    4    1
    5    1
    8    3
    9    3
    

    Now let's determine the size of each island:

    >>> d = c.groupby(c).count()
    0    3  <-- island 0 is of size 3
    1    2  <-- island 1 is of size 2
    3    2  <-- island 3 is of size 2
    dtype: int64
    

    And filter for islands whose size >= 3:

    >>> e = d[d >= 3]
    0    3
    

    The answer is the first element of e (island 0, size 3) if e is not empty. Otherwise, there's no island meeting our criteria.


    First Attempt

    And applying it to your problem:

    def count_sequence_length(row, n):
        """Return of the length of the first sequence of 0
        after the column in `day_90` whose length is >= n
        """
        if row['day_90'] + n > 90:
            return 0
        
        # The columns after `day_90`
        idx = np.arange(row['day_90']+1, 91)
    
        a = row[idx]
        b = (a != 0).cumsum()
        c = b[a == 0]
        d = c.groupby(c).count()
        e = d[d >= n]
        
        return 0 if len(e) == 0 else e.iloc[0]
    
    df['0_sequence'] = df.apply(count_sequence_length, n=7, axis=1)
    

    Second Attempt

    The above version is nice, but slow because it calculates the size of all islands. Since you only care about the size of first the island meeting the criteria, a simple for loop works much faster:

    def count_sequence_length_2(row, n):
        if row['day_90'] + n > 90:
            return 0
        
        size = 0
        for i in range(row['day_90']+1, 91):
            if row[i] == 0:
                # increase the size of the current island
                size += 1
            elif size >= n:
                # found the island we want. Search no more
                break
            else:
                # create a new island
                size = 0
        return size if size >= n else 0
    
    df['0_sequence'] = df.apply(count_sequence_length_2, n=7, axis=1)
    

    This achieves a speed up between 10 - 20x on when I benchmark it.

    0 讨论(0)
  • 2020-12-20 00:44

    Here is my solution, see the comments in the code:

    import numpy as np, pandas as pd
    import io
    
    # Test data:
    text="""  ID  0   1   2   3  4  5   6   7  8  day_90
            0  A  2  21   0  18  3  0   0   0  2       4
            1  B  0  20  12   2  0  8  14  23  0       5
            2  C  0  38  19   3  1  3   3   7  1       1
            3  D  3   0   0   1  0  0   0   0  0       0"""
    
    df= pd.read_csv( io.StringIO(text),sep=r"\s+",engine="python")
    #------------------------
    
    # Convert some column names into integer:
    cols= list(range(9))
    df.columns= ["ID"]+ cols +["day_90"]
    
    #----------
    istart,istop= df.columns.get_loc(0), df.columns.get_loc(8)+1
    # The required length of the 1st zero sequence:
    lseq= 2
    
    # The function for aggregating: this is the main calculation, 'r' is a row of 'df':
    def zz(r):
    
         s= r.iloc[r.day_90+istart:istop] # get the day columns starting with as fixed in 'day_90'
         #--- Manipulate 's' to make possible using 'groupby' for getting different sequences:
         crit=s.eq(0)
         s= pd.Series(np.where(crit, np.nan, np.arange(len(s))),index=s.index)
         if np.isnan(s.iloc[0]):
           s.iloc[0]= 1
         s= s.ffill()
         s[~crit]= np.nan
         #---
         # get the sequences and their sizes:
         ssiz= s.groupby(s).size()
         return ssiz.iloc[0] if len(ssiz) and ssiz.iloc[0]>lseq else np.nan
    #---
    
    df["zseq"]= df.agg(zz,axis=1)
    
    ID  0   1   2   3  4  5   6   7  8  day_90  zseq
    0  A  2  21   0  18  3  0   0   0  2       4   3.0
    1  B  0  20  12   2  0  8  14  23  0       5   NaN
    2  C  0  38  19   3  1  3   3   7  1       1   NaN
    3  D  3   0   0   1  0  0   0   0  0       0   NaN
    
    0 讨论(0)
提交回复
热议问题