Extend numpy mask by n cells to the right for each bad value, efficiently

[愿得一人] 2021-02-15 15:39

Let's say I have a length 30 array with 4 bad values in it. I want to create a mask for those bad values, but since I will be using rolling window functions, I'd also like a f

7 answers
  • 2021-02-15 15:42

    A few years late, but I've come up with a fully vectorized solution that requires no loops or copies (besides the mask itself). This solution is potentially a bit dangerous because it uses numpy.lib.stride_tricks.as_strided. It's also not as fast as @swenzel's solution.

    The idea is to take the mask and create a 2D view of it, where the second dimension is just the elements that follow the current element. Then you can just set an entire column to True if the head is True. Since you are dealing with a view, setting a column will actually set the following elements in the mask.

    Start with the data:

    import numpy as np
    a = np.array([4, 0, 8, 5, 10, 9, np.nan, 1, 4, 9, 9, np.nan, np.nan, 9,\
                  9, 8, 0, 3, 7, 9, 2, 6, 7, 2, 9, 4, 1, 1, np.nan, 10])
    n = 3
    

    Now, we will make the mask a.size + n elements long, so that you don't have to process the last n elements manually:

    mask = np.empty(a.size + n, dtype=bool)  # np.bool was removed from recent NumPy releases
    np.isnan(a, out=mask[:a.size])
    mask[a.size:] = False
    

    Now the cool part:

    view = np.lib.stride_tricks.as_strided(mask, shape=(n + 1, a.size),
                                           strides=mask.strides * 2)
    

    That last part is crucial. mask.strides is a tuple like (1,), since a bool usually occupies one byte. Repeating the tuple with mask.strides * 2 gives (1, 1), which means that you take a 1-byte step to move one element along either dimension.
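
To make the layout of the strided view concrete, here is a minimal sketch on a tiny hand-made mask (the array contents and the names `size` and `view` are illustrative assumptions, not part of the answer). It shows that row k of the view is simply the mask shifted left by k cells, and that the rows share memory with the mask.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

n = 2
# Illustrative mask: 4 real cells followed by n padding cells.
mask = np.array([False, True, False, False, False, False])
size = mask.size - n

# Row k of the view is the mask shifted left by k cells: view[k, j] is mask[j + k].
view = as_strided(mask, shape=(n + 1, size), strides=mask.strides * 2)
print(view.astype(int))
# [[0 1 0 0]
#  [1 0 0 0]
#  [0 0 0 0]]

# The rows share memory with mask, so writing view[k, j] writes mask[j + k].
view[2, 1] = True
print(mask.astype(int))  # [0 1 0 1 0 0]
```

Because every row aliases the same buffer, a wrong shape or stride can read or write outside the array, which is why as_strided deserves caution.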

    Now all you need to do is expand the mask:

    view[1:, view[0]] = True
    

    That's it. Now mask has what you want. Keep in mind that this only works because the assignment index precedes the last changed value. You could not get away with view[1:] |= view[0].

    For benchmarking purposes, it appears that the definition of n has changed from the question (each bad value now yields n Trues in total), so the following function takes that into account:

    def madphysicist0(a, n):
        m = np.empty(a.size + n - 1, dtype=bool)
        np.isnan(a, out=m[:a.size])
        m[a.size:] = False
    
        v = np.lib.stride_tricks.as_strided(m, shape=(n, a.size), strides=m.strides * 2)
        v[1:, v[0]] = True
        return v[0]
    

    V2

    Taking a leaf out of the existing fastest answer, we only need to copy log2(n) rows, not n rows:

    def madphysicist1(a, n):
        m = np.empty(a.size + n - 1, dtype=bool)
        np.isnan(a, out=m[:a.size])
        m[a.size:] = False
    
        v = np.lib.stride_tricks.as_strided(m, shape=(n, a.size), strides=m.strides * 2)
    
        # Every True currently starts a run of `run` Trues; shifting by `run` doubles it.
        run = 1
        while 2 * run <= n:
            v[run, v[0]] = True
            run *= 2
        # Shift by the remainder (smaller than `run`, so no gaps) to reach exactly n.
        if run < n:
            v[n - run, v[0]] = True
        return v[0]
    

    Since this doubles the size of the mask at every iteration, it works a bit faster than Fibonacci: https://math.stackexchange.com/q/894743/295281
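
The doubling idea does not depend on the strided view. As a hedged illustration, the sketch below does the same thing with ordinary slices and explicit copies (the function name `extend_by_doubling` is made up for this example, and n follows the benchmark convention of n Trues in total per bad value).

```python
import numpy as np

def extend_by_doubling(mask, n):
    """Return a copy of mask in which every True starts a run of n Trues."""
    out = mask.copy()
    run = 1  # every True in `out` currently starts a run of at least `run` Trues
    while 2 * run <= n:
        # OR the mask with itself shifted right by `run`: runs double in length.
        out[run:] |= out[:-run].copy()
        run *= 2
    if run < n:
        # One last shift by the remainder, which is smaller than `run`.
        shift = n - run
        out[shift:] |= out[:-shift].copy()
    return out

m = np.array([False, True, False, False, False, False, False, False])
print(extend_by_doubling(m, 5).astype(int))  # [0 1 1 1 1 1 0 0]
```

For n = 5 this performs shifts of 1, 2 and then 1, so the number of passes over the array grows with log2(n) rather than with n.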

  • 2021-02-15 15:47

    This could also be considered a morphological dilation problem, here using scipy.ndimage.binary_dilation:

    import numpy as np
    from scipy import ndimage

    def dilation(a, n):
        m = np.isnan(a)
        s = np.full(n, True, bool)   # structuring element: n Trues in a row
        return ndimage.binary_dilation(m, structure=s, origin=-(n//2))
    

    Note on origin: this argument ensures the structure (I would call it a kernel) starts off a bit to the left of the input (your mask m). Normally the value at out[i] would be the dilation with the center of structure (which would be structure[n//2]) at in[i], but you want the structure[0] to be at in[i].
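
To illustrate what "the structure[0] at in[i]" means without relying on SciPy, here is a NumPy-only sketch that swaps in a different technique, a full convolution with a kernel of n ones, truncated to the input length. The name `dilate_right` is an assumption made for this example; it is not the answer's method.

```python
import numpy as np

def dilate_right(m, n):
    """out[i] is True if any of m[i - n + 1 : i + 1] is True."""
    # 'full' convolution: counts[i] sums m[i - n + 1 .. i], treating
    # positions before the start of the array as zero.
    counts = np.convolve(m.astype(int), np.ones(n, dtype=int))
    return counts[:m.size] > 0

m = np.array([False, True, False, False, False, False])
print(dilate_right(m, 3).astype(int))  # [0 1 1 1 0 0]
```

The kernel is anchored at its first element, so each bad value spreads only to the right, which is the effect the origin argument is used for above.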

    You can also do this with a kernel that is padded on the left with Falses, which is what would be required if you used the binary_dilation from scikit-image:

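    import skimage.morphology

    def dilation_skimage(a, n):
        m = np.isnan(a)
        s = np.zeros(2*n - n%2, bool)
        s[-n:] = True
        # Passed positionally: the keyword was `selem` in older scikit-image
        # releases and is `footprint` in newer ones.
        return skimage.morphology.binary_dilation(m, s)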
    

    Timing doesn't seem to change too much between the two:

    dilation_scipy
    small:    10 loops, best of 3: 47.9 ms per loop
    large: 10000 loops, best of 3: 88.9 µs per loop
    
    dilation_skimage
    small:    10 loops, best of 3: 47.0 ms per loop
    large: 10000 loops, best of 3: 91.1 µs per loop
    
  • 2021-02-15 15:54

    OP here with the benchmark results. I have included my own ("op") which I had started out with, which loops over the bad indices and adds 1...n to them then takes the uniques to find the mask indices. You can see it in the code below with all the other responses.
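
For readers who want to see the approach described in words above (shift the bad indices, then take the unique values), here is a minimal sketch. It is a reconstruction under assumptions, not the OP's original code: the function name is invented, and n follows the benchmark convention in which each bad value yields n Trues in total.

```python
import numpy as np

def mask_from_unique_indices(a, n):
    bad = np.flatnonzero(np.isnan(a))
    # Shift the bad positions by 0 .. n-1 and merge the duplicates.
    idx = np.unique(np.concatenate([bad + k for k in range(n)]))
    idx = idx[idx < a.size]          # drop positions past the end of the array
    mask = np.zeros(a.size, dtype=bool)
    mask[idx] = True
    return mask

a = np.array([1.0, np.nan, 3.0, 4.0, 5.0, np.nan])
print(mask_from_unique_indices(a, 3).astype(int))  # [0 1 1 1 0 1]
```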

    Anyway, here are the results. The facets are array size along x (10 through 10^7) and window size along y (5, 50, 500, 5000). Within each facet the results are shown by coder, on a log10 scale because we're talking microseconds through minutes.

    @swenzel appears to be the winner with his second answer, displacing @moarningsun's first answer (moarningsun's second answer was crashing the machine through massive memory use, but that's probably because it was not designed for large n or non-sparse a).

    The chart does not do justice to the fastest of these contributions because of the (necessary) log scale. They're dozens or even hundreds of times faster than even decent looping solutions. swenzel1 is 1000x faster than op in the largest case, and op is already making use of numpy.

    Please note that I have used a numpy version compiled against the optimised Intel MKL libraries which make full use of the AVX instructions present since 2012. In some vector use cases this will increase an i7/Xeon speed by a factor of 5. Some of the contributions may be benefitting more than others.

    Here is the full code to run all the submitted answers so far, including my own. The function allagree() makes sure that the results are correct, while timeall() gives you a long-form pandas DataFrame with all the results in seconds.

    You can rerun it fairly easily with new code, or change my assumptions. Please keep in mind I did not take into account other factors such as memory usage. Also, I resorted to R ggplot2 for the graphic as I don't know seaborn/matplotlib well enough to make it do what I want.

    For completeness, all the results agree:

    In [4]: allagree(n = 7, asize = 777)
    Out[4]:
                 AGML0 AGML1 askewchan0 askewchan1 askewchan2 moarningsun0  \
    AGML0         True  True       True       True       True         True
    AGML1         True  True       True       True       True         True
    askewchan0    True  True       True       True       True         True
    askewchan1    True  True       True       True       True         True
    askewchan2    True  True       True       True       True         True
    moarningsun0  True  True       True       True       True         True
    swenzel0      True  True       True       True       True         True
    swenzel1      True  True       True       True       True         True
    op            True  True       True       True       True         True
    
                 swenzel0 swenzel1    op
    AGML0            True     True  True
    AGML1            True     True  True
    askewchan0       True     True  True
    askewchan1       True     True  True
    askewchan2       True     True  True
    moarningsun0     True     True  True
    swenzel0         True     True  True
    swenzel1         True     True  True
    op               True     True  True
    

    Thank you to all who submitted!

    Code for the graphic, after exporting the output of timeall() with DataFrame.to_csv and loading it with read.csv in R:

    ww <- read.csv("ww.csv")    
    ggplot(ww, aes(x=coder, y=value, col = coder)) + geom_point(size = 3) + scale_y_continuous(trans="log10")+ facet_grid(nsize ~ asize) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ggtitle("Fastest by coder") + ylab("time (seconds)")
    

    Code for the test:

    # test Stack Overflow 32706135 nan shift routines
    
    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from timeit import Timer
    from scipy import ndimage
    from skimage import morphology
    import itertools
    import pdb
    np.random.seed(8472)
    
    
    def AGML0(a, n):                               # loop itertools
        maskleft = np.where(np.isnan(a))[0]
        maskright = maskleft + n
        mask = np.zeros(len(a),dtype=bool)
        for l,r in zip(maskleft,maskright):        # itertools.izip exists only in Python 2
            mask[l:r] = True
        return mask
    
    
    def AGML1(a, n):                               # loop n
        nn = n - 1
        maskleft = np.where(np.isnan(a))[0]
        ghost_mask = np.zeros(len(a)+nn,dtype=bool)
        for i in range(0, nn+1):
            thismask = maskleft + i
            ghost_mask[thismask] = True
        mask = ghost_mask[:len(ghost_mask)-nn]
        return mask
    
    
    def askewchan0(a, n):
        m = np.isnan(a)
        i = np.arange(1, len(m)+1)
        ind = np.column_stack([i-n, i]) # may be a faster way to generate this
        ind.clip(0, len(m)-1, out=ind)
        return np.bitwise_or.reduceat(m, ind.ravel())[::2]
    
    
    def askewchan1(a, n):
        m = np.isnan(a)
        s = np.full(n, True, bool)
        return ndimage.binary_dilation(m, structure=s, origin=-(n//2))
    
    
    def askewchan2(a, n):
        m = np.isnan(a)
        s = np.zeros(2*n - n%2, bool)
        s[-n:] = True
        return morphology.binary_dilation(m, s)    # keyword is `selem` or `footprint` depending on version
    
    
    def moarningsun0(a, n):
        mask = np.isnan(a)
        cs = np.cumsum(mask)
        cs[n:] -= cs[:-n].copy()
        return cs > 0
    
    
    def moarningsun1(a, n):
        mask = np.isnan(a)
        idx = np.flatnonzero(mask)
        expanded_idx = idx[:,None] + np.arange(1, n)
        np.put(mask, expanded_idx, True, 'clip')
        return mask
    
    
    def swenzel0(a, n):
        m = np.isnan(a)
        k = m.copy()
        for i in range(1, n):
            k[i:] |= m[:-i]
        return k
    
    
    def swenzel1(a, n=4):
        m = np.isnan(a)
        k = m.copy()
    
        # lenM and lenK say for each mask how many
        # subsequent Trues there are at least
        lenM, lenK = 1, 1
    
        # we run until a combination of both masks will give us n or more
        # subsequent Trues
        while lenM+lenK < n:
            # append what we have in k to the end of what we have in m
            m[lenM:] |= k[:-lenM]
    
            # swap so that m is again the small one
            m, k = k, m
    
            # update the lengths
            lenM, lenK = lenK, lenM+lenK
    
        # see how much m has to be shifted in order to append the missing Trues
        k[n-lenM:] |= m[:-n+lenM]
        return k
    
    
    def op(a, n):
        m = np.isnan(a)
        for x in range(1, n):
            m = np.logical_or(m, np.r_[False, m][:-1])
        return m
    
    
    # all the functions in a list. NB these are the actual functions, not their names
    funcs = [AGML0, AGML1, askewchan0, askewchan1, askewchan2, moarningsun0, swenzel0, swenzel1, op]
    
    def allagree(fns = funcs, n = 10, asize = 100):
        """ make sure result is the same from all functions """
        fnames = [f.__name__ for f in fns]
        a = np.random.rand(asize)
        a[np.random.randint(0, asize, int(asize / 10))] = np.nan
        results = dict([(f.__name__, f(a, n)) for f in fns])
        isgood = [[np.array_equal(results[f1], results[f2]) for f1 in fnames] for f2 in fnames]
        pdgood = pd.DataFrame(isgood, columns = fnames, index = fnames)
        if not all([all(x) for x in isgood]):
            print("not all results identical")
            pdb.set_trace()
        return pdgood
    
    
    def timeone(f):
        """ time one of the functions across the full range of a and n """
        print("Timing", f.__name__)
        Ns = np.array([10**x for x in range(0, 4)]) * 5 # 5 to 5000 window size
        As = [np.random.rand(10 ** x) for x in range(1, 8)] # up to 10 million data points
        for i in range(len(As)): # 10% of points are always bad
            As[i][np.random.randint(0, len(As[i]), len(As[i]) // 10)] = np.nan
        results = np.array([[Timer(lambda: f(a, n)).timeit(number = 1) if n < len(a) \
                            else np.nan for n in Ns] for a in As])
        pdresults = pd.DataFrame(results, index = [len(x) for x in As], columns = Ns)
        return pdresults
    
    
    def timeall(fns = funcs):
        """ run timeone for all known funcs """
        testd = dict([(x.__name__, timeone(x)) for x in fns])
        testdf = pd.concat(list(testd.values()), axis = 0, keys = list(testd.keys()))
        testdf.index.names = ["coder", "asize"]
        testdf.columns.names = ["nsize"]
        testdf.reset_index(inplace = True)
        testdf = pd.melt(testdf, id_vars = ["coder", "asize"])
        return testdf
    
  • 2021-02-15 15:54

    You can use np.ufunc.reduceat with np.bitwise_or:

    import numpy as np
    a = np.array([4, 0, 8, 5, 10, 9, np.nan, 1, 4, 9, 9, np.nan, np.nan, 9,
                  9, 8, 0, 3, 7, 9, 2, 6, 7, 2, 9, 4, 1, 1, np.nan, 10])
    m = np.isnan(a)
    n = 4
    i = np.arange(1, len(m)+1)
    ind = np.column_stack([i-n, i]) # may be a faster way to generate this
    ind.clip(0, len(m)-1, out=ind)
    
    reduced = np.bitwise_or.reduceat(m, ind.ravel())[::2]
    

    On your data:

    print(np.column_stack([m, reduced]))
    [[False False]
     [False False]
     [False False]
     [False False]
     [False False]
     [False False]
     [ True  True]
     [False  True]
     [False  True]
     [False  True]
     [False False]
     [ True  True]
     [ True  True]
     [False  True]
     [False  True]
     [False  True]
     [False False]
     [False False]
     [False False]
     [False False]
     [False False]
     [False False]
     [False False]
     [False False]
     [False False]
     [False False]
     [False False]
     [False False]
     [ True  True]
     [False  True]]
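
One edge case seems worth checking. Because the indices are clipped to len(m)-1, the window for the final position stops one cell short, so a bad value in the very last cell appears to be missed unless one of the preceding n-1 cells is also bad. The sketch below is a possible variant, not the answer's code: it pads the mask with one trailing False so the indices can be clipped to len(m) instead (the name `reduceat_padded` is made up here).

```python
import numpy as np

def reduceat_padded(a, n):
    m = np.isnan(a)
    padded = np.append(m, False)         # one extra cell so index len(m) is valid
    i = np.arange(1, m.size + 1)
    ind = np.column_stack([i - n, i])    # window for position i-1 is [i-n, i)
    ind.clip(0, m.size, out=ind)
    # Even-numbered results are the OR over each window; odd ones are discarded.
    return np.bitwise_or.reduceat(padded, ind.ravel())[::2]

a = np.array([1.0, 2.0, np.nan])
print(reduceat_padded(a, 2))  # [False False  True]
```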
    
  • 2021-02-15 15:56

    You can use the same cumsum trick as you would for a running average filter:

    def cumsum_trick(a, n):
        mask = np.isnan(a)
        cs = np.cumsum(mask)
        cs[n:] -= cs[:-n].copy()
        return cs > 0
    

    Unfortunately the additional .copy() is needed, because of internal buffering and the order in which the operations are applied. It is possible to persuade numpy to apply the subtraction in reverse, but for that to work the cs array must have a negative stride:

    def cumsum_trick_nocopy(a, n):
        mask = np.isnan(a)
        cs = np.cumsum(mask, out=np.empty_like(a, int)[::-1])
        cs[n:] -= cs[:-n]
        out = cs > 0
        return out
    

    But this seems fragile and isn't really that much faster anyway.
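
The reasoning behind the cumsum trick can be spelled out on a small array. This is only an illustrative sketch (the data and the name `counts` are assumptions); it writes the differences into a separate array, so the question of in-place evaluation order does not arise.

```python
import numpy as np

a = np.array([1.0, np.nan, 3.0, 4.0, 5.0, np.nan, np.nan, 8.0])
n = 2
mask = np.isnan(a)

cs = np.cumsum(mask)             # cs[i] = number of bad values in a[:i + 1]
counts = cs.copy()
counts[n:] = cs[n:] - cs[:-n]    # bad values inside the window a[i - n + 1 : i + 1]

print(cs)                        # [0 1 1 1 1 2 3 3]
print(counts)                    # [0 1 1 0 0 1 2 1]
print((counts > 0).astype(int))  # [0 1 1 0 0 1 1 1]
```

A position is masked exactly when its window of n cells contains at least one bad value, which is what `counts > 0` tests.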

    I wonder if there's a compiled signal processing function somewhere that does this exact operation.


    For sparse initial masks and small n this one is also pretty fast:

    def index_expansion(a, n):
        mask = np.isnan(a)
        idx = np.flatnonzero(mask)
        expanded_idx = idx[:,None] + np.arange(1, n)
        np.put(mask, expanded_idx, True, 'clip')
        return mask
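
As a small illustration of the index expansion, the sketch below prints the broadcast index array for two bad positions (the values are made up for the example). It also shows why mode='clip' is harmless here: an index only overflows when the last cell already lies inside the window, so clipping to the last cell sets something that should be True anyway.

```python
import numpy as np

idx = np.array([2, 6])                  # illustrative positions of the bad values
n = 3

# Broadcasting a column of positions against offsets 1 .. n-1.
expanded_idx = idx[:, None] + np.arange(1, n)
print(expanded_idx)
# [[3 4]
#  [7 8]]

mask = np.zeros(8, dtype=bool)
mask[idx] = True
np.put(mask, expanded_idx, True, mode='clip')   # index 8 is clipped to 7
print(mask.astype(int))  # [0 0 1 1 1 0 1 1]
```

The temporary index array has one row per bad value and n-1 columns, which is consistent with the remark that this works best for sparse masks and small n.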
    
  • 2021-02-15 16:02

    Something like this?

    maskleft = np.where(np.isnan(a))[0]
    maskright = maskleft + n
    mask = np.zeros(len(a),dtype=bool)
    for l,r in zip(maskleft,maskright):   # itertools.izip on Python 2
       mask[l:r] = True
    

    Or, since n is small, it might be better to loop over it instead:

    maskleft = np.where(np.isnan(a))[0]
    mask = np.zeros(len(a),dtype=bool)
    for i in range(0,n):
      thismask = maskleft+i
      mask[thismask] = True
    

    Except for the loop over n, the above is fully vectorized. The iterations of that loop are also independent of each other, so you may be able to get up to a factor-n speedup using e.g. multiprocessing or Cython, if you're willing to go to the trouble.

    Edit: per @askewchan, solution 2 can potentially cause out-of-range errors. It also has indexing problems in the range(0,n). Possible correction:

    maskleft = np.where(np.isnan(a))[0]
    ghost_mask = np.zeros(len(a)+n,dtype=bool)
    for i in range(0, n+1):
        thismask = maskleft + i
        ghost_mask[thismask] = True
    mask = ghost_mask[:len(ghost_mask)-n]
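
To show the out-of-range problem concretely, here is a minimal sketch with a bad value in the last cell (the data are illustrative). As in the correction above, n here counts the extra cells to the right of each bad value.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, np.nan])
n = 2
bad = np.where(np.isnan(a))[0]

# Without padding, the shifted indices run past the end of the mask.
mask = np.zeros(len(a), dtype=bool)
try:
    for i in range(n + 1):
        mask[bad + i] = True         # bad + 1 is index 4, which does not exist
except IndexError as exc:
    print("unpadded mask:", exc)

# With n ghost cells at the end, every shifted index is valid.
ghost_mask = np.zeros(len(a) + n, dtype=bool)
for i in range(n + 1):
    ghost_mask[bad + i] = True
mask = ghost_mask[:len(a)]
print(mask.astype(int))  # [0 0 0 1]
```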
    