Searching a sequence in a NumPy array

前端 未结 2 1985
[愿得一人]
[愿得一人] 2020-12-06 08:15

Let\'s say I have the following array :

 array([2, 0, 0, 1, 0, 1, 0, 0])

How do I get the indices where I have occurrence of sequence of va

相关标签:
2条回答
  • 2020-12-06 08:42

    I find that the most succinct, intuitive and general way to do this is using regular expressions.

    import re
    import numpy as np
    
    # Set the threshold for string printing to infinite
    np.set_printoptions(threshold=np.inf)
    
    # Remove spaces and linebreaks that would come through when printing your vector
    yourarray_string = re.sub('\n|\s','',np.array_str( yourarray ))[1:-1]
    
    # The next line is the most important, set the arguments in the braces
    # such that the first argument is the shortest sequence you want
    # and the second argument is the longest (using empty as infinite length)
    
    r = re.compile(r"[0]{1,}") 
    zero_starts = [m.start() for m in r.finditer( yourarray_string )]
    zero_ends = [m.end() for m in r.finditer( yourarray_string )]
    
    0 讨论(0)
  • 2020-12-06 08:48

    Well, this is basically a template-matching problem that comes up in image-processing a lot. Listed in this post are two approaches: Pure NumPy based and OpenCV (cv2) based.

    Approach #1: With NumPy, one can create a 2D array of sliding indices across the entire length of the input array. Thus, each row would be a sliding window of elements. Next, match up each row with the input sequence, which will bring in broadcasting for a vectorized solution. We look for all True rows indicating those are the ones that are the perfect matches and as such would be the starting indices of the matches. Finally, using those indices, create a range of indices extending up to the length of the sequence, to give us the desired output. The implementation would be -

    def search_sequence_numpy(arr,seq):
        """ Find sequence in an array using NumPy only.
    
        Parameters
        ----------    
        arr    : input 1D array
        seq    : input 1D array
    
        Output
        ------    
        Output : 1D Array of indices in the input array that satisfy the 
        matching of input sequence in the input array.
        In case of no match, an empty list is returned.
        """
    
        # Store sizes of input array and sequence
        Na, Nseq = arr.size, seq.size
    
        # Range of sequence
        r_seq = np.arange(Nseq)
    
        # Create a 2D array of sliding indices across the entire length of input array.
        # Match up with the input sequence & get the matching starting indices.
        M = (arr[np.arange(Na-Nseq+1)[:,None] + r_seq] == seq).all(1)
    
        # Get the range of those indices as final output
        if M.any() >0:
            return np.where(np.convolve(M,np.ones((Nseq),dtype=int))>0)[0]
        else:
            return []         # No match found
    

    Approach #2: With OpenCV (cv2), we have a built-in function for template-matching : cv2.matchTemplate. Using this, we would have the starting matching indices. Rest of the steps would be same as for the previous approach. Here's the implementation with cv2 :

    from cv2 import matchTemplate as cv2m
    
    def search_sequence_cv2(arr,seq):
        """ Find sequence in an array using cv2.
        """
    
        # Run a template match with input sequence as the template across
        # the entire length of the input array and get scores.
        S = cv2m(arr.astype('uint8'),seq.astype('uint8'),cv2.TM_SQDIFF)
    
        # Now, with floating point array cases, the matching scores might not be 
        # exactly zeros, but would be very small numbers as compared to others.
        # So, for that use a very small to be used to threshold the scorees 
        # against and decide for matches.
        thresh = 1e-5 # Would depend on elements in seq. So, be careful setting this.
    
        # Find the matching indices
        idx = np.where(S.ravel() < thresh)[0]
    
        # Get the range of those indices as final output
        if len(idx)>0:
            return np.unique((idx[:,None] + np.arange(seq.size)).ravel())
        else:
            return []         # No match found
    

    Sample run

    In [512]: arr = np.array([2, 0, 0, 0, 0, 1, 0, 1, 0, 0])
    
    In [513]: seq = np.array([0,0])
    
    In [514]: search_sequence_numpy(arr,seq)
    Out[514]: array([1, 2, 3, 4, 8, 9])
    
    In [515]: search_sequence_cv2(arr,seq)
    Out[515]: array([1, 2, 3, 4, 8, 9])
    

    Runtime test

    In [477]: arr = np.random.randint(0,9,(100000))
         ...: seq = np.array([3,6,8,4])
         ...: 
    
    In [478]: np.allclose(search_sequence_numpy(arr,seq),search_sequence_cv2(arr,seq))
    Out[478]: True
    
    In [479]: %timeit search_sequence_numpy(arr,seq)
    100 loops, best of 3: 11.8 ms per loop
    
    In [480]: %timeit search_sequence_cv2(arr,seq)
    10 loops, best of 3: 20.6 ms per loop
    

    Seems like the Pure NumPy based one is the safest and fastest!

    0 讨论(0)
提交回复
热议问题