Let\'s say I have the following array :
array([2, 0, 0, 1, 0, 1, 0, 0])
How do I get the indices where I have occurrence of sequence of va
I find that the most succinct, intuitive and general way to do this is using regular expressions.
import re
import numpy as np
# Set the threshold for string printing to infinite
np.set_printoptions(threshold=np.inf)
# Remove spaces and linebreaks that would come through when printing your vector
yourarray_string = re.sub('\n|\s','',np.array_str( yourarray ))[1:-1]
# The next line is the most important, set the arguments in the braces
# such that the first argument is the shortest sequence you want
# and the second argument is the longest (using empty as infinite length)
r = re.compile(r"[0]{1,}")
zero_starts = [m.start() for m in r.finditer( yourarray_string )]
zero_ends = [m.end() for m in r.finditer( yourarray_string )]
Well, this is basically a template-matching problem that comes up in image-processing a lot. Listed in this post are two approaches: Pure NumPy based and OpenCV (cv2) based.
Approach #1: With NumPy, one can create a 2D
array of sliding indices across the entire length of the input array. Thus, each row would be a sliding window of elements. Next, match up each row with the input sequence, which will bring in broadcasting for a vectorized solution. We look for all True
rows indicating those are the ones that are the perfect matches and as such would be the starting indices of the matches. Finally, using those indices, create a range of indices extending up to the length of the sequence, to give us the desired output. The implementation would be -
def search_sequence_numpy(arr,seq):
""" Find sequence in an array using NumPy only.
Parameters
----------
arr : input 1D array
seq : input 1D array
Output
------
Output : 1D Array of indices in the input array that satisfy the
matching of input sequence in the input array.
In case of no match, an empty list is returned.
"""
# Store sizes of input array and sequence
Na, Nseq = arr.size, seq.size
# Range of sequence
r_seq = np.arange(Nseq)
# Create a 2D array of sliding indices across the entire length of input array.
# Match up with the input sequence & get the matching starting indices.
M = (arr[np.arange(Na-Nseq+1)[:,None] + r_seq] == seq).all(1)
# Get the range of those indices as final output
if M.any() >0:
return np.where(np.convolve(M,np.ones((Nseq),dtype=int))>0)[0]
else:
return [] # No match found
Approach #2: With OpenCV (cv2), we have a built-in function for template-matching
: cv2.matchTemplate. Using this, we would have the starting matching indices. Rest of the steps would be same as for the previous approach. Here's the implementation with cv2
:
from cv2 import matchTemplate as cv2m
def search_sequence_cv2(arr,seq):
""" Find sequence in an array using cv2.
"""
# Run a template match with input sequence as the template across
# the entire length of the input array and get scores.
S = cv2m(arr.astype('uint8'),seq.astype('uint8'),cv2.TM_SQDIFF)
# Now, with floating point array cases, the matching scores might not be
# exactly zeros, but would be very small numbers as compared to others.
# So, for that use a very small to be used to threshold the scorees
# against and decide for matches.
thresh = 1e-5 # Would depend on elements in seq. So, be careful setting this.
# Find the matching indices
idx = np.where(S.ravel() < thresh)[0]
# Get the range of those indices as final output
if len(idx)>0:
return np.unique((idx[:,None] + np.arange(seq.size)).ravel())
else:
return [] # No match found
Sample run
In [512]: arr = np.array([2, 0, 0, 0, 0, 1, 0, 1, 0, 0])
In [513]: seq = np.array([0,0])
In [514]: search_sequence_numpy(arr,seq)
Out[514]: array([1, 2, 3, 4, 8, 9])
In [515]: search_sequence_cv2(arr,seq)
Out[515]: array([1, 2, 3, 4, 8, 9])
Runtime test
In [477]: arr = np.random.randint(0,9,(100000))
...: seq = np.array([3,6,8,4])
...:
In [478]: np.allclose(search_sequence_numpy(arr,seq),search_sequence_cv2(arr,seq))
Out[478]: True
In [479]: %timeit search_sequence_numpy(arr,seq)
100 loops, best of 3: 11.8 ms per loop
In [480]: %timeit search_sequence_cv2(arr,seq)
10 loops, best of 3: 20.6 ms per loop
Seems like the Pure NumPy based one is the safest and fastest!