Finding data gaps with bit masking

问题

I'm faced with a problem of finding discontinuities (gaps) of a given length in a sequence of numbers. So, for example, given [1,2,3,7,8,9,10] and a gap of length=3, I'll find [4,5,6]. If the gap is length=4, I'll find nothing. The real sequence is, of course, much longer. I've seen this problem in quite a few posts, and it had various applications and possible implementations.

One way I thought might work and should be relatively quick is to represent the complete set as a bit array containing 1 for available number and 0 for missing - so the above will look like [1,1,1,0,0,0,1,1,1,1]. Then possibly run a window function that'll XOR mask an array of the given length with the complete set until all locations result in 1. This will require a single pass over the whole sequence in roughly ~O(n), plus the cost of masking in each run.

Here's what I managed to come up with:

def find_gap(array, start=0, length=10):
    """
    array:  assumed to be of length MAX_NUMBER and contain 0 or 1 
            if the value is actually present
    start:  indicates what value to start looking from
    length: what the length the gap should be
    """

    # create the bitmask to check against
    mask = ''.join( [1] * length )

    # convert the input 0/1 mapping to bit string
    # e.g - [1,0,1,0] -> '1010'
    bits =''.join( [ str(val) for val in array ] )

    for i in xrange(start, len(bits) - length):

        # find where the next gap begins
        if bits[i] != '0': continue

        # gap was found, extract segment of size 'length', compare w/ mask
        if (i + length < len(bits)):
            segment = bits[i:i+length]

            # use XOR between binary masks
            result  = bin( int(mask, 2) ^ int(segment, 2) )

            # if mask == result in base 2, gap found
            if result == ("0b%s" % mask): return i

    # if we got here, no gap exists
    return -1

This is fairly quick for ~100k (< 1 sec). I'd appreciate tips on how to make this faster / more efficient for larger sets. thanks!

回答1:

Find the differences between adjacent numbers, and then look for a difference that's large enough. We find the differences by constructing two lists - all the numbers but the first, and all the numbers but the last - and subtracting them pairwise. We can use zip to pair the values up.

def find_gaps(numbers, gap_size):
    adjacent_differences = [(y - x) for (x, y) in zip(numbers[:-1], numbers[1:])]
    # If adjacent_differences[i] > gap_size, there is a gap of that size between
    # numbers[i] and numbers[i+1]. We return all such indexes in a list - so if
    # the result is [] (empty list), there are no gaps.
    return [i for (i, x) in enumerate(adjacent_differences) if x > gap_size]

(Also, please learn some Python idioms. We prefer direct iteration, and we have a real boolean type.)

回答2:

You could use XOR and shift and it does run in roughly O(n) time.

However, in practice, building an index (hash list of all gaps greater then some minimum length) might be a better approach.

Assuming that you start with a sequence of these integers (rather than a bitmask) then you build an index by simply walking over the sequence; any time you find a gap greater than your threshold you add that gap size to your dictionary (instantiate it as an empty list if necessary, and then append the offset in the sequence.

At the end you have a list of every gap (greater than your desired threshold) in your sequence.

One nice thing about this approach is that you should be able to maintain this index as you modify the base list. So the O(n*log(n)) initial time spent building the index is amortized by O(log(n)) cost for subsequent queries and updates to the indexes.

Here's a very crude function to build the gap_index():

def gap_idx(s, thresh=2):
    ret = dict()
    lw = s[0]  # initial low val.
    for z,i in enumerate(s[1:]):
        if i - lw < thresh:
            lw = i
            continue
        key = i - lw
        if key not in ret:
            ret[key] = list()
        ret[key].append(z)
        lw = i
    return ret

A class to maintain both a data set and the index might best be built around the built-in 'bisect' module and its insort() function.

回答3:

Pretty much what aix did ... but getting only the gaps of the desired length:

def findGaps(mylist, gap_length, start_idx=0):
    gap_starts = []
    for idx in range(start_idx, len(mylist) - 1):
        if mylist[idx+1] - mylist[idx] == gap_length + 1:
            gap_starts.append(mylist[idx] + 1)

    return gap_starts

EDIT: Adjusted to the OP's wishes.

回答4:

These provide a single walk of your input list.

List of gap values for given length:

from itertools import tee, izip
def gapsofsize(iterable, length):
    a, b = tee(iterable)
    next(b, None)
    return ( p for x, y in izip(a, b) if y-x == length+1 for p in xrange(x+1,y) )

print list(gapsofsize([1,2,5,8,9], 2))

[3, 4, 6, 7]

All gap values:

def gaps(iterable):
    a, b = tee(iterable)
    next(b, None)
    return ( p for x, y in izip(a, b) if y-x > 1 for p in xrange(x+1,y) )

print list(gaps([1,2,4,5,8,9,14]))

[3, 6, 7, 10, 11, 12, 13]

List of gaps as vectors:

def gapsizes(iterable):
    a, b = tee(iterable)
    next(b, None)
    return ( (x+1, y-x-1) for x, y in izip(a, b) if y-x > 1 )

print list(gapsizes([1,2,4,5,8,9,14]))

[(3, 1), (6, 2), (10, 4)]

Note that these are generators and consume very little memory. I would love to know how these perform on your test dataset.

回答5:

If it's efficiency you're after, I'd do something along the following lines (where x is the list of sequence numbers):

for i in range(1, len(x)):
  if x[i] - x[i - 1] == length + 1:
    print list(range(x[i - 1] + 1, x[i]))

来源：https://stackoverflow.com/questions/4375310/finding-data-gaps-with-bit-masking

标签

python

arrays

performance

bitmask