Finding data gaps with bit masking

天涯浪子 提交于 2019-12-04 17:41:11

Find the differences between adjacent numbers, and then look for a difference that's large enough. We find the differences by constructing two lists - all the numbers but the first, and all the numbers but the last - and subtracting them pairwise. We can use zip to pair the values up.

def find_gaps(numbers, gap_size):
    adjacent_differences = [(y - x) for (x, y) in zip(numbers[:-1], numbers[1:])]
    # If adjacent_differences[i] > gap_size, there is a gap of that size between
    # numbers[i] and numbers[i+1]. We return all such indexes in a list - so if
    # the result is [] (empty list), there are no gaps.
    return [i for (i, x) in enumerate(adjacent_differences) if x > gap_size]

(Also, please learn some Python idioms. We prefer direct iteration, and we have a real boolean type.)

You could use XOR and shift and it does run in roughly O(n) time.

However, in practice, building an index (hash list of all gaps greater then some minimum length) might be a better approach.

Assuming that you start with a sequence of these integers (rather than a bitmask) then you build an index by simply walking over the sequence; any time you find a gap greater than your threshold you add that gap size to your dictionary (instantiate it as an empty list if necessary, and then append the offset in the sequence.

At the end you have a list of every gap (greater than your desired threshold) in your sequence.

One nice thing about this approach is that you should be able to maintain this index as you modify the base list. So the O(n*log(n)) initial time spent building the index is amortized by O(log(n)) cost for subsequent queries and updates to the indexes.

Here's a very crude function to build the gap_index():

def gap_idx(s, thresh=2):
    ret = dict()
    lw = s[0]  # initial low val.
    for z,i in enumerate(s[1:]):
        if i - lw < thresh:
            lw = i
            continue
        key = i - lw
        if key not in ret:
            ret[key] = list()
        ret[key].append(z)
        lw = i
    return ret

A class to maintain both a data set and the index might best be built around the built-in 'bisect' module and its insort() function.

Pretty much what aix did ... but getting only the gaps of the desired length:

def findGaps(mylist, gap_length, start_idx=0):
    gap_starts = []
    for idx in range(start_idx, len(mylist) - 1):
        if mylist[idx+1] - mylist[idx] == gap_length + 1:
            gap_starts.append(mylist[idx] + 1)

    return gap_starts

EDIT: Adjusted to the OP's wishes.

These provide a single walk of your input list.

List of gap values for given length:

from itertools import tee, izip
def gapsofsize(iterable, length):
    a, b = tee(iterable)
    next(b, None)
    return ( p for x, y in izip(a, b) if y-x == length+1 for p in xrange(x+1,y) )

print list(gapsofsize([1,2,5,8,9], 2))

[3, 4, 6, 7]

All gap values:

def gaps(iterable):
    a, b = tee(iterable)
    next(b, None)
    return ( p for x, y in izip(a, b) if y-x > 1 for p in xrange(x+1,y) )

print list(gaps([1,2,4,5,8,9,14]))

[3, 6, 7, 10, 11, 12, 13]

List of gaps as vectors:

def gapsizes(iterable):
    a, b = tee(iterable)
    next(b, None)
    return ( (x+1, y-x-1) for x, y in izip(a, b) if y-x > 1 )

print list(gapsizes([1,2,4,5,8,9,14]))

[(3, 1), (6, 2), (10, 4)]

Note that these are generators and consume very little memory. I would love to know how these perform on your test dataset.

If it's efficiency you're after, I'd do something along the following lines (where x is the list of sequence numbers):

for i in range(1, len(x)):
  if x[i] - x[i - 1] == length + 1:
    print list(range(x[i - 1] + 1, x[i]))
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!