Perform a binary search for a string prefix in Python

Submitted by 笑着哭i on 2021-02-07 08:13:45

Question


I want to search a sorted list of strings for all of the elements that start with a given substring.

Here's an example that finds all of the exact matches:

import bisect
names = ['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris']
names.sort()
leftIndex = bisect.bisect_left(names, 'bob')
rightIndex = bisect.bisect_right(names, 'bob')
print(names[leftIndex:rightIndex])

Which prints ['bob', 'bob', 'bob'].

Instead, I want to search for all the names that start with 'bob'. The output I want is ['bob', 'bob', 'bob', 'bobby', 'bobert']. If I could modify the comparison method of the bisect search, then I could use name.startswith('bob') to do this.

As an example, in Java it would be easy. I would use:

Arrays.binarySearch(names, "bob", myCustomComparator);

where 'myCustomComparator' is a comparator that takes advantage of the startswith method (and some additional logic).

How do I do this in Python?


Answer 1:


bisect can be fooled into using a custom comparison by passing it an instance of a class that implements the comparison of your choosing:

>>> class PrefixCompares(object):
...     def __init__(self, value):
...         self.value = value
...     def __lt__(self, other):
...         return self.value < other[0:len(self.value)]
... 
>>> import bisect
>>> names = ['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris']
>>> names.sort()
>>> key = PrefixCompares('bob')
>>> leftIndex = bisect.bisect_left(names, key)
>>> rightIndex = bisect.bisect_right(names, key)
>>> print(names[leftIndex:rightIndex])
['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert']
>>> 

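Oops. The right bisect worked, but the left one obviously didn't: "adam" is not prefixed with "bob"! bisect_left evaluates names[mid] < key, which never reaches our __lt__. The output above is Python 2 behavior, where that comparison falls back to an arbitrary default ordering; on Python 3 the bisect_left call raises TypeError instead. To fix it, you have to adapt the sequence, too.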

>>> class HasPrefix(object):
...     def __init__(self, value):
...         self.value = value
...     def __lt__(self, other):
...         return self.value[0:len(other.value)] < other.value
... 
>>> class Prefix(object):
...     def __init__(self, value):
...         self.value = value
...     def __lt__(self, other):
...         return self.value < other.value[0:len(self.value)]
... 
>>> class AdaptPrefix(object):
...     def __init__(self, seq):
...         self.seq = seq
...     def __getitem__(self, key):
...         return HasPrefix(self.seq[key])
...     def __len__(self):
...         return len(self.seq)
... 
>>> import bisect
>>> names = ['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris']
>>> names.sort()
>>> needle = Prefix('bob')
>>> haystack = AdaptPrefix(names)
>>> leftIndex = bisect.bisect_left(haystack, needle)
>>> rightIndex = bisect.bisect_right(haystack, needle)
>>> print(names[leftIndex:rightIndex])
['bob', 'bob', 'bob', 'bobby', 'bobert']
>>> 



Answer 2:


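Unfortunately, bisect does not let you specify a key function (a key parameter was only added in Python 3.10). What you can do, though, is append '\xff\xff\xff\xff' to the prefix before passing it to bisect_right to find the highest index, then take the elements between the left and right indexes. This assumes no name has a character above '\xff' immediately after the prefix.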
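The sentinel idea can be sketched as follows. In Python 3, strings are Unicode, so '\xff' is only U+00FF and would miss names whose next character sorts above it. This sketch therefore swaps in chr(sys.maxunicode) as the sentinel, which is a departure from the answer's '\xff'. It assumes no stored name contains that character, and prefix_slice is a hypothetical helper name:

```python
import bisect
import sys

# Highest code point a Python string can hold (U+10FFFF), used as a sentinel.
# Assumption: no stored name contains this character.
SENTINEL = chr(sys.maxunicode)

def prefix_slice(sorted_names, prefix):
    """Return the elements of sorted_names that start with prefix."""
    # First element >= prefix.
    left = bisect.bisect_left(sorted_names, prefix)
    # First element > prefix + SENTINEL, i.e. past everything with the prefix.
    right = bisect.bisect_right(sorted_names, prefix + SENTINEL)
    return sorted_names[left:right]

names = sorted(['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris'])
print(prefix_slice(names, 'bob'))
```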




Answer 3:


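As an alternative to IfLoop's answer: why not also define the __gt__ special method? bisect_left evaluates names[mid] < key, and since str cannot compare itself with our class, Python falls back to the reflected key.__gt__(names[mid]).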

>>> class PrefixCompares(object):
...     def __init__(self, value):
...         self.value = value
...     def __lt__(self, other):
...         return self.value < other[0:len(self.value)]
...     def __gt__(self, other):
...         # Reached (reflected) when bisect_left evaluates names[mid] < key.
...         return self.value > other
... 
>>> import bisect
>>> names = ['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris']
>>> names.sort()
>>> key = PrefixCompares('bob')
>>> leftIndex = bisect.bisect_left(names, key)
>>> rightIndex = bisect.bisect_right(names, key)
>>> print(names[leftIndex:rightIndex])
['bob', 'bob', 'bob', 'bobby', 'bobert']
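To see why adding __gt__ is enough, it may help to record which method each bisect function ends up calling. The sketch below uses a hypothetical LoggingPrefix class (not from the answer) and relies on bisect only ever using the < operator: bisect_left evaluates a[mid] < x, which str cannot answer for a foreign type, so Python tries the reflected x.__gt__(a[mid]); bisect_right evaluates x < a[mid], which calls x.__lt__ directly.

```python
import bisect

class LoggingPrefix:
    """Prefix key that records which comparison methods get called."""

    def __init__(self, value):
        self.value = value
        self.calls = []

    def __lt__(self, other):
        # key < element: compare against the element truncated to our length.
        self.calls.append('__lt__')
        return self.value < other[:len(self.value)]

    def __gt__(self, other):
        # Reflected form of element < key.
        self.calls.append('__gt__')
        return self.value > other

names = sorted(['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris'])
key = LoggingPrefix('bob')

left = bisect.bisect_left(names, key)
left_calls = set(key.calls)
key.calls.clear()

right = bisect.bisect_right(names, key)
right_calls = set(key.calls)

print(left_calls, right_calls, names[left:right])
```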



Answer 4:


Coming from a functional programming background, I'm flabbergasted that there's no common binary search abstraction to which you can supply custom comparison functions.

To prevent myself from duplicating that thing over and over again or using gross and unreadable OOP hacks, I've simply written an equivalent of the Arrays.binarySearch(names, "bob", myCustomComparator); function you mentioned:

class BisectRetVal():
    LOWER, HIGHER, STOP = range(3)

def generic_bisect(arr, comparator, lo=0, hi=None): 
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(arr)
    while lo < hi:
        mid = (lo+hi)//2
        result = comparator(arr, mid)  # call the comparator only once per step
        if result == BisectRetVal.STOP:
            return mid
        elif result == BisectRetVal.HIGHER:
            lo = mid+1
        else:
            hi = mid
    return lo

That was the generic part. And here are the specific comparators for your case:

def string_prefix_comparator_right(prefix):
    def parametrized_string_prefix_comparator_right(array, mid):
        if array[mid][0:len(prefix)] <= prefix:
            return BisectRetVal.HIGHER
        else:
            return BisectRetVal.LOWER
    return parametrized_string_prefix_comparator_right


def string_prefix_comparator_left(prefix):
    def parametrized_string_prefix_comparator_left(array, mid):
        if array[mid][0:len(prefix)] < prefix: # < is the only diff. from right
            return BisectRetVal.HIGHER
        else:
            return BisectRetVal.LOWER
    return parametrized_string_prefix_comparator_left

Here's the code snippet you provided adapted to this function:

>>> names = ['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris']
>>> names.sort()
>>> leftIndex = generic_bisect(names, string_prefix_comparator_left("bob"))
>>> rightIndex = generic_bisect(names, string_prefix_comparator_right("bob"))
>>> names[leftIndex:rightIndex]
['bob', 'bob', 'bob', 'bobby', 'bobert']

It works unaltered in both Python 2 and Python 3.
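A smaller variant of the same idea, offered only as a sketch and not as this answer's method, is to pass a one-argument predicate instead of a three-way comparator and search for the point where the predicate flips from True to False. The names partition_point and prefix_bounds are hypothetical:

```python
def partition_point(arr, pred, lo=0, hi=None):
    """Return the first index in arr[lo:hi] where pred(item) is False.

    Assumes pred is True for a (possibly empty) leading run of arr
    and False for every element after that run.
    """
    if hi is None:
        hi = len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        if pred(arr[mid]):
            lo = mid + 1   # boundary lies to the right of mid
        else:
            hi = mid       # mid may itself be the boundary
    return lo

def prefix_bounds(sorted_names, prefix):
    """Return (left, right) bounding the elements that start with prefix."""
    n = len(prefix)
    left = partition_point(sorted_names, lambda s: s[:n] < prefix)
    # Everything before left also satisfies <=, so the search can start there.
    right = partition_point(sorted_names, lambda s: s[:n] <= prefix, lo=left)
    return left, right

names = sorted(['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris'])
left, right = prefix_bounds(names, 'bob')
print(names[left:right])
```

The two lambdas differ only in < versus <=, mirroring the left and right comparators above.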

For more info on how this works and more comparators for this thing check out this gist: https://gist.github.com/Shnatsel/e23fcd2fe4fbbd869581




Answer 5:


Here's a solution that hasn't been offered yet: re-implement the binary search algorithm.

This should usually be avoided because you're repeating code (and binary search is easy to mess up), but it seems there's no nice solution.

bisect_left() already gives the desired result, so we only need to change bisect_right(). Here's the original implementation for reference:

def bisect_right(a, x, lo=0, hi=None):
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if x < a[mid]: hi = mid
        else: lo = mid+1
    return lo

And here's the new version. The only changes are that I added the condition "and not a[mid].startswith(x)" and renamed the function to bisect_right_prefix:

def bisect_right_prefix(a, x, lo=0, hi=None):
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if x < a[mid] and not a[mid].startswith(x): hi = mid
        else: lo = mid+1
    return lo

Now the code looks like this:

import bisect
names = ['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris']
names.sort()
leftIndex = bisect.bisect_left(names, 'bob')
rightIndex = bisect_right_prefix(names, 'bob')
print(names[leftIndex:rightIndex])

Which produces the expected result:

['bob', 'bob', 'bob', 'bobby', 'bobert']

What do you think, is this the way to go?
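Since binary search is easy to get subtly wrong, one way to gain some confidence is to compare the result against a brute-force filter on random inputs. The sketch below repeats bisect_right_prefix from this answer so that it runs on its own; the check function and its parameters are hypothetical additions, and passing it is evidence rather than proof:

```python
import bisect
import random

def bisect_right_prefix(a, x, lo=0, hi=None):
    # Same function as in the answer above.
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if x < a[mid] and not a[mid].startswith(x):
            hi = mid
        else:
            lo = mid + 1
    return lo

def random_word(rng, max_len):
    # A tiny alphabet makes shared prefixes and duplicates likely.
    return ''.join(rng.choice('ab') for _ in range(rng.randint(0, max_len)))

def check(trials=500, seed=0):
    """Return True if the bisect-based slice matches a linear scan every time."""
    rng = random.Random(seed)
    for _ in range(trials):
        names = sorted(random_word(rng, 4) for _ in range(rng.randint(0, 12)))
        prefix = random_word(rng, 3)
        left = bisect.bisect_left(names, prefix)
        right = bisect_right_prefix(names, prefix)
        expected = [s for s in names if s.startswith(prefix)]
        if names[left:right] != expected:
            return False
    return True

print(check())
```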



Source: https://stackoverflow.com/questions/7380629/perform-a-binary-search-for-a-string-prefix-in-python
