Question
I want to search a sorted list of strings for all of the elements that start with a given substring.
Here's an example that finds all of the exact matches:
import bisect
names = ['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris']
names.sort()
leftIndex = bisect.bisect_left(names, 'bob')
rightIndex = bisect.bisect_right(names, 'bob')
print(names[leftIndex:rightIndex])
Which prints ['bob', 'bob', 'bob'].
Instead, I want to search for all the names that start with 'bob'. The output I want is ['bob', 'bob', 'bob', 'bobby', 'bobert']. If I could modify the comparison method of the bisect search, then I could use name.startswith('bob') to do this.
As an example, in Java it would be easy. I would use:
Arrays.binarySearch(names, "bob", myCustomComparator);
where 'myCustomComparator' is a comparator that takes advantage of the startswith method (and some additional logic).
How do I do this in Python?
Answer 1:
bisect can be fooled into using a custom comparison by passing it an instance of a class that implements the comparison of your choosing:
>>> class PrefixCompares(object):
...     def __init__(self, value):
...         self.value = value
...     def __lt__(self, other):
...         return self.value < other[0:len(self.value)]
...
>>> import bisect
>>> names = ['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris']
>>> names.sort()
>>> key = PrefixCompares('bob')
>>> leftIndex = bisect.bisect_left(names, key)
>>> rightIndex = bisect.bisect_right(names, key)
>>> print(names[leftIndex:rightIndex])
['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert']
>>>
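D'oh. The right bisect worked, but the left one obviously didn't: "adam" is not prefixed with "bob"! (That output is from Python 2. On Python 3, bisect_left raises a TypeError here instead, because it evaluates a[mid] < key and the class defines no __gt__ for Python to fall back on.) To fix it, you have to adapt the sequence, too.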
>>> class HasPrefix(object):
...     def __init__(self, value):
...         self.value = value
...     def __lt__(self, other):
...         return self.value[0:len(other.value)] < other.value
...
>>> class Prefix(object):
...     def __init__(self, value):
...         self.value = value
...     def __lt__(self, other):
...         return self.value < other.value[0:len(self.value)]
...
>>> class AdaptPrefix(object):
...     def __init__(self, seq):
...         self.seq = seq
...     def __getitem__(self, key):
...         return HasPrefix(self.seq[key])
...     def __len__(self):
...         return len(self.seq)
...
>>> import bisect
>>> names = ['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris']
>>> names.sort()
>>> needle = Prefix('bob')
>>> haystack = AdaptPrefix(names)
>>> leftIndex = bisect.bisect_left(haystack, needle)
>>> rightIndex = bisect.bisect_right(haystack, needle)
>>> print(names[leftIndex:rightIndex])
['bob', 'bob', 'bob', 'bobby', 'bobert']
>>>
Answer 2:
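Unfortunately, bisect did not allow you to specify a key function when this was written (a key parameter was only added in Python 3.10). What you can do, though, is append '\xff\xff\xff\xff' to the string before using it to find the highest index, then take the elements between the lowest index and that one.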
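A minimal sketch of that suffix trick, under a few assumptions: the helper name prefix_range is hypothetical, and the sentinel is "\U0010ffff" (the highest code point a Python 3 str can hold) rather than the '\xff' bytes suggested above, since '\xff' is not the largest character in a Python 3 string. A name that itself contains the sentinel right after the prefix would be missed, so treat this as an illustration rather than a general solution.

```python
import bisect

def prefix_range(sorted_names, prefix, sentinel="\U0010ffff"):
    """Return (lo, hi) so that sorted_names[lo:hi] holds the prefix matches.

    Assumes sorted_names is sorted and that no name contains the sentinel
    character directly after the prefix.
    """
    # The first match is the first element that is not smaller than the prefix.
    lo = bisect.bisect_left(sorted_names, prefix)
    # prefix + sentinel sorts after every ordinary string starting with prefix,
    # so bisect_right on it lands just past the last match.
    hi = bisect.bisect_right(sorted_names, prefix + sentinel)
    return lo, hi

names = sorted(['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris'])
lo, hi = prefix_range(names, 'bob')
matches = names[lo:hi]
```

Here matches should contain the three 'bob' entries followed by 'bobby' and 'bobert'.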
Answer 3:
As an alternative to IfLoop's answer, why not define the __gt__ special method as well?
>>> class PrefixCompares(object):
...     def __init__(self, value):
...         self.value = value
...     def __lt__(self, other):
...         return self.value < other[0:len(self.value)]
...     def __gt__(self, other):
...         return self.value > other
...
>>> import bisect
>>> names = ['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris']
>>> names.sort()
>>> key = PrefixCompares('bob')
>>> leftIndex = bisect.bisect_left(names, key)
>>> rightIndex = bisect.bisect_right(names, key)
>>> print(names[leftIndex:rightIndex])
['bob', 'bob', 'bob', 'bobby', 'bobert']
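To see why adding __gt__ is enough, it may help to check which method each bisect function actually reaches. The sketch below uses a hypothetical Probe class (not part of the answer) that records the calls it receives. The assumption being illustrated is that bisect_left evaluates a[mid] < x, which Python resolves through the reflected x.__gt__ when the list holds plain strings, while bisect_right evaluates x < a[mid] and so reaches x.__lt__ directly.

```python
import bisect

class Probe:
    """Hypothetical helper that records which comparison method bisect triggers."""

    def __init__(self, value):
        self.value = value
        self.calls = []

    def __lt__(self, other):
        # Reached from bisect_right, which evaluates `x < a[mid]`.
        self.calls.append('__lt__')
        return self.value < other

    def __gt__(self, other):
        # Reached from bisect_left, which evaluates `a[mid] < x`. A str does not
        # know how to compare itself to a Probe, so Python tries the reflected
        # operation `x > a[mid]` instead.
        self.calls.append('__gt__')
        return self.value > other

names = sorted(['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris'])

left_probe = Probe('bob')
bisect.bisect_left(names, left_probe)

right_probe = Probe('bob')
bisect.bisect_right(names, right_probe)

# Collect the distinct methods each search relied on.
left_methods = set(left_probe.calls)
right_methods = set(right_probe.calls)
```

If the assumption holds, left_methods contains only '__gt__' and right_methods only '__lt__', which is why a key object that defines just __lt__ cannot drive bisect_left over a list of plain strings.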
Answer 4:
Coming from a functional programming background, I'm flabbergasted that there's no common binary search abstraction to which you can supply custom comparison functions.
To avoid duplicating that logic over and over again, or resorting to gross and unreadable OOP hacks, I've simply written an equivalent of the Arrays.binarySearch(names, "bob", myCustomComparator); function you mentioned:
class BisectRetVal():
    LOWER, HIGHER, STOP = range(3)

def generic_bisect(arr, comparator, lo=0, hi=None):
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(arr)
    while lo < hi:
        mid = (lo+hi)//2
        result = comparator(arr, mid)  # ask the comparator once per step
        if result == BisectRetVal.STOP: return mid
        elif result == BisectRetVal.HIGHER: lo = mid+1
        else: hi = mid
    return lo
That was the generic part. And here are the specific comparators for your case:
def string_prefix_comparator_right(prefix):
    def parametrized_string_prefix_comparator_right(array, mid):
        if array[mid][0:len(prefix)] <= prefix:
            return BisectRetVal.HIGHER
        else:
            return BisectRetVal.LOWER
    return parametrized_string_prefix_comparator_right

def string_prefix_comparator_left(prefix):
    def parametrized_string_prefix_comparator_left(array, mid):
        if array[mid][0:len(prefix)] < prefix: # < is the only diff. from right
            return BisectRetVal.HIGHER
        else:
            return BisectRetVal.LOWER
    return parametrized_string_prefix_comparator_left
Here's the code snippet you provided adapted to this function:
>>> names = ['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris']
>>> names.sort()
>>> leftIndex = generic_bisect(names, string_prefix_comparator_left("bob"))
>>> rightIndex = generic_bisect(names, string_prefix_comparator_right("bob"))
>>> names[leftIndex:rightIndex]
['bob', 'bob', 'bob', 'bobby', 'bobert']
It works unaltered in both Python 2 and Python 3.
For more info on how this works and more comparators for this thing check out this gist: https://gist.github.com/Shnatsel/e23fcd2fe4fbbd869581
Answer 5:
Here's a solution that hasn't been offered yet: re-implement the binary search algorithm.
This should usually be avoided because you're repeating code (and binary search is easy to mess up), but it seems there's no nice solution.
bisect_left() already gives the desired result, so we only need to change bisect_right(). Here's the original implementation for reference:
def bisect_right(a, x, lo=0, hi=None):
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if x < a[mid]: hi = mid
        else: lo = mid+1
    return lo
And here's the new version. The only changes are that I added "and not a[mid].startswith(x)" to the comparison, and that I call it "bisect_right_prefix":
def bisect_right_prefix(a, x, lo=0, hi=None):
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if x < a[mid] and not a[mid].startswith(x): hi = mid
        else: lo = mid+1
    return lo
Now the code looks like this:
import bisect
names = ['adam', 'bob', 'bob', 'bob', 'bobby', 'bobert', 'chris']
names.sort()
leftIndex = bisect.bisect_left(names, 'bob')
rightIndex = bisect_right_prefix(names, 'bob')
print(names[leftIndex:rightIndex])
Which produces the expected result:
['bob', 'bob', 'bob', 'bobby', 'bobert']
What do you think, is this the way to go?
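Since binary search is easy to get subtly wrong, one way to gain some confidence in a re-implementation like this is to compare it against a brute-force scan on many small random inputs. The sketch below repeats bisect_right_prefix so that it runs on its own. The check_prefix_search helper, the two-letter alphabet, and the trial count are all hypothetical choices made for illustration.

```python
import bisect
import random

def bisect_right_prefix(a, x, lo=0, hi=None):
    # Same logic as the version above, repeated so this check is self-contained.
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if x < a[mid] and not a[mid].startswith(x):
            hi = mid
        else:
            lo = mid + 1
    return lo

def random_word(rng, max_len):
    # A tiny alphabet makes shared prefixes and duplicates very likely.
    return ''.join(rng.choice('ab') for _ in range(rng.randint(0, max_len)))

def check_prefix_search(trials=500, seed=0):
    """Return True if the bisect-based slice matched a linear scan every time."""
    rng = random.Random(seed)
    for _ in range(trials):
        words = sorted(random_word(rng, 4) for _ in range(rng.randint(0, 12)))
        prefix = random_word(rng, 3)
        left = bisect.bisect_left(words, prefix)
        right = bisect_right_prefix(words, prefix)
        expected = [w for w in words if w.startswith(prefix)]
        if words[left:right] != expected:
            return False
    return True

all_trials_agree = check_prefix_search()
```

A passing run is not a proof, but it does exercise empty lists, empty prefixes, and runs of duplicates, which are the cases where an off-by-one error would usually show up.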
Source: https://stackoverflow.com/questions/7380629/perform-a-binary-search-for-a-string-prefix-in-python