问题
Let's say I have a string and a list of strings:
a = 'ABCDEFG'
b = ['ABC', 'QRS', 'AHQ']
How can I pull out the string in list b that matches up perfectly with a section of the string a? So the would return would be something like ['ABC']
The most important issue is that I have tens of millions of strings, so that time efficiency is essential.
回答1:
If you only want the first match in b:
next((s for s in b if s in a), None)
This has the advantage of short-circuiting as soon as it finds a match whereas the other list solutions will keep going. If no match is found, it will return None
.
回答2:
Keep in mind that Python's substring search x in a
is already optimized pretty well for the general case (and coded in C, for CPython), so you're unlikely to beat it in general, especially with pure Python code.
However, if you have a more specialized case, you can do much better.
For example, if you have an arbitrary list of millions of strings b
that all need to be searched for within one giant static string a
that never changes, preprocessing a
can make a huge difference. (Note that this is the opposite of the usual case, where preprocessing the patterns is the key.)
On the other hand, if you expect matches to be unlikely, and you've got the whole b
list in advance, you can probably get some large gains by organizing b
in some way. For example, there's no point searching for "ABCD"
if "ABC"
already failed; if you need to search both "ABC"
and "ABD"
you can search for "AB"
first and then check whether it's followed by "C"
or "D"
so you don't have to repeat yourself; etc. (It might even be possible to merge all of b
into a single regular expression that's close enough to optimal… although with millions of elements, that probably isn't the answer.)
But it's hard to guess in advance, with no more information than you've given us, exactly what algorithm you want.
Wikipedia has a pretty good high-level overview of string searching algorithms. There's also a website devoted to pattern matching in general, which seems to be a bit out of date, but then I doubt you're going to turn out to need an algorithm invented in the past 3 years anyway.
回答3:
Answer:
(x for x in b if x in a )
That will return a generator that will be a list of ones that match. Take the first or loop over it.
回答4:
In [3]: [s for s in b if s in a]
Out[3]: ['ABC']
On my machine this takes about 3 seconds when b
contains 20,000,000 elements (tested with a
and b
containing strings similar to those in the question).
回答5:
You might want to have a look at the following algorithm:
Boyer–Moore string search algorithm And wikipedia
But without knowing more, this might be overkill!
来源:https://stackoverflow.com/questions/14389010/python-intersection-of-full-string-from-list-with-partial-string