Python: Intersection of full string from list with partial string

廉价感情. 提交于 2020-01-13 03:36:48

问题


Let's say I have a string and a list of strings:

a = 'ABCDEFG'

b = ['ABC', 'QRS', 'AHQ']

How can I pull out the string in list b that matches up perfectly with a section of the string a? So the would return would be something like ['ABC']

The most important issue is that I have tens of millions of strings, so that time efficiency is essential.


回答1:


If you only want the first match in b:

next((s for s in b if s in a), None)

This has the advantage of short-circuiting as soon as it finds a match whereas the other list solutions will keep going. If no match is found, it will return None.




回答2:


Keep in mind that Python's substring search x in a is already optimized pretty well for the general case (and coded in C, for CPython), so you're unlikely to beat it in general, especially with pure Python code.

However, if you have a more specialized case, you can do much better.

For example, if you have an arbitrary list of millions of strings b that all need to be searched for within one giant static string a that never changes, preprocessing a can make a huge difference. (Note that this is the opposite of the usual case, where preprocessing the patterns is the key.)

On the other hand, if you expect matches to be unlikely, and you've got the whole b list in advance, you can probably get some large gains by organizing b in some way. For example, there's no point searching for "ABCD" if "ABC" already failed; if you need to search both "ABC" and "ABD" you can search for "AB" first and then check whether it's followed by "C" or "D" so you don't have to repeat yourself; etc. (It might even be possible to merge all of b into a single regular expression that's close enough to optimal… although with millions of elements, that probably isn't the answer.)

But it's hard to guess in advance, with no more information than you've given us, exactly what algorithm you want.

Wikipedia has a pretty good high-level overview of string searching algorithms. There's also a website devoted to pattern matching in general, which seems to be a bit out of date, but then I doubt you're going to turn out to need an algorithm invented in the past 3 years anyway.




回答3:


Answer:

(x for x in b if x in a )

That will return a generator that will be a list of ones that match. Take the first or loop over it.




回答4:


In [3]: [s for s in b if s in a]
Out[3]: ['ABC']

On my machine this takes about 3 seconds when b contains 20,000,000 elements (tested with a and b containing strings similar to those in the question).




回答5:


You might want to have a look at the following algorithm:

Boyer–Moore string search algorithm And wikipedia

But without knowing more, this might be overkill!



来源:https://stackoverflow.com/questions/14389010/python-intersection-of-full-string-from-list-with-partial-string

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!