Could threading or multiprocessing improve performance when analyzing a single string with multiple regular expressions?

Submitted by 北慕城南 on 2021-02-07 09:20:57

Question


If I want to analyze a string using dozens of regular expressions,
could either the threading or multiprocessing module improve performance?
In other words, would analyzing the string on multiple threads or processes be faster than:

match = re.search(regex1, string)
if match:
    afunction(match)
else:
    match = re.search(regex2, string)
    if match:
        bfunction(match)
    else:
        match = re.search(regex3, string)
        if match:
            cfunction(match)
...

No more than one regular expression would ever match, so that's not a concern.
If the answer is multiprocessing, what technique would you recommend looking into (queues, pipes)?


Answer 1:


Python threading won't improve performance here because of the GIL, which prevents more than one thread from executing Python bytecode at a time. If you have a multicore machine, multiple processes may speed things up, but only if the cost of spawning subprocesses and passing the data around is less than the cost of performing the RE searches themselves.

If you do this often, you might look into process pools (multiprocessing.Pool).
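A minimal sketch of the process-pool approach described above. The patterns and the function names (search_one, find_first_match, PATTERNS) are illustrative, not from the question; whether this beats the sequential loop depends on whether the regex work outweighs the pickling overhead.

```python
import re
from multiprocessing import Pool

# Illustrative stand-ins for the question's regex1..regex3.
PATTERNS = [r"\d{4}-\d{2}-\d{2}", r"[a-z]+@[a-z]+\.[a-z]+", r"\bERROR\b"]

def search_one(args):
    """Run one regex against the text; return (index, matched text) or None."""
    index, pattern, text = args
    match = re.search(pattern, text)
    return (index, match.group(0)) if match else None

def find_first_match(text):
    """Search all patterns in worker processes; return the single hit, if any."""
    with Pool() as pool:
        results = pool.map(search_one,
                           [(i, p, text) for i, p in enumerate(PATTERNS)])
    # Per the question's premise, at most one pattern ever matches.
    for hit in results:
        if hit is not None:
            return hit
    return None

if __name__ == "__main__":
    print(find_first_match("contact: alice@example.com"))
```

Each worker receives a copy of the string, which is exactly the "cost of passing data around" mentioned above.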




Answer 2:


Regexes themselves are a powerful answer: you can compose all the regexes into one big alternation. (In the example below, substitute your regexes for a, b, c and d.)

(?P<A>a)|(?P<B>b)|(?P<C>c)|(?P<D>d)

Use lastindex on the match object to find the index of the group that matched, or lastgroup to get its name directly. Note that groupindex on the compiled pattern maps each group name (the label in angle brackets; I used uppercase names in the example above) to its index, so to go from index to name you would invert that mapping.
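The technique above can be sketched as follows. The concrete patterns and the handler names (stand-ins for afunction, bfunction, cfunction) are illustrative assumptions, not from the question:

```python
import re

# One big alternation of named groups: each branch stands in for one of
# the question's regex1..regex3.
combined = re.compile(
    r"(?P<A>\d{4}-\d{2}-\d{2})"     # e.g. a date
    r"|(?P<B>\bERROR\b)"            # e.g. an error marker
    r"|(?P<C>[A-Za-z]+@[A-Za-z]+)"  # e.g. a crude email
)

# Dispatch table: group name -> handler (placeholders for a/b/cfunction).
handlers = {
    "A": lambda m: "afunction:" + m.group(0),
    "B": lambda m: "bfunction:" + m.group(0),
    "C": lambda m: "cfunction:" + m.group(0),
}

def dispatch(text):
    match = combined.search(text)
    if match is None:
        return None
    # lastgroup names the branch that fired; with the question's premise
    # that at most one regex matches, this identifies it unambiguously.
    return handlers[match.lastgroup](match)
```

One search call replaces the whole cascade of if/else branches from the question.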

Edit: (performance analysis)

In cases where the regexes involved are simple enough to correspond to regular languages, and therefore quick to match with a finite state automaton, this approach actually yields a performance profile similar to parallel evaluation, while surprisingly consuming only one processor.

The reason is that the | operator, along with ?, *, [] and repetition (but in contrast to most uses of backreferences), is one of the operators allowed in regexes that define regular languages. (Notice the "union" in the reference.) The combined regex therefore also describes a regular language and can be searched by a finite state automaton, without any need to backtrack.

Finite state automata spend only a bounded number of operations on each character of the input string. They keep a state (essentially a memory pointer) that represents the "potential matching progress" at the current position in the input. For the combined regex the FSA is larger and takes longer to compile (and the pointer has more locations to point to), but compilation (creating the regex object) can be done once at application startup, after which each subsequent input can be searched quickly.

Let's compare this to thread-based parallel execution of the individual regexes. The progress through each regex will be similar, though not identical for start-of-input-anchored regexes, especially because the final rejection of a non-matching regex is usually much faster than a successful match. Threads do have one minor advantage: the fastest match can terminate the whole computation, whereas the combined regex must finish evaluating all alternatives (all groups). In practice the thread-pool overhead more than offsets this, and with a large number of threads the approach becomes basically unusable.

The performance benefit of the combined-regex technique is thus most noticeable with a large number of regexes, and is paid for with increased memory consumption.

While it meets this question's wish for parallel matching under the hood, for smaller cases like a handful of regexes it may not be worth the extra complexity of composing them.




Answer 3:


Is performance your concern? If not, just put all the REs in a list and loop over it:

import re

for myRE in myListOfRE:
    result = myRE.search(the_string)
    if result is not None:
        # do something with the match, e.g. store it via sqlalchemy
        break

If performance is indeed a concern, I would think multithreading should help: an RE match only needs read access to the string being searched, so it should be possible to share it. I'm not a Python expert, though, so I can't really tell you how to do it.
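A sketch of that thread-based variant using concurrent.futures (the patterns and the name first_match are illustrative). The string is shared read-only across threads with no copying, but as the first answer notes, the GIL means CPU-bound regex matching gains little from threads in CPython:

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Illustrative patterns; per the question's premise, at most one matches.
patterns = [re.compile(p) for p in (r"\d+", r"[a-z]+@[a-z]+", r"\bWARN\b")]

def first_match(text):
    """Search all patterns on worker threads; return (pattern, hit) or None."""
    with ThreadPoolExecutor(max_workers=len(patterns)) as pool:
        # map preserves input order, so results line up with patterns.
        results = pool.map(lambda rx: rx.search(text), patterns)
    for rx, m in zip(patterns, results):
        if m is not None:
            return rx.pattern, m.group(0)
    return None
```

Unlike the multiprocessing version, nothing needs to be pickled here, which is why sharing the string is trivial.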



Source: https://stackoverflow.com/questions/9984288/could-threading-or-multiprocessing-improve-performance-when-analyzing-a-single-s
