Question
My question is: what data structure does the re module in Python implement? (I am interested in any Python implementation, although I have only looked at the source code of CPython and PyPy.)
In case you wonder why I am interested, here is more context:
I have been trying to understand the implementation of the re Python module. Currently, I am amazed by how fast it is at finding multiple patterns in a string, compared to other data structures I have used, like Suffix Trees and Suffix Arrays (for reference, see the discussion of re.search being faster than str.find in "Fast way of finding the longest substring using only regex").
In theory, the algorithmic complexity of searching for k patterns of length m each, in a string of length n, using re.search should be greater than O(k*m + n), so I was expecting re.search to be slower than an implementation of a Suffix Tree (which I already tried). However, this is not the case, which is why I became interested in knowing how they managed to implement a search that works so well even with very long strings.
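As a minimal sketch of the kind of comparison described above, assuming the k patterns are literal strings joined into a single alternation (the helper names find_all_with_re and find_first_with_find are hypothetical and not part of the re module):

```python
import re


def find_all_with_re(patterns, text):
    """Return (start, matched text) pairs using one compiled alternation."""
    # re.escape treats each pattern as a literal string. Longer patterns are
    # listed first because alternation tries its branches left to right.
    ordered = sorted(patterns, key=len, reverse=True)
    regex = re.compile("|".join(re.escape(p) for p in ordered))
    # finditer yields non-overlapping matches, scanning left to right.
    return [(m.start(), m.group()) for m in regex.finditer(text)]


def find_first_with_find(patterns, text):
    """Return {pattern: first index, or -1} using k separate str.find scans."""
    return {p: text.find(p) for p in patterns}


text = "banana bandana"
patterns = ["ana", "band", "xyz"]

hits = find_all_with_re(patterns, text)
firsts = find_first_with_find(patterns, text)
print(hits)    # [(1, 'ana'), (7, 'band'), (11, 'ana')]
print(firsts)  # {'ana': 1, 'band': 7, 'xyz': -1}
```

The two helpers do not answer exactly the same query (one reports every non-overlapping hit, the other the first hit per pattern), so any timing comparison between them should be read with that difference in mind.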
For reference, these are the implementation files I have reviewed:
- https://github.com/python/cpython/blob/master/Lib/sre_constants.py
- https://github.com/python/cpython/blob/master/Lib/sre_compile.py
- https://github.com/python/cpython/blob/master/Lib/sre_parse.py
- https://github.com/mozillazg/pypy/blob/master/pypy/module/_sre/interp_sre.py
Source: https://stackoverflow.com/questions/60349511/data-structure-used-for-regexes-in-the-python-standard-library-re-module