Data-structure used for regexes in the Python standard library (re module)

Submitted by 梦想与她 on 2020-02-23 07:14:14

Question


My question is: what data structure does the re module in Python use to implement regexes? (I am interested in any Python implementation, although I have only looked at the source code of CPython and PyPy.)

In case you wonder why I am interested, here is more context:

I have been trying to understand the implementation of Python's re module. Currently, I am amazed at how fast it is at finding multiple patterns in a string, compared to other data structures I have used, such as suffix trees and suffix arrays (for reference, see the discussion of re.search being faster than str.find in "Fast way of finding the longest substring using only regex").

In theory, the algorithmic complexity of searching for k patterns of length m each, in a string of length n, using re.search should be greater than O(k*m + n), so I was expecting re.search to be slower than an implementation of a suffix tree (which I have already tried). However, this is not the case, which is why I became interested in knowing how they managed to implement search functionality that works so well even with very long strings.
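To make the scenario concrete, the kind of multi-pattern search I mean could look roughly like the sketch below. The helper name find_first and the sample strings are made up for illustration, and joining escaped literals with "|" is just one way to hand k patterns to re.search; it says nothing about how the engine handles them internally.

```python
import re


def find_first(patterns, text):
    """Return (start, matched_text) for the leftmost match of any pattern, or None.

    Hypothetical helper: each pattern is treated as a literal string.
    """
    # re.escape makes every pattern match literally; "|" joins them
    # into a single alternation that is compiled once.
    combined = re.compile("|".join(re.escape(p) for p in patterns))
    match = combined.search(text)
    if match is None:
        return None
    return match.start(), match.group(0)


if __name__ == "__main__":
    print(find_first(["needle", "pin"], "a pin in a haystack with a needle"))
```

Timing this against a suffix-tree lookup on the same inputs is the comparison that surprised me.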

For reference, these are the implementation files I have reviewed:

  • https://github.com/python/cpython/blob/master/Lib/sre_constants.py
  • https://github.com/python/cpython/blob/master/Lib/sre_compile.py
  • https://github.com/python/cpython/blob/master/Lib/sre_parse.py
  • https://github.com/mozillazg/pypy/blob/master/pypy/module/_sre/interp_sre.py
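One way to peek at what these modules produce for a given pattern, without reading the sources, is the documented re.DEBUG flag, which prints debug information about the compiled expression. The sketch below only captures that printout; the helper name dump_pattern is hypothetical, the exact output format varies between Python versions, and it shows the parsed pattern rather than explaining how matching is carried out.

```python
import contextlib
import io
import re


def dump_pattern(pattern):
    """Return the debug text that re prints when compiling `pattern`.

    Hypothetical helper; the output format is version dependent.
    """
    buffer = io.StringIO()
    # re.DEBUG makes re.compile print its debug information to stdout,
    # so stdout is redirected into a string buffer for the duration.
    with contextlib.redirect_stdout(buffer):
        re.compile(pattern, re.DEBUG)
    return buffer.getvalue()


if __name__ == "__main__":
    print(dump_pattern("ab|cd"))
```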

Source: https://stackoverflow.com/questions/60349511/data-structure-used-for-regexes-in-the-python-standard-library-re-module
