How do regular expressions work behind the scenes (at the CPU level)?

旧时难觅i 2021-02-09 12:30

Do interpreters and compilers compare (and ultimately match) two strings for a potential match in a character-by-character and left-to-right fashion? Or is there an underlying

3 answers
  • 2021-02-09 12:38

    How regular expressions work is an implementation detail. The same pattern language can be implemented in more than one way.

    In fact, some languages implement them inefficiently: a backtracking engine can take exponential time on certain patterns.

    If you want to understand more, I can refer you to this article: https://swtch.com/~rsc/regexp/regexp1.html

  • 2021-02-09 12:45

    Regular expressions don't specify implementation details. It's up to the language to decide the best way to compile them to machine code. For example, this regex.c implementation looks like it goes more than one character at a time.

  • 2021-02-09 12:52

    There are two big families of regex engines, called NFA and DFA (I'm using the terminology from Jeffrey Friedl's book):

    • Nondeterministic finite automaton
    • Deterministic finite automaton

    An NFA implementation will roughly work the following way:

    • Keep a pointer to a current offset in the input string
    • Keep a pointer to the current position in the pattern (which is interpreted as a graph or tree).

    Then use the pattern as a recipe for how to advance in the input string. If the pattern says "a", for instance, and the current input offset points to an "a" character, then that character is consumed and both pointers advance to the next position. If the character doesn't match, the engine backtracks: the input pointer goes back to a previous valid position, and the pattern pointer is set to a different possible alternative at that input position.

    The point is that the recognition is driven by the pattern.

    (The above explanation is very rough, as it doesn't include optimizations etc., and modern pattern features such as backreferences cannot be implemented with a formal finite automaton anyway.)
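
    To make the pattern-driven loop concrete, here is a minimal sketch of a backtracking matcher in Python. It is not how any particular engine is written. The function name match_here and the tiny pattern syntax (literal characters, "." for any character, and "*" after a single character) are assumptions made for illustration only.

```python
def match_here(pattern, text, p=0, t=0):
    """Return True if pattern[p:] matches a prefix of text[t:].

    p is the pattern pointer, t is the input pointer (both hypothetical names).
    Assumes a well-formed pattern: '*' always follows exactly one character.
    """
    if p == len(pattern):
        return True  # whole pattern consumed: success

    # Element followed by '*': consume greedily, then give characters back.
    if p + 1 < len(pattern) and pattern[p + 1] == "*":
        end = t
        while end < len(text) and pattern[p] in (".", text[end]):
            end += 1
        while end >= t:
            if match_here(pattern, text, p + 2, end):
                return True
            end -= 1  # backtrack: move the input pointer to an earlier position
        return False

    # Plain element: both pointers advance together on a match.
    if t < len(text) and pattern[p] in (".", text[t]):
        return match_here(pattern, text, p + 1, t + 1)
    return False
```

    With the pattern "a*ab" and the input "aaab", the star first swallows all three "a" characters, the following "a" then fails against "b", and the matcher gives one "a" back before it succeeds. That give-back step is the backtrack described above.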

    A DFA implementation works the other way around:

    There is still one input pointer, but there are multiple pattern pointers. The input pointer advances character by character, and the pattern pointers keep track of every state in the pattern that is still valid for the input seen so far.

    The recognition is driven by the input.
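
    The input-driven approach can be sketched the same way. The code below is only an illustration under assumptions: the names closure and match_full are hypothetical, the pattern syntax is limited to literals, "." and a "*" after a single character, and the set of pattern positions is computed on the fly instead of being precompiled into a transition table as a real DFA engine might do.

```python
def closure(pattern, positions):
    """Add every position reachable without consuming input (skipping 'x*')."""
    result = set()
    stack = list(positions)
    while stack:
        p = stack.pop()
        if p in result:
            continue
        result.add(p)
        if p + 1 < len(pattern) and pattern[p + 1] == "*":
            stack.append(p + 2)  # zero repetitions of the starred element
    return result


def match_full(pattern, text):
    """Return True if the whole text matches; never revisits an input character."""
    states = closure(pattern, {0})  # all live pattern pointers
    for ch in text:                 # single pass, driven by the input
        nxt = set()
        for p in states:
            if p < len(pattern) and pattern[p] in (".", ch):
                starred = p + 1 < len(pattern) and pattern[p + 1] == "*"
                nxt.add(p if starred else p + 1)
        states = closure(pattern, nxt)
        if not states:
            return False            # no pattern pointer survived this character
    return len(pattern) in states
```

    The input pointer only ever moves forward, and the set of pattern pointers can never hold more than len(pattern) + 1 entries, so the work per input character is bounded by the pattern size.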

    Both these methods have very different properties:

    • NFA engines can offer many more features, but their running time depends on the combination of the input and the pattern itself
    • DFA engines offer fewer features, but their complexity is O(n), where n is the length of the input string.
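
    One way to see the first point is to count the work a backtracking matcher does. The sketch below is an assumption-laden toy, not a benchmark of any real engine: count_backtracking_steps is a hypothetical helper that supports only literals, "." and "*" after a single character, and it simply counts recursive calls.

```python
def count_backtracking_steps(pattern, text):
    """Return (matched, steps) for a whole-string match with a toy backtracker."""
    steps = 0

    def here(p, t):
        nonlocal steps
        steps += 1  # one step per (pattern pointer, input pointer) attempt
        if p == len(pattern):
            return t == len(text)
        if p + 1 < len(pattern) and pattern[p + 1] == "*":
            end = t
            while end < len(text) and pattern[p] in (".", text[end]):
                end += 1
            while end >= t:
                if here(p + 2, end):
                    return True
                end -= 1  # backtrack and retry with a shorter repetition
            return False
        if t < len(text) and pattern[p] in (".", text[t]):
            return here(p + 1, t + 1)
        return False

    matched = here(0, 0)
    return matched, steps


# The same pattern on inputs of length 10 and 20 that cannot match.
_, steps_10 = count_backtracking_steps("a*a*a*a*b", "a" * 10)
_, steps_20 = count_backtracking_steps("a*a*a*a*b", "a" * 20)
```

    Doubling the input length here multiplies the step count by far more than two, because every way of splitting the "a" characters among the four stars is tried before the match fails. A matcher that tracks a set of pattern positions would visit each input character once.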

    Some regex engines (such as PCRE) can implement both recognition methods. I recommend you read the PCRE docs about the two matching algorithms, which explain the differences in more technical terms.

    As to the actual implementation, it highly depends on the regex engine itself. PCRE has several of them:

    • An NFA algorithm based on a tree traversal approach
    • An optimized version of the above based on JIT compilation (one version for each supported instruction set)
    • A DFA implementation

    So there are several possible approaches for NFA alone. Other engines have different implementations that allow for a different feature set. For instance, .NET's regexes can be matched left-to-right or right-to-left, and thus can easily provide variable-length lookbehind. PCRE's lookbehind, by contrast, is implemented by temporarily shifting the input pointer to the left by the lookbehind's expected input length and performing a left-to-right match from this offset.
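
    The "shift left by a known length, then match left-to-right" idea can be sketched as follows. This is not PCRE's code: fixed_lookbehind and its parameters are hypothetical, and Python's re module merely stands in for the engine's own sub-pattern matcher.

```python
import re


def fixed_lookbehind(text, pos, sub_pattern, width):
    """Check whether sub_pattern matches the `width` characters ending at pos.

    width is the lookbehind's expected input length, known in advance
    (an assumption of this sketch).
    """
    start = pos - width          # temporarily move the input pointer left
    if start < 0:
        return False             # not enough input behind the current position
    # Ordinary left-to-right match, restricted to text[start:pos].
    return re.compile(sub_pattern).fullmatch(text, start, pos) is not None
```

    For example, fixed_lookbehind("price: $42", 8, r"\$", 1) checks whether the single character just before offset 8 is a dollar sign. Because the engine has to know how far to step back, this scheme works naturally only when the lookbehind has a known length.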
