Capture groups using DFA-based (linear-time) regular expressions: possible?

后端 未结 2 692
挽巷
挽巷 2021-02-14 23:48

Is it possible to implement capture groups with DFA-based regular expressions while maintaining a linear time complexity with respect to the input length?

Intuitively I

相关标签:
2条回答
  • 2021-02-14 23:57

    My http://github.com/hoehrmann/demo-parselov does this. I do not currently explain the construction on the web page, but suppose you have a grammar like

    X = "a" B "c"
    B = "b"
    

    You can turn this regular grammar into a graph with labeled vertices

    1. start X
    2. "a"
    3. start B
    4. "b"
    5. final B
    6. "c"
    7. final X

    DFA states correspond to sets of these vertices. The first one would consist of vertices 1 and 2, the second one of vertices 3 and 4, then 5 and 6, and finally 7. If you parse the string "abc", you have

    1. { offset: 0, vertices: [1, 2] }
    2. { offset: 1, vertices: [3, 4] }
    3. { offset: 2, vertices: [5, 6] }
    4. { offset: EOF, vertices: [7] }

    That is also a graph. You can write out the edges using (offset, vertex) pairs as vertices:

    1. (o0, v1) -> (o0, v2)
    2. (o0, v2) -> (o1, v3)
    3. (o1, v3) -> (o1, v4)
    4. ...

    Such a graph might contain vertices that do not ultimately reach the final vertex (EOF, v7), but such vertices can be eliminated in O(n) time. If the grammar is ambiguous, a match would be a path through the resulting graph. There may be many possible paths.

    0 讨论(0)
  • 2021-02-15 00:04

    Is it possible to implement capture groups with DFA-based regular expressions while maintaining a linear time complexity with respect to the input length?

    Yes - at least when the capture groups are deterministic. Consider the example regex /a|(a)/; matching that against the input "a" could either produce a captured group or none.

    I think that capture groups could be based on a theoretical foundation using finite state transducers, which are like automatons but also may output strings while changing states. You may echo the input, but surround each capture group with parenthesis for example.

    Intuitively I think not, because the subset construction procedure does not know which capture group it may have landed inside, but this is the first time I've realized this may be a potential problem, so I don't know.

    Indeed, this is a problem. I think you can solve it by tagging your sets with the capture state, and similarly distinguish the states of your result DFA. You may fail to produce a fully deterministic automaton for regular expression like the above, as Wikipedia writes: "some non-deterministic transducers do not admit equivalent deterministic transducers".

    However, a modification of the subset construction procedure is possible, see Determinization of Transducers. Their algorithm seems to revolve around the following:

    local ambiguities […] are solved by delaying the outputs as far as needed, until these symbols can be written out deterministically.

    For example, the regexes /ab|(a)c/ and even /(a[bc])|ad/ can be resolved into deterministic transducers. Notice that their memory representation may be much larger than if they had no capture groups.

    0 讨论(0)
提交回复
热议问题