How do (*SKIP) or (*F) work on regex?

前端 未结 2 1881
有刺的猬
有刺的猬 2020-11-21 06:54

I\'m learning an advanced usage of regex and noticed that many posts use (*SKIP) or (*F) in it.

I posted a question where the idea was to m

2条回答
  •  暗喜
    暗喜 (楼主)
    2020-11-21 07:29

    These two backtracking control verbs are implemented only in Perl, PCRE and the pypi regex module.

    The idea of the (*SKIP)(*FAIL) trick is to consume characters that you want to avoid, and that must not be a part of the match result.

    A classical pattern that uses of this trick looks like that:

    What_I_want_to_avoid(*SKIP)(*FAIL)|What_I_want_to_match
    

    A regex engine processes a string like that:

    • the first token of the pattern is tested on each character from left to right (by default most of the time, but some regex engines can be set to work from right to left, .net can do this if I remember well)

    • if the first token matches, then the regex engine tests the next token of the pattern with the next characters (after the first token match) etc.

    • when a token fails, the regex engine gets the characters matched by the last token back and tries another way to make the pattern succeed (if it doesn't work too, the regex engine do the same with the previous token etc.)

    When the regex engine meets the (*SKIP) verb (in this case all previous tokens have obviously succeeded), it has no right anymore to go back to all the previous tokens on the left and has no right anymore to retry all the matched characters with another branch of the pattern or at the next position in the string until the last matched character (included) if the pattern fails later on the right of the (*SKIP) verb.

    The role of (*FAIL) is to force the pattern to fail. Thus all the characters matched on the left of (*SKIP) are skipped and the regex engine continues its job after these characters.

    The only possibility for the pattern to succeed in the example pattern is that the first branch fails before (*SKIP) to allow the second branch to be tested.

    You can find another kind of explanation here.

    About Java and other regex engines that don't have these two features

    Backtracking control verbs are not implemented in other regex engines and there are no equivalent.

    However, you can use several ways to do the same (to be more clear, to avoid something that can be possibly matched by an other part of the pattern).

    The use of capture groups:

    way 1:

    What_I_want_to_avoid|(What_I_want_to_match)
    

    You only need to extract the capture group 1 (or to test if it exists), since it is what you are looking for. If you use the pattern to perform a replacement, you can use the properties of the match result (offset, length, capture group) to make the replacement with classical string functions. Other language like javascript, ruby... allows to use a callback function as replacement.

    way 2:

    ((?>To_avoid|Other_things_that_can_be_before_what_i_want)*)(What_I_want)
    

    It's the more easy way for the replacement, no need to callback function, the replacement string need only to begin with \1 (or $1)

    The use of lookarounds:

    example, you want to find a word that is not embedded between two other words (lets say S_word and E_word that are different (see Qtax comment)):

    (the edge cases S_word E_word word E_word and S_word word S_word E_word are allowed in this example.)

    The backtracking control verb way will be:

    S_word not_S_word_or_E_word E_word(*SKIP)(*F)|word
    

    To use this way the regex engine needs to allow variable length lookbehinds to a certain extent. With .net or the new regex module, no problems, lookbehinds can have a totally variable length. It is possible with Java too but the size must be limited (example: (?<=.{1,1000})).

    The Java equivalent will be:

    word(?:(?!not_S_word_or_E_word E_word)|(?

    Note that in some cases, only the lookahead is necessary. Note too that starting a pattern with literal character is more efficient than starting with a lookbehind, that's why I putted it after the word (even if I need to rewrite the word one more time in the assertion.)

提交回复
热议问题