How does this PCRE pattern detect palindromes?

清歌不尽 2020-12-05 14:12

This question is an educational demonstration of the usage of lookahead, nested reference, and conditionals in a PCRE pattern to match ALL palindromes.

1 answer
  • 2020-12-05 15:08

    Let's try to understand the regex by constructing it. First, a palindrome must start and end with the same sequence of characters, in opposite order:

    ^(.)(.)(.) ... \3\2\1$
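
    As a quick check of this starting point, here is a minimal sketch in Python (an assumption for illustration; the question is about PCRE, but this form uses only plain backreferences, which Python's re module also supports). The ... is left out, so the pattern covers six characters only, and the function name is made up for the example:

```python
import re

# Fixed-length version of the idea: three captured characters followed by
# the same three characters in reverse order (the "..." is left out).
SIX_CHAR_PALINDROME = re.compile(r"^(.)(.)(.)\3\2\1$")


def is_six_char_palindrome(text: str) -> bool:
    """Return True if the fixed-length pattern matches text."""
    return SIX_CHAR_PALINDROME.match(text) is not None


print(is_six_char_palindrome("abccba"))  # True
print(is_six_char_palindrome("abcabc"))  # False
```

    Every extra pair of characters needs another group and another backreference, which is why the rest of the construction looks for a form that can be repeated.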
    

    We want to rewrite this so that the repeated part (the ...) is followed only by a pattern of finite length, which would let us express the repetition with a *. This is possible with a lookahead:

    ^(.)(?=.*\1$)
     (.)(?=.*\2\1$)
     (.)(?=.*\3\2\1$) ...
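
    The lookahead form can also be tried in Python's re module, since it needs nothing beyond lookahead and ordinary backreferences. This is only a sketch with three steps written out by hand; the helper name is hypothetical:

```python
import re

# Each captured character is followed by a lookahead that checks the END of
# the string, so nothing after the third character is consumed.
FIRST_THREE = re.compile(r"^(.)(?=.*\1$)(.)(?=.*\2\1$)(.)(?=.*\3\2\1$)")


def mirrored_prefix(text: str):
    """Return the 3-character prefix if the rest of text ends with its reverse."""
    m = FIRST_THREE.match(text)
    return m.group(0) if m else None


print(mirrored_prefix("abccba"))     # abc
print(mirrored_prefix("abcxyzcba"))  # abc
print(mirrored_prefix("abcabc"))     # None
```

    Note that "abcxyzcba" passes: the lookaheads constrain only the end of the string, and only the three captured characters are consumed.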
    

    But the steps still differ from one another: each lookahead is longer than the previous one. What if we could "record" the previously captured groups? If that were possible, we could rewrite it as:

    ^(.)(?=.*(?<record>\1\k<record>)$)   # \1     = \1 + (empty)
     (.)(?=.*(?<record>\2\k<record>)$)   # \2\1   = \2 + \1
     (.)(?=.*(?<record>\3\k<record>)$)   # \3\2\1 = \3 + \2\1
     ...
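
    The "record" idea is just a recurrence: new record = current character + old record. The following plain-Python sketch simulates it without a regex engine (the function name and the stopping rule are assumptions made for illustration, mirroring where the lookahead would fail):

```python
def recorded_captures(text: str) -> list:
    """Simulate the loop: after each character, record = character + record.

    Stops as soon as the rest of the text no longer ends with the record,
    which is where the corresponding lookahead would fail.
    """
    record = ""
    history = []
    for index, char in enumerate(text):
        candidate = char + record      # the new character, then the old record
        rest = text[index + 1:]        # what the lookahead can see
        if not rest.endswith(candidate):
            break
        record = candidate
        history.append(record)
    return history


print(recorded_captures("abccba"))  # ['a', 'ba', 'cba']
```

    Each entry is the previous one with a single character put in front, so every step has the same shape and can be repeated.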
    

    which could be converted into

    ^(?: 
        (.)(?=.*(\1\2)$)
     )*
    

    This is almost right, except that \2 (the recorded capture) does not start out empty: on the first iteration it has not been set at all, so a reference to it fails and nothing matches. We need it to match the empty string when the recorded capture does not exist yet. This is where the conditional expression comes in.

    (?(2)\2|)   # matches \2 if it exists, empty otherwise.
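
    Python's re module accepts the same (?(N)yes|no) syntax, but it rejects a backreference to a group that is still open, so the nested pattern built here cannot be run there. The sketch below therefore shows only the conditional with an empty "no" branch, on an unrelated, made-up pattern (optional angle brackets around a word):

```python
import re

# (<)? optionally captures an opening bracket as group 1.
# (?(1)>|) then requires ">" only if group 1 took part in the match,
# and matches the empty string otherwise.
BRACKETED = re.compile(r"^(<)?\w+(?(1)>|)$")

for sample in ["<abc>", "abc", "<abc", "abc>"]:
    print(sample, bool(BRACKETED.match(sample)))
```

    The first two samples match and the last two do not: the closing bracket is demanded exactly when the opening one was captured.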
    

    so our expression becomes

    ^(?: 
        (.)(?=.*(\1(?(2)\2|))$)
     )*
    

    Now it matches the first half of the palindrome. What about the 2nd half? Well, after the 1st half is matched, the recorded capture \2 will contain the 2nd half. So let's just put it at the end.

    ^(?: 
        (.)(?=.*(\1(?(2)\2|))$)
     )*\2$
    

    We want to take care of odd-length palindromes as well. They have a free character between the 1st and 2nd halves.

    ^(?: 
        (.)(?=.*(\1(?(2)\2|))$)
     )*.?\2$
    

    This works well except in one case: when there is only 1 character. Again, this is because \2 is unset and so fails to match. So we make it optional:

    ^(?: 
        (.)(?=.*(\1(?(2)\2|))$)
     )*.?\2?$
    #      ^ making \2 optional is safe: the lookahead already requires \2 at the end.
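
    To check the reasoning end to end, here is a plain-Python walk-through of what the final pattern does. It is a sketch under stated assumptions, not the regex engine itself: the function name is made up, backtracking is modelled by trying every iteration count, and newline handling of . is ignored.

```python
def matches_like_final_pattern(text: str) -> bool:
    """Mirror the final pattern's logic without a regex engine."""
    # Loop part: find every iteration count whose lookaheads all succeed.
    record = ""
    states = [(0, None)]               # (characters consumed, recorded capture)
    for index, char in enumerate(text):
        candidate = char + record      # \1 followed by the old \2, if any
        if not text[index + 1:].endswith(candidate):
            break                      # the lookahead fails here
        record = candidate
        states.append((index + 1, record))

    # Tail part: .?\2?$ must consume everything that is left.
    # Try the longest prefix first, as the greedy * would.
    for consumed, rec in reversed(states):
        rest = text[consumed:]
        tails = ("",) if rec is None else (rec, "")   # \2? matched or skipped
        for tail in tails:
            middle_len = len(rest) - len(tail)        # what .? has to cover
            if middle_len in (0, 1) and rest.endswith(tail):
                return True
    return False


for sample in ["a", "abba", "abcba", "abca"]:
    print(sample, matches_like_final_pattern(sample))
```

    On the four samples this prints True for "a", "abba" and "abcba", and False for "abca", which agrees with comparing each string to its reverse.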
    