variable length masking with preg_replace

孤城傲影 2020-12-10 04:43

I am masking all characters between single quotes (inclusively) within a string using preg_replace_callback(). But I would like to only use preg_replace(

3 Answers
  • 2020-12-10 04:53

    Just for the fun of it (I seriously wouldn't recommend something like this, because I try to avoid lookarounds when they are not necessary), here's one regex that uses the concept of 'back to the future':

    (?<=^|\s)'(?!\s)|(?!^)(?<!'(?=\s))\G.
    


    Okay, it's broken down into two parts:

    1. Matching the beginning single quote

    (?<=^|\s)'(?!\s)
    

    The rules that I believe should be established here are:

    1. There should be either ^ or \s before the beginning quote (hence (?<=^|\s)).
    2. There is no \s after the beginning quote (hence (?!\s)).

    2. Matching the things inside the quote, and the ending quote

    (?!^)\G(?<!'(?=\s)).
    

    The rules that I believe should be established here are:

    1. The character can be any character (hence .)
    2. The match is 1 character long and following the immediate previous match (hence (?!^)\G).
    3. The character must not be preceded by a single quote that is itself followed by whitespace (hence (?<!'(?=\s)); this is the 'back to the future' part). As a result, the pattern will not match a \s that comes right after a ', which marks the end of the characters wrapped between single quotes. In other words, the closing quote is identified as a single quote followed by \s.
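
    As a rough sketch, the pattern above could be passed to preg_replace() like this. The sample string and the variable names are assumptions of mine, not part of the question, and the nowdoc is used only so that the quotes and backslashes need no escaping.

```php
<?php
// Pattern from this answer, wrapped in ~ delimiters.
// A nowdoc keeps the single quotes and backslashes literal.
$pattern = <<<'RE'
~(?<=^|\s)'(?!\s)|(?!^)(?<!'(?=\s))\G.~
RE;

// Hypothetical sample input (not taken from the question).
$str = "TEST 'replace me' ok 'me too'";

// Every match is exactly one character, so each one becomes one hyphen.
$masked = preg_replace($pattern, '-', $str);

echo $masked, PHP_EOL; // TEST ------------ ok --------
```

    Note that this relies on the closing quote being followed by whitespace or the end of the string, as the rules above state.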


  • 2020-12-10 04:56

    Short answer: it's possible!

    Use the following pattern:

    '                                     # Match a single quote
    (?=                                   # Positive lookahead: make sure there is an odd number of single quotes ahead on this line
       (?:(?:[^'\r\n]*'){2})*             # Zero or more pairs of single quotes, each quote preceded by any run of characters that are neither quotes nor newlines
       (?:[^'\r\n]*'[^'\r\n]*(?:\r?\n|$)) # Then exactly one more single quote before the end of the line
    )
    |                                     # or
    \G(?<!^)                              # Preceding contiguous match (not beginning of line)
    [^']                                  # Match anything that's not a single quote
    (?=                                   # Same as above
       (?:(?:[^'\r\n]*'){2})*             # Same as above
       (?:[^'\r\n]*'[^'\r\n]*(?:\r?\n|$)) # Same as above
    )
    |
    \G(?<!^)                              # Preceding contiguous match (not beginning of line)
    '                                     # Match a single quote
    

    Make sure to use the m modifier.


    Long answer: it's a pain :)

    If not only you but your whole team loves regex, you might consider using this pattern, but remember that it is insane and quite difficult for beginners to grasp. Readability (almost) always comes first.

    Here is how I built this regex, step by step:

    1) We first need to know what we actually want to replace: every character between two single quotes (including the quotes themselves) should become a hyphen.
    2) If we're going to use preg_replace(), our pattern needs to match exactly one character at a time.
    3) So the first step is obvious: '.
    4) We'll use \G, which matches at the beginning of the string or at the position where the previous match ended. Take this simple example: ~a|\Gb~. It matches any a, and it matches b only at the beginning of the string or immediately after the previous match.
    5) We don't want anything to do with the beginning of the string, so we'll use \G(?<!^).
    6) Now we need to match anything that's not a single quote: ~'|\G(?<!^)[^']~.
    7) Now the real pain begins: how do we know that the above pattern won't go on to match c in 'ab'c? Well, it will. We need to count the single quotes...
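
    To make the \G behaviour from step 4 concrete, here is a minimal sketch using the ~a|\Gb~ example. The input string 'bab b' is a made-up sample of mine.

```php
<?php
// \G matches at the start of the subject or where the previous match ended.
// Hypothetical input: the last "b" is not adjacent to any previous match,
// because the space before it breaks the chain of contiguous matches.
$demo = preg_replace('~a|\Gb~', '-', 'bab b');

echo $demo, PHP_EOL; // --- b
```

    The first b is replaced because \G also matches at the very start of the string, which is exactly what step 5 rules out with (?<!^).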

    Let's recap:

    a 'bcd' efg 'hij'
      ^ It will match this first
       ^^^ Then it will match these individually with \G(?<!^)[^']
          ^ It will match since we're matching single quotes without checking anything
            ^^^^^ And it will continue to match ...
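
    The over-matching described in this recap can be reproduced with a short sketch. The variable names are mine; the pattern is the one from step 6.

```php
<?php
// Naive pattern from step 6: no check on how many quotes remain.
$naive = <<<'RE'
~'|\G(?<!^)[^']~
RE;

// Once the first quote is hit, every following character is contiguous
// to the previous match, so everything up to the end gets masked.
$overMasked = preg_replace($naive, '-', "a 'bcd' efg 'hij'");

echo $overMasked, PHP_EOL; // a ---------------
```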
    

    What we want can be expressed as these 3 rules:

    a 'bcd' efg 'hij'
    1 ^ Match a single quote only if there is an odd number of single quotes ahead
    2  ^^^ Match individually those characters only if there is an odd number of single quotes ahead
    3     ^ Match a single quote only if there was a match before this character
    

    8) Checking for an odd number of single quotes is easy once we know how to match an even number:

    (?:              # non-capturing group
       (?:           # non-capturing group
          [^'\r\n]*  # Match anything that's not a single quote or newline, zero or more times
          '          # Match a single quote
       ){2}          # Repeat 2 times (We'll be matching 2 single quotes)
    )*               # Repeat all this zero or more times. So we match 0, 2, 4, 6 ... single quotes
    

    9) An odd number is easy now; we just need to add one more quote, followed by the end of the line:

    (?:
       [^'\r\n]*     # Match anything that's not a single quote or newline, zero or more times
       '             # Match a single quote
       [^'\r\n]*     # Match anything that's not a single quote or newline, zero or more times
       (?:\r?\n|$)   # End of line
    )
    

    10) Merging above in a single lookahead:

    (?=
       (?:              # non-capturing group
          (?:           # non-capturing group
             [^'\r\n]*  # Match anything that's not a single quote or newline, zero or more times
             '          # Match a single quote
          ){2}          # Repeat 2 times (We'll be matching 2 single quotes)
       )*               # Repeat all this zero or more times. So we match 0, 2, 4, 6 ... single quotes
       (?:
          [^'\r\n]*     # Match anything that's not a single quote or newline, zero or more times
          '             # Match a single quote
          [^'\r\n]*     # Match anything that's not a single quote or newline, zero or more times
          (?:\r?\n|$)   # End of line
       )
    )
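
    As a quick sanity check of this lookahead on its own, one could test it against what is left of a line. This is only a sketch: the \A anchor and the sample strings are my additions so that the lookahead is evaluated at the very start of each sample.

```php
<?php
// The merged lookahead from step 10, anchored with \A (an addition for
// this test only) so it is checked at the start of each sample string.
$oddAhead = <<<'RE'
~\A(?=(?:(?:[^'\r\n]*'){2})*(?:[^'\r\n]*'[^'\r\n]*(?:\r?\n|$)))~
RE;

// Hypothetical samples: the remainder of a line after the current character.
$samples = ["bcd' efg 'hij'", " efg 'hij'", "hij'", "no quotes"];

$results = [];
foreach ($samples as $rest) {
    // preg_match() returns 1 when an odd number of quotes lies ahead.
    $results[] = preg_match($oddAhead, $rest);
}

print_r($results); // 1, 0, 1, 0 (3, 2, 1 and 0 quotes ahead)
```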
    

    11) Now we need to merge all 3 rules we defined earlier:

    ~                   # Pattern delimiter
    #################################### Rule 1 ####################################
    '                   # A single quote
    (?=                 # Lookahead to make sure there is an odd number of single quotes ahead
       (?:              # non-capturing group
          (?:           # non-capturing group
             [^'\r\n]*  # Match anything that's not a single quote or newline, zero or more times
             '          # Match a single quote
          ){2}          # Repeat 2 times (We'll be matching 2 single quotes)
       )*               # Repeat all this zero or more times. So we match 0, 2, 4, 6 ... single quotes
       (?:
          [^'\r\n]*     # Match anything that's not a single quote or newline, zero or more times
          '             # Match a single quote
          [^'\r\n]*     # Match anything that's not a single quote or newline, zero or more times
          (?:\r?\n|$)   # End of line
       )
    )
    
    |                   # Or
    
    #################################### Rule 2 ####################################
    \G(?<!^)            # Preceding contiguous match (not beginning of line)
    [^']                # Match anything that's not a single quote
    (?=                 # Lookahead to make sure there is an odd number of single quotes ahead
       (?:              # non-capturing group
          (?:           # non-capturing group
             [^'\r\n]*  # Match anything that's not a single quote or newline, zero or more times
             '          # Match a single quote
          ){2}          # Repeat 2 times (We'll be matching 2 single quotes)
       )*               # Repeat all this zero or more times. So we match 0, 2, 4, 6 ... single quotes
       (?:
          [^'\r\n]*     # Match anything that's not a single quote or newline, zero or more times
          '             # Match a single quote
          [^'\r\n]*     # Match anything that's not a single quote or newline, zero or more times
          (?:\r?\n|$)   # End of line
       )
    )
    
    |                   # Or
    
    #################################### Rule 3 ####################################
    \G(?<!^)            # Preceding contiguous match (not beginning of line)
    '                   # Match a single quote
    ~x
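
    A compact way to try the three rules from PHP might look like the sketch below. It assembles the same pattern without the x-mode comments and uses the m modifier mentioned earlier; the variable names and the two-line sample input are assumptions of mine.

```php
<?php
// Lookahead shared by rules 1 and 2: an odd number of quotes left on the line.
$oddAhead = <<<'RE'
(?=(?:(?:[^'\r\n]*'){2})*[^'\r\n]*'[^'\r\n]*(?:\r?\n|$))
RE;

// Rule 1 | Rule 2 | Rule 3, with the m modifier so ^ and $ work per line.
$pattern = "~'" . $oddAhead
         . "|\\G(?<!^)[^']" . $oddAhead
         . "|\\G(?<!^)'~m";

// Hypothetical two-line sample with balanced quotes on each line.
$input  = "a 'bcd' efg 'hij'\nx 'y' z";
$masked = preg_replace($pattern, '-', $input);

echo $masked, PHP_EOL;
// a ----- efg -----
// x --- z
```

    The sample keeps the quotes balanced on every line; lines with unbalanced quotes are outside what this sketch checks.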
    


  • 2020-12-10 05:09

    Yes, you can do it (assuming that the quotes are balanced). Example:

    $str = "TEST 'replace''me' ok 'me too'";
    $pattern = "~[^'](?=[^']*(?:'[^']*'[^']*)*+'[^']*\z)|'~";    
    $result = preg_replace($pattern, '-', $str);
    

    The idea is: you can replace a character if it is a quote or if it is followed by an odd number of quotes.
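
    Working through the sample string by hand suggests the result shown below. Treat this as a sketch for checking the idea rather than a reference run.

```php
<?php
$str = "TEST 'replace''me' ok 'me too'";
$pattern = "~[^'](?=[^']*(?:'[^']*'[^']*)*+'[^']*\z)|'~";
$result = preg_replace($pattern, '-', $str);

// Quotes and everything between them are masked. "TEST " and " ok " are
// followed by an even number of quotes, so they are left alone.
echo $result, PHP_EOL; // TEST ------------- ok --------
```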

    To mask only the content and leave the quotes themselves untouched:

    $pattern = "~(?:(?!\A)\G|(?:(?!\G)|\A)'\K)[^']~";
    $result = preg_replace($pattern, '-', $str);
    

    The pattern matches a character only when it is contiguous to the previous match (in other words, when it comes immediately after the last match) or when it is preceded by a quote that is not contiguous to the previous match.

    \G is the position after the last match (at the beginning it is the start of the string)

    pattern details:

    ~             # pattern delimiter
    
    (?: # non-capturing group: describes the two possibilities
        # before the target character
    
        (?!\A)\G  # at the position in the string after the last match;
                  # the negative lookahead ensures that this is not the start
                  # of the string
    
      |           # OR
    
        (?:       # (to ensure that the quote is not a closing quote)
            (?!\G)   # not contiguous to a previous match
          |          # OR
            \A       # at the start of the string
        )
        '         # the opening quote
    
        \K        # remove all preceding characters from the match result
                  # (only one quote here)
    )
    
    [^']          # a character that is not a quote
    
    ~
    

    Note that since the closing quote is not matched by the pattern, the non-quote characters that follow it can't be matched either, because there is no previous match for them to be contiguous to.
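
    Applied to the same sample string as before, this pattern should leave the quotes in place and mask only what is between them. The expected output below comes from tracing the pattern by hand, so take it as a sketch.

```php
<?php
$str = "TEST 'replace''me' ok 'me too'";
$pattern = "~(?:(?!\A)\G|(?:(?!\G)|\A)'\K)[^']~";
$result = preg_replace($pattern, '-', $str);

// Only the characters between the quotes are replaced; \K drops the
// opening quote from each match, and closing quotes are never matched.
echo $result, PHP_EOL; // TEST '-------''--' ok '------'
```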

    EDIT:

    The (*SKIP)(*FAIL) way:

    Instead of testing whether a single quote is not a closing quote with (?:(?!\G)|\A)' as in the previous pattern, you can break the match contiguity on closing quotes using the backtracking control verbs (*SKIP) and (*FAIL) (the latter can be shortened to (*F)).

    $pattern = "~(?:(?!\A)\G|')(?:'(*SKIP)(*F)|\K[^'])~";
    $result = preg_replace($pattern, '-', $str);
    

    Since the pattern fails on each closing quote, the following characters will not be matched until the next opening quote.

    The pattern may be more efficient written like this:

    $pattern = "~(?:\G(?!\A)(?:'(*SKIP)(*F))?|'\K)[^']~";
    

    (You can also use (*PRUNE) in place of (*SKIP).)
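
    A small sketch to compare the two (*SKIP) variants on the sample string used earlier. The $patterns and $results names are mine, and both variants are expected to give the same output as the \K-only pattern.

```php
<?php
$str = "TEST 'replace''me' ok 'me too'";

// The two (*SKIP)(*F) variants from this answer.
$patterns = [
    "~(?:(?!\A)\G|')(?:'(*SKIP)(*F)|\K[^'])~",
    "~(?:\G(?!\A)(?:'(*SKIP)(*F))?|'\K)[^']~",
];

$results = [];
foreach ($patterns as $p) {
    // A closing quote makes the attempt fail and restarts after the quote,
    // so nothing outside the quotes is contiguous to a previous match.
    $results[] = preg_replace($p, '-', $str);
}

print_r($results); // both entries: TEST '-------''--' ok '------'
```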
