How does this regex replacement reverse a string?

情深已故 2020-12-31 03:37

This is the fourth part in a series of educational regex articles. It shows how the combination of nested reference (see: How does this regex find trian

1 answer
  •  别那么骄傲
    2020-12-31 03:55

    Overview

    At a high level, the pattern matches any one character with ., but additionally performs a grab$2 action, which captures into group 2 the reversal "mate" of the character that was matched. This capture is done by building a suffix of the input string whose length matches the length of the prefix up to the current position. We do this by applying assertSuffix to a pattern that grows the suffix by one character, repeating this once forEachDotBehind. Group 1 captures this suffix. The first character of that suffix, captured in group 2, is the reversal "mate" for the character that was matched.

    Thus, replacing each matched character with its "mate" has the effect of reversing a string.
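The "mate" relationship can be sketched without any regex. The following is a minimal illustration, not the regex itself: it assumes that the mate of the character at index i in a string of length n is the character at index n - (i + 1), which is what the suffix construction described above yields. The names MateSketch and replaceWithMates are hypothetical.

```java
// Hypothetical helper illustrating the "mate" idea with plain indexing.
class MateSketch {
    // Replaces every character with its reversal "mate".
    static String replaceWithMates(String s) {
        int n = s.length();
        StringBuilder out = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            // The prefix ending at index i has i + 1 characters, so the mate
            // is the first character of the suffix of length i + 1.
            out.append(s.charAt(n - (i + 1)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceWithMates("123456789")); // 987654321
    }
}
```

Replacing each character with its mate is the same as reading the string backwards, which is the effect the regex replacement aims for.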


    How it works: a simpler example

    To better understand how the regex pattern works, let's first apply it to a simpler input. Also, for our replacement pattern, we'll just "dump" out all the captured strings so we get a better idea of what's going on. Here's a Java version:

    System.out.println(
        "123456789"
            .replaceAll(REVERSE, "[$0; $1; $2]\n")
    );
    

    The above prints (as seen on ideone.com):

    [1; 9; 9]
    [2; 89; 8]
    [3; 789; 7]
    [4; 6789; 6]
    [5; 56789; 5]
    [6; 456789; 4]
    [7; 3456789; 3]
    [8; 23456789; 2]
    [9; 123456789; 1]
    

    Thus, e.g. [3; 789; 7] means that the dot matched 3 (captured in group 0), the corresponding suffix is 789 (group 1), whose first character is 7 (group 2). Note that 7 is 3's "mate".

                       current position after
                          the dot matched 3
                                  ↓        ________
                          1  2 [3] 4  5  6 (7) 8  9
                          \______/         \______/
                           3 dots        corresponding
                           behind      suffix of length 3
    

    Note that a character's "mate" may be to its right or left. A character may even be its own "mate".
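The dump above can also be reproduced without regex, which may help confirm what each group is expected to hold. This is a sketch under the assumption that, for the character at index i, group 1 is the suffix of length i + 1 and group 2 is the first character of that suffix. DumpSketch and dump are hypothetical names.

```java
// Hypothetical helper that rebuilds the "[$0; $1; $2]" dump by plain indexing.
class DumpSketch {
    static String dump(String s) {
        StringBuilder out = new StringBuilder();
        int n = s.length();
        for (int i = 0; i < n; i++) {
            char matched = s.charAt(i);               // what group 0 would hold
            String suffix = s.substring(n - (i + 1)); // what group 1 would hold
            char mate = suffix.charAt(0);             // what group 2 would hold
            out.append('[').append(matched).append("; ")
               .append(suffix).append("; ")
               .append(mate).append("]\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump("123456789"));
    }
}
```

For the middle character of an odd-length input, the matched character and its mate coincide, which is the "own mate" case noted above.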


    How the suffix is built: nested reference

    The pattern responsible for matching and building the growing suffix is the following:

        ((.) \1?)
        |\_/    |
        | 2     |       "suffix := (.) + suffix
        |_______|                    or just (.) if there's no suffix"
            1
    

    Note that within the definition of group 1 is a reference to itself (with \1), though it is optional (with ?). The optional part provides the "base case", a way for the group to match without the reference to itself. This is required because an attempt to match a group reference always fails when the group hasn't captured anything yet.
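The role of the optional part can be checked in isolation. The sketch below assumes java.util.regex behaves as described here, with a reference to a group that has not captured anything failing to match. The simplified patterns (\1a) and (\1?a) are made up for this illustration and are not part of REVERSE; BaseCaseSketch and finds are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical demo of why the self-reference has to be optional.
class BaseCaseSketch {
    // True if the pattern matches somewhere in the input.
    static boolean finds(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Mandatory self-reference: group 1 is still unset, so \1 cannot match.
        System.out.println(finds("(\\1a)", "a"));
        // Optional self-reference: \1? is skipped, giving the group its base case.
        System.out.println(finds("(\\1?a)", "a"));
    }
}
```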

    Once group 1 captures something, the optional part is never exercised in our setup, since the suffix that we just captured last time will still be there this time, and we can always prepend another character to the beginning of this suffix with (.). This prepended character is captured into group 2.

    Thus this pattern attempts to grow the suffix by one dot. Repeating this once forEachDotBehind will therefore result in a suffix whose length is exactly the length of the prefix up to our current position.
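The growth step can be mimicked procedurally. The sketch below is only an analogy for the recurrence in the diagram (suffix := (.) + suffix), not a description of how a regex engine executes the pattern. SuffixGrowthSketch and grow are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical procedural analogy for the growing suffix.
class SuffixGrowthSketch {
    // Grows a suffix of s by one character per "dot behind"
    // and records each intermediate suffix.
    static List<String> grow(String s, int dotsBehind) {
        List<String> steps = new ArrayList<>();
        String suffix = ""; // nothing captured yet: the base case
        for (int k = 1; k <= dotsBehind; k++) {
            char prepended = s.charAt(s.length() - k); // plays the role of group 2
            suffix = prepended + suffix;               // plays the role of group 1
            steps.add(suffix);
        }
        return steps;
    }

    public static void main(String[] args) {
        // Three dots lie behind the position just after '3' in "123456789".
        System.out.println(grow("123456789", 3)); // [9, 89, 789]
    }
}
```

After the last step, the most recently prepended character is the mate: 7 for the character 3 in this input.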


    How assertSuffix and forEachDotBehind work: meta-pattern abstractions

    Note that so far we've treated assertSuffix and forEachDotBehind as black boxes. In fact, leaving this discussion for last is a deliberate act: the names and the brief documentation suggest WHAT they do, and this was enough information for us to write and read our REVERSE pattern!

    Upon closer inspection, we see that the Java and C# implementations of these abstractions slightly differ. This is due to the differences between the two regex engines.

    The .NET regex engine allows full regular expressions in a lookbehind, so these meta-patterns look a lot more natural in that flavor.

    • AssertSuffix(pattern) := (?=.*$(?<=pattern)), i.e. we use a lookahead to go all the way to the end of the string, then use a nested lookbehind to match the pattern against a suffix.
    • ForEachDotBehind(assertion) := (?<=(?:.assertion)*), i.e. we simply match .* in a lookbehind, tagging the assertion along with the dot inside a non-capturing group.

    Since Java doesn't officially support infinite-length lookbehind (but it works anyway under certain circumstances), its counterpart is a bit more awkward:

    • assertSuffix(pattern) := (?<=(?=^.*?pattern$).*), i.e. we use a lookbehind to go all the way to the beginning of the string, then use a nested lookahead to match the entire string, prepending the suffix pattern with .*? to reluctantly match some irrelevant prefix.
    • forEachDotBehind(assertion) := (?<=^(?:.assertion)*?), i.e. we use an anchored lookbehind with reluctant repetition, i.e. ^.*? (and likewise tagging the assertion along with the dot inside a non-capturing group).
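As a sketch of how these Java meta-patterns might be wired together, the snippet below performs plain string substitution and prints the assembled pattern text. The two definitions are copied from the bullets above. The top-level template "(?sx) . grab$2" is an assumption based on the overview (a dot plus a grab$2 action, with the spaces in ((.) \1?) suggesting free-spacing mode), and the class name MetaPatternSketch is hypothetical. The pattern is deliberately not compiled here, since whether the lookbehinds are accepted depends on the Java version.

```java
// Hypothetical assembly of the pattern text from the meta-pattern definitions.
class MetaPatternSketch {
    // assertSuffix(pattern) := (?<=(?=^.*?pattern$).*)
    static String assertSuffix(String pattern) {
        return "(?<=(?=^.*?pattern$).*)".replace("pattern", pattern);
    }

    // forEachDotBehind(assertion) := (?<=^(?:.assertion)*?)
    static String forEachDotBehind(String assertion) {
        return "(?<=^(?:.assertion)*?)".replace("assertion", assertion);
    }

    // Assumed top-level template: match one character, then "grab" its mate.
    static String reversePattern() {
        return "(?sx) . grab$2".replace(
            "grab$2", forEachDotBehind(assertSuffix("((.) \\1?)")));
    }

    public static void main(String[] args) {
        System.out.println(reversePattern());
    }
}
```

String.replace with string arguments substitutes literally, so the $ and \ characters in the fragments need no extra escaping at this stage.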

    It should be noted that while the C# implementation of these meta-patterns doesn't work in Java, the Java implementation DOES work in C# (see on ideone.com). Thus, there is no actual need to have different implementations for C# and Java, but the C# implementation deliberately took advantage of the more powerful .NET regex engine lookbehind support to express the patterns more naturally.

    We have thus shown the benefits of using meta-pattern abstractions:

    • We can independently develop, examine, test, optimize, etc. these meta-pattern implementations, perhaps taking advantage of flavor-specific features for extra performance and/or readability.
    • Once these building blocks are developed and well-tested, we can simply use them as parts of a bigger pattern, which allows us to express ideas at higher levels for more readable, more maintainable, more portable solutions.
    • Meta-patterns promote reuse, and programmatic generation means there's less duplication.

    While this particular manifestation of the concept is rather primitive, it's also possible to take this further and develop a more robust programmatic pattern generation framework, with a library of well-tested and optimized meta-patterns.

    See also

    • Martin Fowler - Composed Regex
    • .NET regular expressions - Balancing group definition - a great example of a meta-pattern!

    Closing thoughts

    It needs to be reiterated that reversing a string with regex is NOT a good idea in practice. It's way more complicated than necessary, and the performance is quite poor.
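For comparison, the conventional approach in Java is a single library call. The sketch below uses the standard StringBuilder.reverse() method and is shown only to make the contrast concrete; PlainReverse is a hypothetical class name.

```java
// Hypothetical wrapper around the standard library reversal.
class PlainReverse {
    static String reverse(String s) {
        // StringBuilder.reverse() reverses the character sequence in place.
        return new StringBuilder(s).reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(reverse("123456789")); // 987654321
    }
}
```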

    That said, this article shows that it CAN in fact be done, and that when expressed at higher levels using meta-pattern abstractions, the solution is in fact quite readable. As a key component of the solution, the nested reference is showcased once again in what is hopefully another engaging example.

    Less tangibly, perhaps the article also shows the determination required to solve a problem that may seem difficult (or even "impossible") at first. Perhaps it also shows the clarity of thought that comes with a deeper understanding of a subject matter, a result of numerous studies and hard work.

    No doubt regex can be an intimidating subject, and certainly it's not designed to solve all of your problems. This is no excuse for hateful ignorance, however, and this is one surprisingly deep well of knowledge if you're willing to learn.
