Using regex to match string between two strings while excluding strings

隐瞒了意图╮ 2020-12-06 19:41

Following on from a previous question in which I asked:

How can I use a regular expression to match text that is between two strings, where those two

  • 2020-12-06 20:23

    You can replace .*? with


    This is a solution in "pure" regex; the language you are using might allow you to use some more elegant construct.

  • 2020-12-06 20:27

    Tola, resurrecting this question because it had a fairly simple regex solution that wasn't mentioned. This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."

    The idea is to build an alternation (a series of |) where the left sides match what we don't want in order to get it out of the way... then the last side of the | matches what we do want, and captures it to Group 1. If Group 1 is set, you retrieve it and you have a match.

    So what do we not want?

    First, we want to eliminate the whole outer block if "unwanted" appears between outer-start and inner-start. You can do it with:

    outer-start(?:(?!inner-start).)*?unwanted.*?outer-end

    This will be to the left of the first |. It matches a whole outer block.

    Second, we want to eliminate the whole outer block if "unwanted" appears between inner-end and outer-end. You can do it with:


    This will be the middle |. It looks a bit complicated because we want to make sure that the "lazy" *? does not jump over the end of a block into a different block.

    Third, we match and capture what we want. This is:


    So the whole regex, in free-spacing mode, is:

    outer-start(?:(?!inner-start).)*?unwanted.*?outer-end # don't want this
    | # OR (also don't want that)
    | # OR capture what we want

    On this demo, look at the Group 1 captures on the right: It contains what we want, and only for the right block.
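    As a rough illustration of the alternation-and-Group-1 idea, here is a Python sketch. Only the first branch is taken from the answer as it appears here; the second and third branches, the sample text, and the names `text`, `pattern`, and `wanted` are assumptions made for the example, so treat it as one possible reading rather than the answer's exact regex.

```python
import re

# Assumed sample data: four outer blocks, one per line.
text = "\n".join([
    "outer-start unwanted inner-start A inner-end outer-end",
    "outer-start inner-start B inner-end unwanted outer-end",
    "outer-start inner-start C inner-end outer-end",
    "outer-start inner-start D unwanted inner-end outer-end",
])

pattern = re.compile(r"""
    # Branch 1 (from the answer): "unwanted" before inner-start.
    outer-start(?:(?!inner-start).)*?unwanted.*?outer-end
  |
    # Branch 2 (assumed): "unwanted" after inner-end; the tempered
    # dots refuse to step over outer-end into the next block.
    outer-start(?:(?!outer-end).)*?inner-end
        (?:(?!outer-end).)*?unwanted.*?outer-end
  |
    # Branch 3 (assumed): what we do want, captured in Group 1.
    inner-start((?:(?!inner-end).)*)inner-end
""", re.VERBOSE | re.DOTALL)

# Matches from branches 1 and 2 leave Group 1 unset, so they are ignored.
wanted = [m.group(1).strip()
          for m in pattern.finditer(text)
          if m.group(1) is not None]
print(wanted)
```

    Blocks A and B are consumed whole by the first two branches, so the inner text in them is never offered to the capturing branch.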

    In Perl and PCRE (used for instance in PHP), you don't even have to look at Group 1: you can force the regex to skip the two blocks we don't want. The regex becomes:

    (?: # non-capture group: the things we don't want
    outer-start(?:(?!inner-start).)*?unwanted.*?outer-end # don't want this
    | # OR (also don't want that)
    (*SKIP)(*F) # we don't want this, so fail and skip
    | # OR capture what we want

    See demo: it directly matches what you want.

    The technique is explained in full detail in the question and article below.


    • How to match (or replace) a pattern except in situations s1, s2, s3...
    • Article about matching a pattern unless...
  • 2020-12-06 20:29

    Try replacing the last .*? with: (?!(.*unwanted text.*))

    Did it work?

  • 2020-12-06 20:38

    You can't easily do that with plain regexes, but some systems such as Perl have extensions that make it easier. One way is to use a negative look-ahead assertion:


    The key is to split up the "unwanted" into ("u" not followed by "nwanted") or (not "u"). That allows the pattern to advance, but will still find and reject all "unwanted" strings.
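    The split described above can be sketched in Python as follows. The delimiters `start` and `end` and the helper name `between` are placeholders assumed for this example, not taken from the question.

```python
import re

# Each step consumes either a "u" that is not followed by "nwanted",
# or any single character other than "u". The word "unwanted" can
# therefore never be crossed on the way from "start" to "end".
pattern = re.compile(r"start((?:u(?!nwanted)|[^u])*?)end")

def between(s):
    """Return the text between the delimiters, or None if there is no match."""
    m = pattern.search(s)
    return m.group(1) if m else None

print(between("start useful text end"))
print(between("start some unwanted text end"))
```

    Other words that begin with "u" still pass, because the lookahead only rejects a "u" that starts the full word "unwanted".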

    People may start hating your code if you do much of this though. ;)

  • 2020-12-06 20:39

    Replace the first and last (but not the middle) .*? with (?:(?!unwanted).)*?. (Where (?:...) is a non-capturing group, and (?!...) is a negative lookahead.)
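    A minimal Python sketch of that replacement, assuming the original pattern had the shape outer-start.*?inner-start(.*?)inner-end.*?outer-end and that each block sits on its own line (both are assumptions, as are the sample lines and variable names):

```python
import re

# Assumed sample data: one outer block per line.
lines = [
    "outer-start unwanted inner-start A inner-end outer-end",
    "outer-start inner-start B inner-end unwanted outer-end",
    "outer-start inner-start C inner-end outer-end",
    "outer-start inner-start D unwanted inner-end outer-end",
]

# First and last .*? replaced by a tempered dot that cannot step
# over "unwanted"; the middle (.*?) is left alone.
pattern = re.compile(
    r"outer-start(?:(?!unwanted).)*?inner-start"
    r"(.*?)"
    r"inner-end(?:(?!unwanted).)*?outer-end"
)

# Searching line by line keeps the lazy middle group from reaching
# into a neighbouring block.
matches = [m.group(1).strip()
           for line in lines
           for m in pattern.finditer(line)]
print(matches)
```

    Run over the whole text at once with the dot matching newlines, the middle (.*?) could stretch from block B's inner-start to block C's inner-end, which is one of the corner cases to watch for.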

    However, this quickly runs into corner cases and caveats in any real (rather than example) use. If you ask about what you're really doing, with real examples (even simplified ones) instead of made-up ones, you'll likely get better answers.

  • 2020-12-06 20:41

    A better question to ask yourself than "how do I do this with regular expressions?" is "how do I solve this problem?". In other words, don't get hung up on trying to solve a big problem with regular expressions. If you can solve half the problem with regular expressions, do so, then solve the other half with another regular expression or some other technique.

    For example, make a pass over your data getting all matches, ignoring the unwanted text (read: get results both with and without the unwanted text). Then, make a pass over the reduced set of data and weed out those results that have the unwanted text. This sort of solution is easier to write, easier to understand and easier to maintain over time. And for any problem you're likely to need to solve with this approach, it will be fast enough.
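    The two-pass idea might look like this in Python. It is only a sketch: the sample text, the simple block pattern, and the names `block`, `candidates`, and `kept` are assumptions, and it presumes every block contains all four delimiters in order.

```python
import re

# Assumed sample data: four well-formed outer blocks.
text = "\n".join([
    "outer-start unwanted inner-start A inner-end outer-end",
    "outer-start inner-start B inner-end unwanted outer-end",
    "outer-start inner-start C inner-end outer-end",
    "outer-start inner-start D unwanted inner-end outer-end",
])

# Pass 1: grab every block without worrying about "unwanted".
# Each result is a (before, inner, after) tuple.
block = re.compile(
    r"outer-start(.*?)inner-start(.*?)inner-end(.*?)outer-end",
    re.DOTALL,
)
candidates = block.findall(text)

# Pass 2: plain string checks weed out blocks whose outer parts
# contain the unwanted text.
kept = [inner.strip()
        for before, inner, after in candidates
        if "unwanted" not in before and "unwanted" not in after]
print(kept)
```

    Keeping the filter outside the regex means the rule for what counts as unwanted can change without touching the pattern.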
