Need regexp to find substring between two tokens

问题

I suspect this has already been answered somewhere, but I can't find it, so...

I need to extract a string from between two tokens in a larger string, in which the second token will probably appear again meaning... (pseudo code...)

myString = "A=abc;B=def_3%^123+-;C=123;"  ;

myB = getInnerString(myString, "B=", ";" )  ;

method getInnerString(inStr, startToken, endToken){
   return inStr.replace( EXPRESSION, "$1");
}

so, when I run this using expression ".+B=(.+);.+" I get "def_3%^123+-;C=123;" presumably because it just looks for the LAST instance of ';' in the string, rather than stopping at the first one it comes to.

I've tried using (?=) in search of that first ';' but it gives me the same result.

I can't seem to find a regExp reference that explains how one can specify the "NEXT" token rather than the one at the end.

any and all help greatly appreciated.

回答1:

You're using a greedy pattern by not specifying the ? in it. Try this:

".+B=(.+?);.+"

回答2:

Try this:

B=([^;]+);

This matches everything between B= and ; unless it is a ;. So it matches everything between B= and the first ; thereafter.

回答3:

(This is a continuation of the conversation from the comments to Evan's answer.)

Here's what happens when your (corrected) regex is applied: First, the .+ matches the whole string. Then it backtracks, giving up most of the characters it just matched until it gets to the point where the B= can match. Then the (.+?) matches (and captures) everything it sees until the next part, the semicolon, can match. Then the final .+ gobbles up the remaining characters.

All you're really interested in is the "B=" and the ";" and whatever's between them, so why match the rest of the string? The only reason you have to do that is so you can replace the whole string with the contents of the capturing group. But why bother doing that if you can access contents of the group directly? Here's a demonstration (in Java, because I can't tell what language you're using):

String s = "A=abc;B=def_3%^123+-;C=123;";

Pattern p = Pattern.compile("B=(.*?);");
Matcher m = p.matcher(s);
if (m.find())
{
  System.out.println(m.group(1));
}

Why do a 'replace' when a 'find' is so much more straightforward? Probably because your API makes it easier; that's why we do it in Java. Java has several regex-oriented convenience methods in its String class: replaceAll(), replaceFirst(), split(), and matches() (which returns true iff the regex matches the whole string), but not find(). And there's no convenience method for accessing capturing groups, either. We can't match the elegance of Perl one-liners like this:

print $1 if 'A=abc;B=def_3%^123+-;C=123;' =~ /B=(.*?);/;

...so we content ourselves with hacks like this:

System.out.println("A=abc;B=def_3%^123+-;C=123;"
    .replaceFirst(".+B=(.*?);.+", "$1"));

Just to be clear, I'm not saying not to use these hacks, or that there's anything wrong with Evan's answer--there isn't. I just think we should understand why we use them, and what trade-offs we're making when we do.

来源：https://stackoverflow.com/questions/489567/need-regexp-to-find-substring-between-two-tokens

标签

regex

regex-greedy