Need regexp to find substring between two tokens

断了今生、忘了曾经 posted on 2019-12-20 01:36:12

Question


I suspect this has already been answered somewhere, but I can't find it, so...

I need to extract the string between two tokens in a larger string, where the second token will probably appear again later in that string. In pseudocode:

myString = "A=abc;B=def_3%^123+-;C=123;"  ;

myB = getInnerString(myString, "B=", ";" )  ;

method getInnerString(inStr, startToken, endToken){
   return inStr.replace( EXPRESSION, "$1");
}

So, when I run this using the expression ".+B=(.+);.+" I get "def_3%^123+-;C=123;", presumably because it looks for the LAST instance of ';' in the string rather than stopping at the first one it comes to.

I've tried using a lookahead, (?=), to find that first ';', but it gives me the same result.

I can't seem to find a regex reference that explains how to specify the NEXT occurrence of a token rather than the one at the end.

Any and all help greatly appreciated.



Answer 1:


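Your pattern is greedy because the quantifier inside the group has no ? after it, so (.+) grabs as much as it can. Adding the ? makes it lazy, so it stops at the first ; that lets the rest of the pattern match. Try this: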

".+B=(.+?);.+" 

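To see the effect of the ? in isolation, here is a minimal Java sketch (Java only because it is the language used elsewhere in this thread; the class name GreedyVsLazy and the helper firstGroup are made up for illustration). It leaves out the surrounding .+ parts and uses find(), so the quantifier inside the group is the only thing that differs between the two calls.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Returns the text captured by group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "A=abc;B=def_3%^123+-;C=123;";

        // Greedy: (.+) runs to the end, then backs off only as far as the LAST ';'.
        System.out.println(firstGroup("B=(.+);", s));   // prints def_3%^123+-;C=123

        // Lazy: (.+?) grows one character at a time and stops at the FIRST ';'.
        System.out.println(firstGroup("B=(.+?);", s));  // prints def_3%^123+-
    }
}
```

The lazy form still requires at least one character between the tokens; (.*?) would also accept an empty value.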


Answer 2:


Try this:

B=([^;]+);

[^;] matches any character except a semicolon, so ([^;]+) captures everything after B= up to, but not including, the first ; that follows.
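As a rough illustration, the same pattern could be used from Java like this (the class NegatedClassDemo and the method valueOfB are hypothetical names, not part of any library). The second call shows one consequence of the + quantifier: an empty value between the tokens does not match at all.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegatedClassDemo {
    // The negated class [^;] can never run past a semicolon,
    // so no lazy quantifier is needed to stop at the first one.
    private static final Pattern B_VALUE = Pattern.compile("B=([^;]+);");

    // Returns the value between "B=" and the next ';', or null if there is no match.
    static String valueOfB(String input) {
        Matcher m = B_VALUE.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(valueOfB("A=abc;B=def_3%^123+-;C=123;")); // prints def_3%^123+-
        System.out.println(valueOfB("A=abc;B=;C=123;"));             // prints null ("+" needs one character)
    }
}
```

If empty values should be accepted, [^;]* in place of [^;]+ would return an empty string for the second input.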




Answer 3:


(This is a continuation of the conversation from the comments to Evan's answer.)

Here's what happens when your (corrected) regex is applied: First, the .+ matches the whole string. Then it backtracks, giving up most of the characters it just matched until it gets to the point where the B= can match. Then the (.+?) matches (and captures) everything it sees until the next part, the semicolon, can match. Then the final .+ gobbles up the remaining characters.
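The walk-through above can be checked by printing the character ranges that the whole match and the capturing group cover. This is only a sketch; the class WholeMatchVsGroup and the method describe are invented for the illustration. With the .+ on both sides the overall match spans the entire 27-character string, while a pattern without them covers just the B=...; part, yet group 1 is the same in both cases.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeMatchVsGroup {
    // Reports the [start,end) offsets of the whole match and of group 1.
    static String describe(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return "no match";
        }
        return "whole=[" + m.start() + "," + m.end() + ")"
             + " group1=[" + m.start(1) + "," + m.end(1) + ")"
             + " -> " + m.group(1);
    }

    public static void main(String[] args) {
        String s = "A=abc;B=def_3%^123+-;C=123;";

        // Leading and trailing .+ consume everything outside the part we want.
        System.out.println(describe(".+B=(.+?);.+", s));
        // prints whole=[0,27) group1=[8,20) -> def_3%^123+-

        // Without them, the match is limited to "B=def_3%^123+-;".
        System.out.println(describe("B=(.*?);", s));
        // prints whole=[6,21) group1=[8,20) -> def_3%^123+-
    }
}
```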

All you're really interested in is the "B=" and the ";" and whatever's between them, so why match the rest of the string? The only reason you have to do that is so you can replace the whole string with the contents of the capturing group. But why bother doing that if you can access contents of the group directly? Here's a demonstration (in Java, because I can't tell what language you're using):

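import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "A=abc;B=def_3%^123+-;C=123;";

// Match only "B=", the value, and the next ";"; group 1 is the value.
Pattern p = Pattern.compile("B=(.*?);");
Matcher m = p.matcher(s);
if (m.find())
{
  System.out.println(m.group(1));  // prints def_3%^123+-
}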

Why do a 'replace' when a 'find' is so much more straightforward? Probably because your API makes it easier; that's why we do it in Java. Java has several regex-oriented convenience methods in its String class: replaceAll(), replaceFirst(), split(), and matches() (which returns true iff the regex matches the whole string), but not find(). And there's no convenience method for accessing capturing groups, either. We can't match the elegance of Perl one-liners like this:

print $1 if 'A=abc;B=def_3%^123+-;C=123;' =~ /B=(.*?);/;

...so we content ourselves with hacks like this:

System.out.println("A=abc;B=def_3%^123+-;C=123;"
    .replaceFirst(".+B=(.*?);.+", "$1"));

Just to be clear, I'm not saying not to use these hacks, or that there's anything wrong with Evan's answer--there isn't. I just think we should understand why we use them, and what trade-offs we're making when we do.
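For completeness, the getInnerString method from the question's pseudocode could be sketched in Java with find() rather than replace. This is one possible shape, not the asker's actual code: the class name InnerString and the Optional return type are choices made here, and Pattern.quote is used on the assumption that the tokens are literal text rather than regex fragments.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InnerString {
    // Returns the text between the first startToken and the next endToken after it,
    // or Optional.empty() if that pair does not occur.
    static Optional<String> getInnerString(String inStr, String startToken, String endToken) {
        // Pattern.quote makes tokens such as "[" or "+" match literally.
        Pattern p = Pattern.compile(
                Pattern.quote(startToken) + "(.*?)" + Pattern.quote(endToken));
        Matcher m = p.matcher(inStr);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String myString = "A=abc;B=def_3%^123+-;C=123;";
        System.out.println(getInnerString(myString, "B=", ";").orElse("(not found)"));
        // prints def_3%^123+-
        System.out.println(getInnerString(myString, "D=", ";").orElse("(not found)"));
        // prints (not found)
    }
}
```

By default . does not match line terminators, so a value that spans several lines would need the Pattern.DOTALL flag passed to Pattern.compile.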



Source: https://stackoverflow.com/questions/489567/need-regexp-to-find-substring-between-two-tokens
