Java Regex - capture string with single dollar, but not when it has two successive ones

一笑奈何 submitted on 2020-01-02 08:16:22

Question


I posted this question earlier.

But that wasn't quite the end of it. All the rules that applied there still apply.

So the strings:

  • "%ABC%" would yield ABC as a result (capture stuff between percent signs)
  • as would "$ABC." (capture stuff after $, giving up when another dollar or dot appears)
  • "$ABC$XYZ" would too, and also give XYZ as a result.

To add a bit more to this:

  • "${ABC}" should yield ABC too. (ignore curly braces if present - non capture chars perhaps?).
  • If you have two successive dollar signs, such as "$$EFG" or "$${EFG}",
    that should not appear in the regex result. (This is where either numbered or named back-references come into play, and the reason I contemplated treating them as non-capturing groups.) As I understand it, a group becomes a non-capturing group with the syntax (?:...).

1) Can I say the % or $ is a non-capture group and reference that by number? Or do only capture groups get allocated numbers?

2) What is the order of the numbering if you have ((A) (B) (C))? Is the outer group 1, A 2, B 3, C 4?

I have been looking at named groups, and saw this syntax mentioned here:

(?<name>capturing text) to define a named group "name"

\k<name> to backreference a named group "name"

3) Not sure if a non-capture group can be named in Java? Can someone elucidate?

  • More info here on non capture groups.
  • More info here on lookbehinds
  • Similar answer to a question here, but didn't quite get me what I wanted. Not sure if there is a back-reference issue in Java.
  • Similar question here. But could not get my head around the working version to apply to this.

I have used the exact same Java I had in my original question, except for:

String search = "/bla/$V_N.$$XYZ.bla";
String pattern = "(?:(?<oc>[%$]))(?!(\\k<oc>))([^%.$]*)+";

This should only result in V_N.

I am really struggling with this one, and wondered if someone can help me work out how to solve this. Thanks.


Answer 1:


You can write a slightly more verbose regex with multiple capturing groups and either grab only the group that is not null, or simply concatenate the group values, since exactly one of them is set on each match:

%([^%.]+)%|(?<!\$)\$(?:\{([^{}]+)\}|([^$.]+))

See the regex demo.

Details

  • %([^%.]+)% - %, Group 1: one or more chars other than % and ., then a % is consumed
  • | - or
  • (?<!\$) - a negative lookbehind that matches a location in the string that is not immediately preceded by $
  • \$ - a $
  • (?: - start of the non-capturing container group matching either of:
    • \{([^{}]+)\} - {, Group 2: any one or more chars other than { and }, then } is consumed
    • | - or
    • ([^$.]+) - Group 3: 1 or more chars other than $ and .
  • ) - end of the non-capturing container group.
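
The group numbers in the breakdown above follow the order of the opening parentheses of capturing groups, and a (?:...) container is not counted. The following is a minimal sketch to check this, also covering the nested ((A)(B)(C)) case raised in the question; the class and method names are hypothetical and chosen only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberingDemo {
    // The pattern from above, written as a Java string literal (backslashes doubled).
    static final Pattern P = Pattern.compile(
        "%([^%.]+)%|(?<!\\$)\\$(?:\\{([^\\{}]+)\\}|([^$.]+))");

    // Hypothetical helper: number of the capturing group that took part in the
    // first match, or 0 if the input does not match at all.
    static int matchedGroup(String input) {
        Matcher m = P.matcher(input);
        if (!m.find()) {
            return 0;
        }
        for (int i = 1; i <= m.groupCount(); i++) {
            if (m.group(i) != null) {
                return i;
            }
        }
        return 0;
    }

    // Hypothetical helper: groups 1..4 of ((A)(B)(C)) for a fully matching input.
    static String[] nestedGroups(String input) {
        Matcher m = Pattern.compile("((A)(B)(C))").matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        // Only capturing groups are counted; the (?:...) container gets no number.
        System.out.println(P.matcher("").groupCount()); // 3
        System.out.println(matchedGroup("%ABC%"));      // 1
        System.out.println(matchedGroup("${ABC}"));     // 2
        System.out.println(matchedGroup("$ABC."));      // 3
        System.out.println(matchedGroup("$$EFG"));      // 0 (no match)
        // Outer group is 1, then A, B, C are 2, 3, 4.
        System.out.println(String.join(",", nestedGroups("ABC"))); // ABC,A,B,C
    }
}
```

Since a non-capturing group has no number, it cannot be back-referenced; anything that needs to be referenced later has to be a capturing (numbered or named) group.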

Java usage:

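import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Note: compared to the pattern above, Group 3 here also excludes whitespace (\\s).
String regex = "%([^%.]+)%|(?<!\\$)\\$(?:\\{([^\\{}]+)\\}|([^$.\\s]+))";
String string = "%ABC%\n$ABC.\n$ABC$XYZ  ${ABC}\n\n$$EFG $${EFG}.";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
Matcher m = pattern.matcher(string);
List<String> results = new ArrayList<>();
while (m.find()) {
    // Exactly one group is non-null per match, so concatenating yields its value.
    results.add(Objects.toString(m.group(1),"") + 
        Objects.toString(m.group(2),"") + 
        Objects.toString(m.group(3),""));
}
System.out.println(results); // => [ABC, ABC, ABC, XYZ, ABC]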
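
The code above concatenates the three groups. The other option mentioned earlier, grabbing only the group that is not null, could look like the sketch below. It is applied to the search string from the question; the class name and the extract method are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstNonNullGroup {
    // Same regex as in the Java usage above.
    static final Pattern P = Pattern.compile(
        "%([^%.]+)%|(?<!\\$)\\$(?:\\{([^\\{}]+)\\}|([^$.\\s]+))");

    // Hypothetical helper: for every match, keep the one group that participated.
    static List<String> extract(String input) {
        List<String> results = new ArrayList<>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                String value = m.group(i);
                if (value != null) {
                    results.add(value);
                    break; // at most one group is set per match
                }
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // $V_N is captured; $$XYZ is skipped because of the doubled dollar sign.
        System.out.println(extract("/bla/$V_N.$$XYZ.bla")); // [V_N]
    }
}
```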

Mind that in regular Java string literals, \ should be escaped (i.e. \\) to introduce a single literal backslash that is used as part of regex escapes.
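
As a small illustration of the escaping rule, the sketch below compares the Java literal "\\$" with an unescaped "$". The class and method names are hypothetical; Pattern.quote is shown only as one alternative way to get a literal match.

```java
import java.util.regex.Pattern;

class BackslashEscaping {
    // The Java literal "\\$" holds two characters: a backslash and a dollar sign.
    static final String ESCAPED_DOLLAR = "\\$";

    // Hypothetical helper: true if the given regex finds a match in the input.
    static boolean finds(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(ESCAPED_DOLLAR.length());          // 2
        System.out.println(ESCAPED_DOLLAR);                   // \$
        System.out.println(finds(ESCAPED_DOLLAR, "a$b"));     // true
        System.out.println(finds(ESCAPED_DOLLAR, "ab"));      // false
        // Unescaped, $ is an end-of-input anchor, so it "matches" even without a dollar sign.
        System.out.println(finds("$", "ab"));                 // true
        // Pattern.quote wraps the text so that it is matched literally.
        System.out.println(finds(Pattern.quote("$"), "a$b")); // true
        System.out.println(finds(Pattern.quote("$"), "ab"));  // false
    }
}
```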



Source: https://stackoverflow.com/questions/58827094/java-regex-capture-string-with-single-dollar-but-not-when-it-has-two-successi
