Question
I am researching how to identify DOIs in free-format text.
I am using Java 8 and regular expressions.
I have found these regexes, which are supposed to fulfil my requirements:
/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i
/^10.1002/[^\s]+$/i
/^10.\d{4}/\d+-\d+X?(\d+)\d+<[\d\w]+:[\d\w]*>\d+.\d+.\w+;\d$/i
/^10.1021/\w\w\d++$/i
/^10.1207/[\w\d]+\&\d+_\d+$/i
The code I am trying is
private static final Pattern pattern_one = Pattern.compile("/^10.\\d{4,9}/[-._;()/:A-Z0-9]+$/i", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern_one.matcher("http://journals.ametsoc.org/doi/full/10.1175/JPO3002.1");
while (matcher.find()) {
System.out.print("Start index: " + matcher.start());
System.out.print(" End index: " + matcher.end() + " ");
System.out.println(matcher.group());
}
However, the matcher doesn't find anything.
Where have I gone wrong?
UPDATE
I have encountered a valid DOI that my set of regexes does not match.
Here's an example DOI: 10.1175/1520-0485(2002)032<0870:CT>2.0.CO;2
Why doesn't this pattern work?
/^10.\d{4}/\d+-\d+X?(\d+)\d+<[\d\w]+:[\d\w]*>\d+.\d+.\w+;\d$/i
Answer 1:
In Java, a regex is written as a String. In other languages, the regex is quoted using /.../, with options like i given after the ending /. So, what is written as /XXX/i will in Java be done like this:
// Using flags parameter
Pattern p = Pattern.compile("XXX", Pattern.CASE_INSENSITIVE);
// Using embedded flags
Pattern p = Pattern.compile("(?i)XXX");
In most languages, regexes are used to find a matching substring. Java can do that too, using the find() method (or any of the many replaceXxx() regex methods); however, Java also has the matches() method, which matches against the entire string, eliminating the need for the begin and end boundary matchers ^ and $.
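As a minimal sketch of the difference between find() and matches(), the following uses a simplified version of the first DOI pattern with no ^ or $ and with the literal dot escaped. The class name FindVsMatches and its helper methods are made up for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindVsMatches {
    // Simplified DOI pattern (an assumption for this sketch): no ^ or $ anchors,
    // and the dot after "10" is escaped so it only matches a literal dot.
    static final Pattern DOI =
            Pattern.compile("10\\.\\d{4,9}/[-._;()/:A-Z0-9]+", Pattern.CASE_INSENSITIVE);

    // matches() succeeds only if the ENTIRE input is a DOI.
    static boolean wholeStringIsDoi(String input) {
        return DOI.matcher(input).matches();
    }

    // find() looks for the first substring that matches; returns null if none.
    static String firstDoi(String input) {
        Matcher m = DOI.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String url = "http://journals.ametsoc.org/doi/full/10.1175/JPO3002.1";
        System.out.println(wholeStringIsDoi(url));                 // false
        System.out.println(firstDoi(url));                         // 10.1175/JPO3002.1
        System.out.println(wholeStringIsDoi("10.1175/JPO3002.1")); // true
    }
}
```

Calling matches() on the URL fails because the text before the DOI is not part of the pattern, while find() skips ahead to the first matching substring.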
Anyway, your problem is that the regex has both ^ and $ boundary matchers, which means it will only work if the string is nothing but the text you want to match. Since you actually want to find a substring, remove those matchers.
To search for one of multiple patterns, use the | logical regex operator.
And finally, since a Java regex is given as a String literal, any special characters, most notably \, need to be escaped.
So, to build a single regex that can find substrings matching any of the following:
/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i
/^10.1002/[^\s]+$/i
/^10.\d{4}/\d+-\d+X?(\d+)\d+<[\d\w]+:[\d\w]*>\d+.\d+.\w+;\d$/i
/^10.1021/\w\w\d++$/i
/^10.1207/[\w\d]+\&\d+_\d+$/i
You would write it like this:
String regex = "10.\\d{4,9}/[-._;()/:A-Z0-9]+" +
"|10.1002/[^\\s]+" +
"|10.\\d{4}/\\d+-\\d+X?(\\d+)\\d+<[\\d\\w]+:[\\d\\w]*>\\d+.\\d+.\\w+;\\d" +
"|10.1021/\\w\\w\\d++" +
"|10.1207/[\\w\\d]+\\&\\d+_\\d+";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
String input = "http://journals.ametsoc.org/doi/full/10.1175/JPO3002.1";
Matcher m = p.matcher(input);
while (m.find()) {
System.out.println("Start index: " + m.start() +
" End index: " + m.end() +
" " + m.group());
}
Output
Start index: 37 End index: 54 10.1175/JPO3002.1
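Regarding the UPDATE in the question: in the third pattern, (\d+) is a capturing group, not a pair of literal parentheses, so it cannot match the "(2002)" part of the example DOI. The unescaped dots are also wildcards rather than literal dots. The sketch below compares the pattern as posted with a variant that escapes those characters. The class name DoiEscapeDemo, the constant names, and the simplification of [\d\w] to \w are assumptions made for this illustration, not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DoiEscapeDemo {
    // Third pattern as posted (anchors removed): "(\d+)" is a capturing group.
    static final String UNESCAPED =
            "10.\\d{4}/\\d+-\\d+X?(\\d+)\\d+<[\\d\\w]+:[\\d\\w]*>\\d+.\\d+.\\w+;\\d";

    // Variant with literal parentheses and dots escaped.
    // [\d\w] is shortened to \w, since \w already includes digits.
    static final String ESCAPED =
            "10\\.\\d{4}/\\d+-\\d+X?\\(\\d+\\)\\d+<\\w+:\\w*>\\d+\\.\\d+\\.\\w+;\\d";

    // Returns the first substring matching the regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "see 10.1175/1520-0485(2002)032<0870:CT>2.0.CO;2 for details";
        System.out.println(firstMatch(UNESCAPED, text)); // null
        System.out.println(firstMatch(ESCAPED, text));   // 10.1175/1520-0485(2002)032<0870:CT>2.0.CO;2
    }
}
```

The same escaping would apply to that alternative inside the combined regex if DOIs of this form need to be matched in full.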
Answer 2:
Your pattern looks incorrect to me. You are currently using this:
/^10.\\d{4,9}/[-._;()/:A-Z0-9]+$/i
But I think you intend to use this:
^.*/10\\.\\d{4,9}/[-._;()/:A-Z0-9]+$
Problems with your pattern include that you are using JavaScript regex syntax (or some other language's syntax) with the /.../i delimiters. Also, you were not escaping the literal dot in the regex, and the start-of-pattern marker ^ was out of place, since the DOI does not begin at the start of the URL.
Code:
String pattern = "^.*/10\\.\\d{4,9}/[-._;()/:A-Z0-9]+$";
String url = "http://journals.ametsoc.org/doi/full/10.1175/JPO3002.1";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(url);
if (m.find()) {
System.out.println("Found value: " + m.group(0));
} else {
System.out.println("NO MATCH");
}
Demo here:
Rextester
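Note that because this pattern starts with ^.*, m.group(0) is the whole URL rather than just the DOI. A possible way to pull out only the DOI, sketched here under the assumption that a capturing group is acceptable, is to wrap the DOI part in parentheses and read group(1). The class name DoiGroupDemo and the method extractDoi are hypothetical names for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DoiGroupDemo {
    // Same pattern as above, with the DOI part wrapped in a capturing group.
    static final Pattern URL_WITH_DOI =
            Pattern.compile("^.*/(10\\.\\d{4,9}/[-._;()/:A-Z0-9]+)$");

    // Returns the DOI captured by group 1, or null if the input does not match.
    static String extractDoi(String url) {
        Matcher m = URL_WITH_DOI.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String url = "http://journals.ametsoc.org/doi/full/10.1175/JPO3002.1";
        Matcher m = URL_WITH_DOI.matcher(url);
        if (m.find()) {
            System.out.println("Whole match: " + m.group(0)); // the full URL
            System.out.println("DOI only:    " + m.group(1)); // 10.1175/JPO3002.1
        }
    }
}
```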
Source: https://stackoverflow.com/questions/43683957/whats-the-correct-format-of-java-string-regex-to-identify-doi