I've got a string like
s="abc, 3rncd (23uh, sdfuh), 32h(q23q)89 (as), dwe8h, edt (1,wer,345,rtz,tr t), nope";
and I want to split it into fields at the commas that are not inside parentheses.
Assuming ( and ) are not nested and not escaped, you can split using:
String[] arr = input.split(",(?![^()]*\\))\\s*");
,(?![^()]*\)) matches a comma only if it is NOT followed by a run of non-parenthesis characters and then a ), which means commas between ( and ) are ignored. The trailing \s* also consumes any whitespace after the comma.
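To see the lookahead split end to end, here is a minimal sketch run against the string from the question. The class and method names (SplitOutsideParens, split) are just placeholders for illustration, and it relies on the same assumption as above: parentheses are balanced and not nested.

```java
import java.util.Arrays;
import java.util.List;

class SplitOutsideParens {
    // Split on commas that are not followed by ")" before any other parenthesis.
    // Assumes parentheses are balanced and not nested.
    static List<String> split(String input) {
        return Arrays.asList(input.split(",(?![^()]*\\))\\s*"));
    }

    public static void main(String[] args) {
        String s = "abc, 3rncd (23uh, sdfuh), 32h(q23q)89 (as), dwe8h, "
                 + "edt (1,wer,345,rtz,tr t), nope";
        // Print one field per line.
        for (String field : split(s)) {
            System.out.println(field);
        }
    }
}
```

For the sample string this should yield six fields, with the parenthesised groups left intact.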
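This also works for the sample string, but only because every comma inside parentheses there is either not followed by whitespace or is followed by a single word and then ). It is less general than the lookahead above.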
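public class SplitDemo { // class name is arbitrary
    public static void main(String[] args) {
        String s = "abc, 3rncd (23uh, sdfuh), 32h(q23q)89 (as), dwe8h, edt (1,wer,345,rtz,tr t), nope";
        // Split on a comma plus one whitespace character, unless the next word is immediately followed by ")".
        String[] arr = s.split(",\\s(?!\\w+\\))");
        for (String str : arr) {
            System.out.println(str);
        }
    }
}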
Output:
abc
3rncd (23uh, sdfuh)
32h(q23q)89 (as)
dwe8h
edt (1,wer,345,rtz,tr t)
nope
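A caveat on the ,\s(?!\w+\)) pattern: it only looks one word ahead, so it can split inside parentheses when the text after the comma is more than a single word. The sketch below compares it with the ,(?![^()]*\))\s* pattern on a made-up input; the class name, method names and the test string are all hypothetical and not from the original posts.

```java
import java.util.Arrays;
import java.util.List;

class FragileSplit {
    // Pattern that checks only the single word after ", ".
    static List<String> wordSplit(String input) {
        return Arrays.asList(input.split(",\\s(?!\\w+\\))"));
    }

    // Pattern that scans ahead to the next parenthesis.
    static List<String> lookaheadSplit(String input) {
        return Arrays.asList(input.split(",(?![^()]*\\))\\s*"));
    }

    public static void main(String[] args) {
        // Hypothetical input: two words follow the comma inside the parentheses.
        String tricky = "a (b c, d e), f";
        System.out.println(wordSplit(tricky));      // splits inside the parentheses
        System.out.println(lookaheadSplit(tricky)); // keeps "(b c, d e)" together
    }
}
```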
FWIW: I wouldn't use the lookahead solution for this.
If you have a lot of commas, the lookahead adds a cost that grows
much faster than linearly (roughly quadratically) with the number of commas.
The reason is that a lookahead used like this can be open-ended.
If there is a possibility that nothing terminates the lookahead,
it's not a good idea, especially on a large sample of data.
Every time the regex finds a comma, it has to run (?![^()]*\))
which looks ahead until it finds a parenthesis or the end of the string.
That means it scans past all the following commas as well.
If you have a string like this: asdf,asdf,asdf,aasdf,aaaasdf,asdf,aasdf,asdf
the progression is:
Match 1: found ,
looked ahead at all of this asdf,asdf,aasdf,aaaasdf,asdf,aasdf,asdf
Match 2: found ,
looked ahead at all of this asdf,aasdf,aaaasdf,asdf,aasdf,asdf
Match 3: found ,
looked ahead at all of this aasdf,aaaasdf,asdf,aasdf,asdf
Match 4: found ,
looked ahead at all of this aaaasdf,asdf,aasdf,asdf
Match 5: found ,
looked ahead at all of this asdf,aasdf,asdf
Match 6: found ,
looked ahead at all of this aasdf,asdf
Match 7: found ,
looked ahead at all of this asdf
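The progression above can be put into numbers with a small sketch. This is a simplified model, not a trace of the real regex engine: it just adds up how many characters follow each comma, assuming the input has no parentheses so every lookahead runs to the end of the string. The names LookaheadCost and scannedChars are made up for the illustration.

```java
class LookaheadCost {
    // Sum, over every comma, of the number of characters after it.
    // Models [^()]* running to the end of a parenthesis-free input.
    static long scannedChars(String input) {
        long total = 0;
        for (int i = 0; i < input.length(); i++) {
            if (input.charAt(i) == ',') {
                total += input.length() - i - 1;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String sample = "asdf,asdf,asdf,aasdf,aaaasdf,asdf,aasdf,asdf";
        // Same string, then the string joined to a copy of itself.
        System.out.println(scannedChars(sample));
        System.out.println(scannedChars(sample + "," + sample));
    }
}
```

Under this model, doubling the input roughly quadruples the total instead of doubling it, which is the growth the lookahead is being criticised for.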
It's a pretty small string to be matching all of that stuff.
It's never good to use a regex like that, for split or any kind of matching.
Instead, I'd just match the field values themselves in a global find, where each match starts where the previous one ended:
"(?:\\A|\\G,\\s*)([^(),]*(?:(?:\\([^()]*\\))[^(),]*)*)"
Here is a simple benchmark that demonstrates the latency
a lookahead like this can cause:
Sample: 260 characters, 49 commas
asdf,asdf,asdf,asdf,asdf,asdf,asdf,
asdf,asdf,asdf,asdf,asdf,asdf,asdf,
asdf,asdf,asdf,asdf,asdf,asdf,asdf,
asdf,asdf,asdf,asdf,asdf,asdf,asdf,
asdf,asdf,asdf,asdf,asdf,asdf,asdf,
asdf,asdf,asdf,asdf,asdf,asdf,asdf,
asdf,asdf,asdf,asdf,asdf,asdf,asdf,
Benchmark
Regex1: (?:\A|\G,\s*)([^(),]*(?:(?:\([^()]*\))[^(),]*)*)
Options: < none >
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 50
Elapsed Time: 2.97 s, 2972.45 ms, 2972454 µs
Regex2: ,(?![^()]*\))\s*
Options: < none >
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 49
Elapsed Time: 21.59 s, 21586.81 ms, 21586811 µs
When the sample is doubled, the time gets even worse: Regex1 roughly doubles (2.97 s to 5.89 s), while Regex2 roughly quadruples (21.59 s to 83.06 s).
Regex1: (?:\A|\G,\s*)([^(),]*(?:(?:\([^()]*\))[^(),]*)*)
Options: < none >
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 99
Elapsed Time: 5.89 s, 5887.16 ms, 5887163 µs
Regex2: ,(?![^()]*\))\s*
Options: < none >
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 98
Elapsed Time: 83.06 s, 83063.77 ms, 83063772 µs
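For anyone who wants to try both approaches in Java, here is a rough sketch. It is not the tool that produced the numbers above, and the class and method names (SplitComparison, fieldsByFind, fieldsBySplit, timeNanos) are placeholders. The timing is naive wall-clock timing with no JIT warm-up, so treat any output as indicative only. I have only traced the global find against the sample string; edge cases such as an empty first field are not covered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntSupplier;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitComparison {
    // Regex1 from the post: one match per field, chained with \G.
    static final Pattern FIELD = Pattern.compile(
        "(?:\\A|\\G,\\s*)([^(),]*(?:(?:\\([^()]*\\))[^(),]*)*)");
    // Regex2 from the post: comma with an open-ended negative lookahead.
    static final Pattern COMMA = Pattern.compile(",(?![^()]*\\))\\s*");

    // Collect group 1 of every successive Regex1 match.
    static List<String> fieldsByFind(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    // Split on every Regex2 match.
    static List<String> fieldsBySplit(String input) {
        return Arrays.asList(COMMA.split(input));
    }

    // Naive timing loop; the sink keeps the results in use.
    static long timeNanos(IntSupplier task, int iterations) {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += task.getAsInt();
        }
        long elapsed = System.nanoTime() - start;
        if (sink < 0) System.out.println(sink);
        return elapsed;
    }

    public static void main(String[] args) {
        String s = "abc, 3rncd (23uh, sdfuh), 32h(q23q)89 (as), dwe8h, "
                 + "edt (1,wer,345,rtz,tr t), nope";
        System.out.println(fieldsByFind(s));
        System.out.println(fieldsBySplit(s));

        // Synthetic input similar to the benchmark sample: 49 fields, each followed by a comma.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 49; i++) sb.append("asdf,");
        String sample = sb.toString();
        int iterations = 50_000;
        long findMs = timeNanos(() -> fieldsByFind(sample).size(), iterations) / 1_000_000;
        long splitMs = timeNanos(() -> fieldsBySplit(sample).size(), iterations) / 1_000_000;
        System.out.println("find:  " + findMs + " ms");
        System.out.println("split: " + splitMs + " ms");
    }
}
```

Note that on input with a trailing comma the two methods do not return the same number of items: the global find reports a final empty field, while Java's split drops trailing empty strings.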