Example input:
RC23
CC23QQ21HD32
BPOASDf91A5HH123
Example output:
You can use lookahead, something like "((?<=%1[0-9]+)|(?=%1[0-9]+))"
Try this regex: "((?<=[a-zA-Z])(?=[0-9]))|((?<=[0-9])(?=[a-zA-Z]))"
Here's a running example: http://ideone.com/c02rmM
{
    ...
    String someString = "CC23QQ21HD32";
    String regex = "((?<=[a-zA-Z])(?=[0-9]))|((?<=[0-9])(?=[a-zA-Z]))";
    System.out.println(Arrays.asList(someString.split(regex)));
    // outputs [CC, 23, QQ, 21, HD, 32]
    ...
}
The regex uses lookaheads (?=ValueToMatch) and lookbehinds (?<=ValueToMatch).
The first half of it (before the | ) asks: "Is the previous character a letter (?<=[a-zA-Z])? Is the next character a digit (?=[0-9])?" If both are true, the regex matches at that position, between the two characters.
The second half of the regex does it the other way around. It asks: "Is the previous character a digit (?<=[0-9])? Is the next character a letter (?=[a-zA-Z])?" Again, it matches if both are true.
Normally split() removes the characters matched by the regex, and that is still true for this regex. However, since the regex consists only of zero-width lookarounds, each match is an empty string between two characters, so none of the characters you want to keep are removed.
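To make the zero-width behaviour concrete, here is a minimal sketch that lists where the regex matches and compares a split on a consuming delimiter with a split on the lookaround regex. The class and method names (LookaroundSplitDemo, boundaries, splitOnDigits, splitOnBoundaries) are made up for illustration and are not part of the answer above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex itself comes from the answer.
class LookaroundSplitDemo {
    static final String BOUNDARY_REGEX =
        "((?<=[a-zA-Z])(?=[0-9]))|((?<=[0-9])(?=[a-zA-Z]))";

    // Collects the index of every match. Each match is empty, so it marks
    // a position between two characters rather than a run of characters.
    static List<Integer> boundaries(String s) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile(BOUNDARY_REGEX).matcher(s);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    // Consuming delimiter: the digits are the separator, so they are dropped.
    static List<String> splitOnDigits(String s) {
        return Arrays.asList(s.split("[0-9]+"));
    }

    // Zero-width delimiter: nothing is consumed, so every character survives.
    static List<String> splitOnBoundaries(String s) {
        return Arrays.asList(s.split(BOUNDARY_REGEX));
    }

    public static void main(String[] args) {
        String s = "CC23QQ21HD32";
        System.out.println(boundaries(s));        // [2, 4, 6, 8, 10]
        System.out.println(splitOnDigits(s));     // [CC, QQ, HD]
        System.out.println(splitOnBoundaries(s)); // [CC, 23, QQ, 21, HD, 32]
    }
}
```

The first line of output shows that the matches are positions, not characters, which is why the second split keeps the digits while the first one loses them.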
Check out Adam Paynter's answer for more on lookaheads and lookbehinds: "how to split string with some separator but without removing that separator in Java?"
You can match on 1 or more contiguous alpha characters or 1 or more contiguous numerical characters. Once the sequence is interrupted, stop matching, store the sequence, and then start over. Non-word characters will be ignored entirely.
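A minimal sketch of that matching approach is below. The class name TokenMatchDemo and the input "CC23-QQ 21" are made up for illustration, and the character class [A-Za-z] is an assumption chosen so that lowercase letters are kept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class illustrating the "match runs, skip the rest" idea.
class TokenMatchDemo {
    // A token is either a run of digits or a run of ASCII letters.
    // Assumption: both cases are accepted via [A-Za-z].
    static final Pattern TOKEN = Pattern.compile("[0-9]+|[A-Za-z]+");

    static List<String> tokens(String s) {
        List<String> parts = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        // find() jumps to the next run; anything between runs is skipped.
        while (m.find()) {
            parts.add(m.group());
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(tokens("CC23QQ21HD32")); // [CC, 23, QQ, 21, HD, 32]
        System.out.println(tokens("CC23-QQ 21"));   // [CC, 23, QQ, 21]
    }
}
```

Unlike the split-based approach, the hyphen and the space never appear in the result, because they are not part of any match.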
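Edit: I created a simple performance test below to compare the speed of String.split() and Pattern.matcher(). In the run shown, the split version was about 2.5x faster than the matcher+loop version. Treat that as a rough indication only: each timing covers a single pass over three strings, includes the console output, has no JIT warm-up, and the matcher test runs first.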
private static String[] splitAlphaNumeric(String str) {
    return str.split("(?i)((?<=[A-Z])(?=\\d))|((?<=\\d)(?=[A-Z]))");
}
import java.util.*;
import java.util.regex.*;

public class SplitAlphaNumeric {
    // Note: this pattern is case-sensitive as written, so lowercase letters
    // (the 'f' in the third input) are skipped, as the Test 1 output shows.
    // Pattern.CASE_INSENSITIVE or [A-Za-z] would keep them.
    private static final Pattern ALPH_NUM_PAT = Pattern.compile("[0-9]+|[A-Z]+");

    private static List<String> input = Arrays.asList(
        "RC23",
        "CC23QQ21HD32",
        "BPOASDf91A5HH123"
    );

    public static void main(String[] args) {
        System.out.printf("Execution time: %dns%n", testMatch());
        System.out.printf("Execution time: %dns%n", testSplit());
    }

    // Times the Matcher-based version (the printing is included in the timing).
    public static long testMatch() {
        System.out.println("Begin Test 1...");
        long start = System.nanoTime();
        for (String str : input) {
            System.out.printf("%-16s -> %s%n", str, parse(str));
        }
        long end = System.nanoTime();
        return end - start;
    }

    // Times the split-based version (the printing is included in the timing).
    public static long testSplit() {
        System.out.println("\nBegin Test 2...");
        long start = System.nanoTime();
        for (String str : input) {
            System.out.printf("%-16s -> %s%n", str, parse2(str));
        }
        long end = System.nanoTime();
        return end - start;
    }

    // Collects every run of digits or uppercase letters found by the matcher.
    private static List<String> parse(String str) {
        List<String> parts = new LinkedList<String>();
        Matcher matcher = ALPH_NUM_PAT.matcher(str);
        while (matcher.find()) {
            parts.add(matcher.group());
        }
        return parts;
    }

    // Splits at every letter/digit boundary; (?i) makes [A-Z] case-insensitive.
    private static List<String> parse2(String str) {
        return Arrays.asList(str.split("(?i)((?<=[A-Z])(?=\\d))|((?<=\\d)(?=[A-Z]))"));
    }
}
Begin Test 1...
RC23 -> [RC, 23]
CC23QQ21HD32 -> [CC, 23, QQ, 21, HD, 32]
BPOASDf91A5HH123 -> [BPOASD, 91, A, 5, HH, 123]
Execution time: 4879125ns
Begin Test 2...
RC23 -> [RC, 23]
CC23QQ21HD32 -> [CC, 23, QQ, 21, HD, 32]
BPOASDf91A5HH123 -> [BPOASDf, 91, A, 5, HH, 123]
Execution time: 1953349ns
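Because the timings above cover a single pass that includes printing, here is a hedged sketch of how the comparison could be repeated with a warm-up phase and without console output inside the timed loop. The class name, the helper time(), and the round counts are assumptions made for illustration. No timings are claimed for it, and a proper harness such as JMH would be more reliable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical re-run of the comparison; not the original author's benchmark.
class SplitVsMatchTiming {
    // Assumption: [A-Za-z] so both versions keep lowercase letters.
    static final Pattern TOKEN = Pattern.compile("[0-9]+|[A-Za-z]+");
    static final List<String> INPUT =
        Arrays.asList("RC23", "CC23QQ21HD32", "BPOASDf91A5HH123");

    static List<String> viaMatcher(String s) {
        List<String> parts = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        while (m.find()) {
            parts.add(m.group());
        }
        return parts;
    }

    // Same regex as the original test; String.split recompiles it per call.
    static List<String> viaSplit(String s) {
        return Arrays.asList(s.split("(?i)((?<=[A-Z])(?=\\d))|((?<=\\d)(?=[A-Z]))"));
    }

    // Runs f over all inputs `rounds` times and returns elapsed nanoseconds.
    static long time(Function<String, List<String>> f, int rounds) {
        int sink = 0; // consume results so the work cannot be optimised away
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            for (String s : INPUT) {
                sink += f.apply(s).size();
            }
        }
        long elapsed = System.nanoTime() - start;
        if (sink < 0) throw new IllegalStateException("unreachable");
        return elapsed;
    }

    public static void main(String[] args) {
        // Warm-up rounds so both code paths are loaded and JIT-compiled.
        time(SplitVsMatchTiming::viaMatcher, 20_000);
        time(SplitVsMatchTiming::viaSplit, 20_000);
        // Measured rounds; printing happens only after timing has finished.
        long matchNs = time(SplitVsMatchTiming::viaMatcher, 100_000);
        long splitNs = time(SplitVsMatchTiming::viaSplit, 100_000);
        System.out.printf("matcher: %dns%nsplit:   %dns%n", matchNs, splitNs);
    }
}
```

The ratio between the two numbers may differ from the 2.5x figure above, since the startup cost no longer falls on whichever test runs first.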