In Java, what is the most efficient way of removing given characters from a String? Currently, I have this code:
private static String processWord(String x)
Here's a late answer, just for fun.
In cases like this, I would suggest aiming for readability over speed. Of course you can be super-readable but too slow, as in this super-concise version:
private static String processWord(String x) {
return x.replaceAll("[\\]\\[(){},.;!?<>%]", "");
}
This is slow because every time you call this method, the regex is compiled again. So you can pre-compile the regex:
import java.util.regex.Pattern;
private static final Pattern UNDESIRABLES = Pattern.compile("[\\]\\[(){},.;!?<>%]");
private static String processWord(String x) {
return UNDESIRABLES.matcher(x).replaceAll("");
}
This should be fast enough for most purposes, assuming the JVM's regex engine optimizes the character class lookup. This is the solution I would use, personally.
Now without profiling, I wouldn't know whether you could do better by making your own character (actually codepoint) lookup table:
private static final boolean[] CHARS_TO_KEEP = new boolean[Character.MAX_VALUE + 1]; // assumed size: one entry per char value
Fill this once and then iterate, making your resulting string. I'll leave the code to you. :)
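For what it's worth, here is one way the lookup-table idea could be filled in. It is only a sketch under some assumptions: the class name LookupTableFilter is made up, the table is indexed by UTF-16 char value rather than by full code point, and the characters to drop are the ones listed elsewhere in this thread.

```java
import java.util.Arrays;

final class LookupTableFilter {
    // Assumption: indexed by char value (0..65535), not by code point.
    // That is enough for the ASCII punctuation discussed here.
    private static final boolean[] CHARS_TO_KEEP = new boolean[Character.MAX_VALUE + 1];

    static {
        // Keep everything by default, then mark the unwanted characters.
        Arrays.fill(CHARS_TO_KEEP, true);
        for (char c : ",.;!?(){}[]<>%".toCharArray()) {
            CHARS_TO_KEEP[c] = false;
        }
    }

    static String processWord(String x) {
        StringBuilder sb = new StringBuilder(x.length());
        for (int i = 0; i < x.length(); i++) {
            char c = x.charAt(i);
            if (CHARS_TO_KEEP[c]) { // one array lookup per character
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(processWord("f,i.l;t!e?r(e)d {s}t[r]i<n>g%"));
    }
}
```

Whether this actually beats the pre-compiled Pattern is something only a profiler can tell you.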
Again, I wouldn't dive into this kind of optimization. The code has become too hard to read. Is performance that much of a concern? Also remember that the JVM JIT-compiles hot code, so it will perform better after warming up; use a good profiler before deciding.
One thing that should be mentioned is that the example in the original question is highly non-performant because you are creating a whole bunch of temporary strings! Unless a compiler optimizes all that away, that particular solution will perform the worst.
Although \\p{Punct} will specify a wider range of characters than in the question, it does allow for a shorter replacement expression:
tmp = tmp.replaceAll("\\p{Punct}+", "");
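To see what "wider range" means in practice, here is a small comparison sketch. The class and method names (PunctDemo, stripListed, stripPunct) are made up for illustration, and the explicit character class is the one used elsewhere in this thread.

```java
final class PunctDemo {
    // Removes only the characters listed in the question.
    static String stripListed(String s) {
        return s.replaceAll("[,.;!?(){}\\[\\]<>%]", "");
    }

    // Removes every ASCII punctuation character, including ' - _ " and so on.
    static String stripPunct(String s) {
        return s.replaceAll("\\p{Punct}+", "");
    }

    public static void main(String[] args) {
        String input = "it's a well-known fact, right?";
        System.out.println(stripListed(input)); // it's a well-known fact right
        System.out.println(stripPunct(input));  // its a wellknown fact right
    }
}
```

So \\p{Punct} also eats apostrophes and hyphens, which may or may not be what you want.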
Use String#replaceAll(String regex, String replacement) as follows:
tmp = tmp.replaceAll("[,.;!?(){}\\[\\]<>%]", "");
System.out.println(
"f,i.l;t!e?r(e)d {s}t[r]i<n>g%".replaceAll(
"[,.;!?(){}\\[\\]<>%]", "")); // prints "filtered string"
Right now your code iterates over all characters of tmp and compares each of them with every character you want to remove, so it performs (number of characters in tmp) x (number of characters you want to remove) comparisons.
To optimize your code you could use the short-circuit OR operator || and do something like
StringBuilder sb = new StringBuilder();
for (char c : tmp.toCharArray()) {
if (!(c == ',' || c == '.' || c == ';' || c == '!' || c == '?'
|| c == '(' || c == ')' || c == '{' || c == '}' || c == '['
|| c == ']' || c == '<' || c == '>' || c == '%'))
sb.append(c);
}
tmp = sb.toString();
or like this
StringBuilder sb = new StringBuilder();
char[] badChars = ",.;!?(){}[]<>%".toCharArray();
outer:
for (char strChar : tmp.toCharArray()) {
for (char badChar : badChars) {
if (badChar == strChar)
continue outer;// we skip `strChar` since it is bad character
}
sb.append(strChar);
}
tmp = sb.toString();
This way you still iterate over every character of tmp, but the number of comparisons for a removed character drops unless it is '%' (which is checked last): if the character is ',' the program gets its result after a single comparison. Characters that are kept are still compared against the whole list.
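If you want to check that reasoning, a tiny counting sketch like the one below may help. ComparisonCount and comparisonsFor are hypothetical names, and the loop mirrors the early-exit search over the same list of bad characters.

```java
final class ComparisonCount {
    private static final char[] BAD_CHARS = ",.;!?(){}[]<>%".toCharArray();

    // Counts how many comparisons the early-exit loop makes for one character.
    static int comparisonsFor(char c) {
        int count = 0;
        for (char bad : BAD_CHARS) {
            count++;
            if (bad == c) {
                break; // found it, no need to look further
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(comparisonsFor(',')); // 1  (first in the list)
        System.out.println(comparisonsFor('%')); // 14 (last in the list)
        System.out.println(comparisonsFor('a')); // 14 (kept, so the whole list is scanned)
    }
}
```

Since most characters in normal text are kept, the early exit saves less than it may seem at first.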
If I am not mistaken, this approach is also used with a character class ([...]), so maybe try it this way:
Pattern p = Pattern.compile("[,.;!?(){}\\[\\]<>%]"); // store it somewhere so
//you won't need to compile it again
tmp = p.matcher(tmp).replaceAll("");
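You can do this:
tmp = tmp.replaceAll("\\W", "");
to remove punctuation. Note that the result has to be assigned back, because replaceAll returns a new string, and that \\W matches every non-word character, so whitespace is removed as well.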
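A quick sketch of what \\W does to a sentence, next to a variant that keeps whitespace. The names NonWordDemo, stripNonWord and stripNonWordKeepSpaces are made up, and the second pattern is just one possible alternative, not something from the question.

```java
final class NonWordDemo {
    // Removes everything except letters, digits and underscore.
    static String stripNonWord(String s) {
        return s.replaceAll("\\W", "");
    }

    // Assumed alternative: also keep whitespace characters.
    static String stripNonWordKeepSpaces(String s) {
        return s.replaceAll("[^\\w\\s]", "");
    }

    public static void main(String[] args) {
        System.out.println(stripNonWord("Hello, world!"));           // Helloworld
        System.out.println(stripNonWordKeepSpaces("Hello, world!")); // Hello world
    }
}
```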
Strings are immutable, so it's not a good idea to modify them repeatedly. Try using StringBuilder instead of String and use all of its wonderful methods! It will let you do anything you want. And yes, if there is something specific you're trying to do, figure out the regex for it and it will work a lot better for you.