I am building a natural language processing application in Java, using data from IMDB and Amazon.
I came across a certain dataset which has words like
There are no English words that I know of that have more than two consecutive identical letters, so any run of three or more identical letters can be treated as an elongation and collapsed.
This approach would not catch:
partyy
"stoop" (plus that's ambiguous! Is that "stop" with an extra "o" or simply "stoop")
You can use a regex to find any letter that is followed by the same letter at least two more times (since we don't want to remove correct double letters, like the "mm" in "comma").
String data = "stoooooop partyyyyyy";
System.out.println(data.replaceAll("([a-zA-Z])\\1{2,}", "$1"));
// ([a-zA-Z])  group 1: a single letter
// \\1{2,}     the match from group 1, repeated twice or more
// "$1"        replace with the match from group 1
Output:
stop party
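If the replacement runs over many reviews, one option is to compile the pattern once and reuse it, since String.replaceAll compiles its regex on every call. The following is a minimal sketch, not a definitive implementation; the names ElongationSqueezer, squeezeToOne and squeezeToTwo are hypothetical and do not come from the answer above.

```java
import java.util.regex.Pattern;

// Hypothetical helper class: the class and method names are assumptions for illustration.
class ElongationSqueezer {
    // A single ASCII letter followed by two or more copies of itself.
    private static final Pattern RUN = Pattern.compile("([a-zA-Z])\\1{2,}");

    // Collapse each run of three or more identical letters to one letter.
    static String squeezeToOne(String text) {
        return RUN.matcher(text).replaceAll("$1");
    }

    // Collapse each run of three or more identical letters to two letters.
    static String squeezeToTwo(String text) {
        return RUN.matcher(text).replaceAll("$1$1");
    }

    public static void main(String[] args) {
        String data = "stoooooop partyyyyyy";
        System.out.println(squeezeToOne(data)); // stop party
        System.out.println(squeezeToTwo(data)); // stoop partyy
        System.out.println(squeezeToOne("comma")); // comma (double letters are untouched)
    }
}
```

Collapsing to one letter turns "cooool" into "col", while collapsing to two turns "stoooop" into "stoop", so neither choice is right for every word.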
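You can use this snippet; it is quite a fast implementation. Note that it collapses every run of repeated characters to a single character, so legitimate double letters are reduced too ("comma" becomes "coma").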
// Collapses every run of identical consecutive characters to one character.
public static String removeConsecutiveChars(String str) {
    if (str == null) {
        return null;
    }
    int strLen = str.length();
    if (strLen <= 1) {
        return str;
    }
    char[] strChar = str.toCharArray();
    char temp = strChar[0]; // character of the run currently being scanned
    StringBuilder stringBuilder = new StringBuilder(strLen);
    for (int i = 1; i < strLen; i++) {
        char val = strChar[i];
        if (val != temp) {
            // The run ended: emit its character once and start a new run.
            stringBuilder.append(temp);
            temp = val;
        }
    }
    stringBuilder.append(temp); // emit the final run
    return stringBuilder.toString();
}
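Try using a loop. This also collapses every run of repeated characters to a single character:
String word = "Stoooppppd";
StringBuilder res = new StringBuilder();
char prev = word.charAt(0); // assumes word is not empty
res.append(prev);
for (int i = 1; i < word.length(); i++) {
    char ch = word.charAt(i);
    // Append only when the character differs from the previous one.
    if (ch != prev) {
        res.append(ch);
    }
    prev = ch;
}
System.out.println(res); // prints "Stopd"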
You may wish to use \p{L}\p{M}* instead of [a-zA-Z] so that non-English Unicode letters (each with any combining marks that follow it) are handled as well. It would look like this: replaceAll("(\\p{L}\\p{M}*)(\\1{" + maxAllowedRepetition + ",})", "$1");
or, equivalently, without the extra capturing group: replaceAll("(\\p{L}\\p{M}*)\\1{" + maxAllowedRepetition + ",}", "$1");
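To show how the Unicode-aware pattern behaves, here is a minimal sketch under stated assumptions: the class name UnicodeSqueezer and the method name squeeze are hypothetical, and maxAllowedRepetition is taken to mean the longest run that is left alone (with a value of 2, runs of three or more are collapsed to one letter). The accented test strings are written with Unicode escapes so that the result does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

// Hypothetical helper class: the class and method names are assumptions for illustration.
class UnicodeSqueezer {
    // Collapse runs longer than maxAllowedRepetition to a single letter.
    // A "letter" here is \p{L} followed by any combining marks (\p{M}*).
    static String squeeze(String text, int maxAllowedRepetition) {
        Pattern run = Pattern.compile(
                "(\\p{L}\\p{M}*)\\1{" + maxAllowedRepetition + ",}");
        return run.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        // ASCII input behaves like the [a-zA-Z] version.
        System.out.println(squeeze("stoooooop partyyyyyy", 2)); // stop party

        // Precomposed e-acute (U+00E9) repeated four times collapses to one.
        String precomposed = squeeze("caf\u00e9\u00e9\u00e9\u00e9", 2);
        System.out.println(precomposed.equals("caf\u00e9")); // true

        // "e" + combining acute (U+0301) is matched as one unit by \p{L}\p{M}*.
        String decomposed = squeeze("cafe\u0301e\u0301e\u0301", 2);
        System.out.println(decomposed.equals("cafe\u0301")); // true
    }
}
```

Because the backreference compares the captured text exactly, a precomposed letter and its decomposed form are not treated as the same letter; normalizing the input first (for example with java.text.Normalizer) would be one way to handle mixed input.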