How do I remove strange and unwanted Unicode characters (such as a black diamond with question mark) from a String?
Updated:
Please tell me the Unicode chara
Put the characters that you want to get rid of in a list, then loop over the list and remove each one with replace:
// requires: import java.util.Arrays; import java.util.List;
String str = "Some text with unicode !@#$";
List<String> badChars = Arrays.asList("@", "~", "!"); // modify this to contain the unwanted characters
String resultStr = str;
for (String s : badChars) {
    // replace() treats s literally, so regex metacharacters need no escaping
    resultStr = resultStr.replace(s, "");
}
You will end up with the cleaned string in resultStr. I haven't tested this, but it should be along these lines.
Justin Thomas's was close, but this is probably closer to what you're looking for:
String nonStrange = strangeString.replaceAll("\\p{Cntrl}", "");
The selector \p{Cntrl} selects "A control character: [\x00-\x1F\x7F]."
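As a minimal sketch of what this pattern does and does not catch (the class and method names below are made up for illustration): by default, \p{Cntrl} in Java matches only the ASCII control characters, so U+FFFD is left in place.

```java
final class ControlCharDemo {
    // Removes ASCII control characters (U+0000-U+001F and U+007F) only.
    static String stripControl(String input) {
        return input.replaceAll("\\p{Cntrl}", "");
    }

    public static void main(String[] args) {
        // The tab and the bell character are removed; U+FFFD is not a
        // control character, so it stays in the result.
        String raw = "tab\there\u0007 bell \uFFFD";
        System.out.println(stripControl(raw));
    }
}
```

If the unwanted character is U+FFFD rather than a control character, this pattern alone will not remove it.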
Most probably the text that you got was encoded in something other than UTF-8. One option is to reject uploads in other encodings (for example Latin-1):
try {
    CharsetDecoder charsetDecoder = StandardCharsets.UTF_8.newDecoder();
    // REPORT makes the decoder throw instead of silently substituting U+FFFD
    charsetDecoder.onMalformedInput(CodingErrorAction.REPORT);
    // IOUtils here is presumably org.apache.commons.io.IOUtils (Apache Commons IO)
    return IOUtils.toString(new InputStreamReader(new FileInputStream(filePath), charsetDecoder));
}
catch (MalformedInputException e) {
    // throw an exception saying the file was not saved with UTF-8 encoding.
}
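The same check can be sketched without a third-party dependency by decoding a byte array directly. This is only an illustration under the assumption that the whole input fits in memory; the class name StrictUtf8 and the method isValidUtf8 are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8 {
    // Returns true only if the bytes decode as UTF-8 without any error.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // MalformedInputException is a subclass of CharacterCodingException.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(utf8));   // valid UTF-8
        System.out.println(isValidUtf8(latin1)); // lone 0xE9 byte is malformed UTF-8
    }
}
```

Rejecting the input up front keeps the replacement character out of your strings in the first place, rather than cleaning it up afterwards.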
A black diamond with a question mark is usually not a character from your original text. It is U+FFFD (�), the Unicode replacement character, which a decoder substitutes for bytes it cannot interpret in the encoding it was told to use. Seeing it therefore normally points to an encoding mismatch rather than to a font problem; a glyph that is simply missing from the font is typically drawn as an empty box instead. Its exact appearance varies depending on the font you're using.
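One common way U+FFFD ends up in a string is decoding bytes with the wrong charset. The sketch below (hypothetical class and method names) reproduces that and then removes the character. Keep in mind that the original byte is already lost at that point, so stripping U+FFFD only hides the damage.

```java
import java.nio.charset.StandardCharsets;

final class ReplacementCharDemo {
    // Removes every occurrence of U+FFFD from the input.
    static String stripReplacementChar(String input) {
        return input.replace("\uFFFD", "");
    }

    public static void main(String[] args) {
        // Encode as Latin-1, then decode as UTF-8: the byte 0xE9 is not
        // valid UTF-8 on its own, so the decoder substitutes U+FFFD.
        byte[] latin1 = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);
        String garbled = new String(latin1, StandardCharsets.UTF_8);
        System.out.println(garbled.contains("\uFFFD"));
        System.out.println(stripReplacementChar(garbled));
    }
}
```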
You can use java.text.Normalizer to decompose accented characters into a base letter plus combining marks, and then strip everything that is not in the "normal" ASCII character set.
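A minimal sketch of that approach, assuming you want to keep the base letters of accented characters and drop everything else outside ASCII (AsciiFold and toAscii are illustrative names, not part of the JDK):

```java
import java.text.Normalizer;

final class AsciiFold {
    // Decomposes accented characters (NFD), then drops every non-ASCII
    // character, which includes the combining marks and U+FFFD.
    static String toAscii(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        System.out.println(toAscii("Cr\u00E8me br\u00FBl\u00E9e")); // prints "Creme brulee"
    }
}
```

Note that this discards any character with no ASCII decomposition, so it is only suitable when losing non-Latin text is acceptable.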
To keep only English letters, Chinese characters, digits, and punctuation (note that this also strips spaces, which fall outside the !-~ range):
str = str.replaceAll("[^!-~\\u2000-\\uFE1F\\uFF00-\\uFFEF]", "");
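A runnable sketch of this whitelist, assuming the intended lower bound of the middle range is U+2000 (a five-digit \u escape is not valid in a Java regex, since \u takes exactly four hex digits). The class and method names are hypothetical.

```java
final class WhitelistFilter {
    // Keeps printable ASCII except space (!-~), U+2000-U+FE1F (which covers
    // the CJK ideographs), and the fullwidth forms U+FF00-U+FFEF.
    // U+FFFD falls outside all three ranges and is removed.
    private static final String DISALLOWED = "[^!-~\\u2000-\\uFE1F\\uFF00-\\uFFEF]";

    static String filter(String input) {
        return input.replaceAll(DISALLOWED, "");
    }

    public static void main(String[] args) {
        // The space and U+FFFD are dropped; the CJK characters survive.
        System.out.println(filter("Hello, \u4E16\u754C\uFFFD!"));
    }
}
```

To keep spaces as well, the first range could start at the space character instead of !.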
Use String.replaceAll(), which takes String arguments rather than char literals:
String clean = "♠clean".replaceAll("♠", "");