Java - removing strange characters from a String


轮回少年

How do I remove strange and unwanted Unicode characters (such as a black diamond with question mark) from a String?

Updated:

Please tell me the Unicode chara


11 Answers

爱一瞬间的悲伤

2020-12-10 03:05
Put the characters that you want to get rid of in a list, then iterate over it, calling replaceAll for each one:
```
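import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

String str = "Some text with unicode !@#$";
// Modify this list to contain the characters you want to remove.
List<String> badChars = Arrays.asList("@", "~", "!");

String resultStr = str;
for (String s : badChars) {
    // Pattern.quote makes replaceAll treat each entry literally, not as a regex.
    resultStr = resultStr.replaceAll(Pattern.quote(s), "");
}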
```
You will end up with the cleaned string in resultStr. I haven't tested this, but it should be along these lines.

挽巷

2020-12-10 03:08
Justin Thomas's answer was close, but this is probably closer to what you're looking for:
```
String nonStrange = strangeString.replaceAll("\\p{Cntrl}", ""); 
```
The selector \p{Cntrl} selects "A control character: [\x00-\x1F\x7F]."
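
To make the effect concrete, here is a minimal sketch of that one-liner wrapped in a method. The class and method names (ControlCharFilter, stripControlChars) are made up for illustration. Note that \p{Cntrl} only covers the ASCII control range quoted above, so it leaves U+FFFD (the black diamond) untouched.

```java
final class ControlCharFilter {
    // Removes ASCII control characters (U+0000 to U+001F, plus U+007F).
    // Hypothetical helper name; the regex is the one from the answer.
    static String stripControlChars(String input) {
        return input.replaceAll("\\p{Cntrl}", "");
    }

    public static void main(String[] args) {
        String strange = "tab\there\u0000 and bell\u0007";
        // The tab, NUL and BEL characters are removed.
        System.out.println(stripControlChars(strange)); // prints "tabhere and bell"
    }
}
```

Tabs and line breaks are control characters too, so this also flattens multi-line text.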


悲&欢浪女

2020-12-10 03:13

Most probably the text that you got was encoded in something other than UTF-8. One option is to reject uploads in other encodings (for example Latin-1):

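```
// Requires java.io.*, java.nio.charset.*, and IOUtils from Apache Commons IO.
CharsetDecoder charsetDecoder = StandardCharsets.UTF_8.newDecoder();
// Fail on invalid byte sequences instead of silently substituting U+FFFD.
charsetDecoder.onMalformedInput(CodingErrorAction.REPORT);

try (Reader reader = new InputStreamReader(new FileInputStream(filePath), charsetDecoder)) {
  return IOUtils.toString(reader);
}
catch (MalformedInputException e) {
  // The file was not saved with UTF-8 encoding.
  throw new IllegalArgumentException("File is not valid UTF-8: " + filePath, e);
}
```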
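
If you would rather not pull in Commons IO, the same strict-decoding idea can be sketched with the JDK alone. This is only an illustration of the check, not a drop-in for the snippet above; the class and method names (Utf8Check, isValidUtf8) are hypothetical, and it assumes the whole upload is already in a byte array.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    // Returns true only if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown for byte sequences that are not valid UTF-8.
            return false;
        }
    }

    public static void main(String[] args) {
        // The same text ("caf" plus e-acute) encoded two different ways.
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(utf8));   // prints "true"
        System.out.println(isValidUtf8(latin1)); // prints "false"
    }
}
```

Rejecting bad input at this point means the black diamonds never get into your strings in the first place.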


栀梦

2020-12-10 03:14

The black diamond with a question mark is U+FFFD (�), the Unicode replacement character. It is a placeholder, not part of the original text: a decoder substitutes it for bytes it cannot interpret in the expected encoding, and some renderers show it for characters they cannot display. Its appearance varies depending on the font you're using.
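
If all you want is to drop that placeholder from a String that already contains it, a literal replace is enough. A minimal sketch, with made-up class and method names; keep in mind that whatever character was originally there is already lost, so this only hides the symptom.

```java
final class ReplacementCharRemover {
    // Removes every occurrence of U+FFFD from the input.
    static String removeReplacementChars(String input) {
        return input.replace("\uFFFD", "");
    }

    public static void main(String[] args) {
        String damaged = "price: 10\uFFFD";
        System.out.println(removeReplacementChars(damaged)); // prints "price: 10"
    }
}
```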

You can use java.text.Normalizer to decompose accented characters into a base letter plus combining marks, and then strip everything that is left outside the "normal" ASCII character set.
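
A rough sketch of that approach, assuming NFD decomposition followed by a regex that deletes anything outside ASCII. The names AsciiFolder and toAscii are invented for the example.

```java
import java.text.Normalizer;

final class AsciiFolder {
    // Decomposes accented letters, then deletes every non-ASCII character.
    static String toAscii(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // e-acute and i-diaeresis are reduced to their base letters.
        System.out.println(toAscii("caf\u00e9 na\u00efve")); // prints "cafe naive"
    }
}
```

This is lossy: characters with no ASCII base letter (Chinese text, for example, or U+FFFD itself) are removed entirely, so it only suits input that is expected to be mostly Latin.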


不思量自难忘°

2020-12-10 03:14
Keep only English letters, Chinese characters, digits, and punctuation:
```
str = str.replaceAll("[^!-~\\u2000-\\uFE1F\\uFF00-\\uFFEF]", "");
```
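
Here is a small sketch of this whitelist in use, assuming the intended ranges are printable ASCII (! through ~), U+2000 to U+FE1F, and U+FF00 to U+FFEF. The class and method names are hypothetical. Two things to watch: the ASCII range starts at '!', so plain spaces are removed as well, and U+FFFD falls outside all three ranges, so the black diamond is dropped.

```java
final class WhitelistFilter {
    // Everything NOT in the listed ranges is deleted.
    private static final String NOT_ALLOWED =
            "[^!-~\\u2000-\\uFE1F\\uFF00-\\uFFEF]";

    static String keepAllowed(String input) {
        return input.replaceAll(NOT_ALLOWED, "");
    }

    public static void main(String[] args) {
        // Two CJK characters survive; the space and U+FFFD do not.
        String cleaned = keepAllowed("Hi, \u4e16\u754c!\uFFFD");
        System.out.println(cleaned.length()); // prints "6"
    }
}
```

If spaces should survive, add a space inside the character class.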

北荒

2020-12-10 03:17
Use String.replaceAll():
```
String clean = "♠clean".replaceAll("♠", "");
```

