How to remove invalid Unicode characters from strings in Java

感情败类 2021-02-15 17:18

I am using the CoreNLP Neural Network Dependency Parser to parse some social media content. Unfortunately, the file contains characters which are, according to fileformat.info,

4 Answers
  •  太阳男子
    2021-02-15 17:41

    We observed negative side effects elsewhere when using replaceAll, so I propose replacing only the non-BMP (supplementary) characters, as below:

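    // Replaces every supplementary (non-BMP) code point with '?'; BMP characters are kept as-is.
    private String removeNonBMPCharacters(final String input) {
        StringBuilder strBuilder = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            if (Character.isSupplementaryCodePoint(cp)) {
                // Code points above U+FFFF (for example emoji) become a single '?'
                strBuilder.append('?');
            } else {
                strBuilder.appendCodePoint(cp);
            }
        });
        return strBuilder.toString();
    }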
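
To try the approach above in isolation, here is a minimal, self-contained sketch. The class name `NonBmpFilter` and the static method name are hypothetical, chosen only for illustration; the logic is the same code-point loop as in the answer. Note one assumption worth checking against your data: `String.codePoints()` reports an unpaired surrogate as its own value in the BMP range, so this sketch passes lone surrogates through unchanged rather than replacing them.

```java
// Hypothetical helper class for illustration; not part of CoreNLP or the JDK.
final class NonBmpFilter {

    private NonBmpFilter() {
    }

    // Replaces each supplementary (non-BMP) code point with a single '?'.
    // BMP characters, including unpaired surrogates, are copied unchanged.
    static String removeNonBmpCharacters(final String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            if (Character.isSupplementaryCodePoint(cp)) {
                out.append('?');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // "\uD83D\uDE00" is the surrogate pair for U+1F600, a non-BMP emoji.
        String sample = "a\uD83D\uDE00b";
        System.out.println(removeNonBmpCharacters(sample)); // prints "a?b"
    }
}
```

Because the loop works on code points rather than `char` values, a surrogate pair is replaced by one `?` instead of two, which is the main difference from a naive per-`char` filter.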
    
