I am using the CoreNLP Neural Network Dependency Parser to parse some social media content. Unfortunately, the file contains characters which are, according to fileformat.info,
I have seen replaceAll cause unwanted side effects in other places, so I propose replacing only the characters that fall outside the BMP (Basic Multilingual Plane), like below:
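// Replaces every supplementary (non-BMP) code point with "?" and keeps everything else.
private String removeNonBMPCharacters(final String input) {
    StringBuilder strBuilder = new StringBuilder();
    // Iterate over code points, not chars, so a surrogate pair is seen as one character.
    input.codePoints().forEach((i) -> {
        if (Character.isSupplementaryCodePoint(i)) {
            strBuilder.append("?");
        } else {
            strBuilder.append(Character.toChars(i));
        }
    });
    return strBuilder.toString();
}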
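
To try the method above on its own, here is a minimal self-contained sketch. The class name NonBmpFilter and the sample string are my own assumptions for illustration, not anything from CoreNLP, and I made the method static so it can be called from main.

```java
// Hypothetical wrapper class, named here only for illustration.
class NonBmpFilter {

    // Replaces every code point outside the Basic Multilingual Plane with '?'.
    static String removeNonBmpCharacters(final String input) {
        StringBuilder sb = new StringBuilder(input.length());
        // codePoints() yields one int per character, joining surrogate pairs.
        input.codePoints().forEach(cp -> {
            if (Character.isSupplementaryCodePoint(cp)) {
                sb.append('?');
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F600 (an emoji) is written as a surrogate pair so this file stays ASCII.
        String sample = "nice \uD83D\uDE00 day";
        System.out.println(removeNonBmpCharacters(sample)); // prints "nice ? day"
    }
}
```

Note that each supplementary character occupies two chars in a Java String but is replaced by a single "?", so the result is one char shorter per replacement. If you rely on character offsets into the original text after parsing, you may want to append two placeholder characters instead.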