I am using the CoreNLP Neural Network Dependency Parser to parse some social media content. Unfortunately, the file contains characters which are, according to fileformat.info,
You can remove specific unwanted characters with:
document = document.replaceAll("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010]", "");
Note that Strings are immutable in Java, so replaceAll returns a new String; you must reassign the result.
If you find other unwanted characters, simply add them to the character class following the same schema.
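A minimal runnable sketch of this blacklist approach, using a hypothetical sample string (the sample text and the stripUnwanted helper name are my own, not from the question):

```java
public class BlacklistFilter {

    // Strips the specific unwanted code units listed in the question.
    // Caveat: \uD83D is a lone high surrogate; well-formed emoji are
    // surrogate pairs, so you may need to match the full pair instead.
    public static String stripUnwanted(String document) {
        return document.replaceAll("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010]", "");
    }

    public static void main(String[] args) {
        // U+203C (double exclamation) and U+3010 (corner bracket) get removed
        String document = "news\u203C \u3010live\u3010";
        System.out.println(stripUnwanted(document));
    }
}
```

The drawback of this blacklist style is that every newly encountered character needs another entry, which is what motivates the whitelist approach below.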
UPDATE:
The regex engine divides Unicode characters into 7 macro-groups (and several sub-groups), each identified by one letter (macro-group) or two letters (sub-group).
Based on your examples and the Unicode classes described on the always useful Regular Expressions Site, I think you can try a single keep-only-the-good-characters pass such as this:
document = document.replaceAll("[^\\p{L}\\p{N}\\p{Z}\\p{Sm}\\p{Sc}\\p{Sk}\\p{Pi}\\p{Pf}\\p{Pc}\\p{Mc}]", "");
This regex removes anything that is not:
\p{L} : a letter in any language
\p{N} : a number
\p{Z} : any kind of whitespace or invisible separator
\p{Sm}, \p{Sc}, \p{Sk} : a math, currency or modifier symbol
\p{Mc}* : a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages)
\p{Pi}, \p{Pf}, \p{Pc}* : an opening quote, a closing quote, or a word connector (e.g. the underscore)
* : I think these groups could be eligible for removal as well for the purposes of CoreNLP.
This way you only need a single regex filter, and you handle whole groups of characters (with the same purpose) instead of individual cases.
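The whitelist pass above can be sketched as a small self-contained example (the sample input and the keepGood helper name are illustrative, not from the original post):

```java
public class WhitelistFilter {

    // Keeps only letters, numbers, separators, math/currency/modifier
    // symbols, spacing combining marks, quotes and connector punctuation;
    // everything else (emoji, control chars, most punctuation) is dropped.
    public static String keepGood(String document) {
        return document.replaceAll(
            "[^\\p{L}\\p{N}\\p{Z}\\p{Sm}\\p{Sc}\\p{Sk}\\p{Pi}\\p{Pf}\\p{Pc}\\p{Mc}]", "");
    }

    public static void main(String[] args) {
        // ':' (other punctuation) and U+203C are outside the whitelist,
        // while '$' (Sc), '+' (Sm), letters, digits and spaces survive
        String raw = "Price: $5 + tax\u203C";
        System.out.println(keepGood(raw));
    }
}
```

One thing to check before adopting this: ordinary sentence punctuation like '.' and ',' falls under \p{Po} and is also removed, which may or may not be what you want for dependency parsing.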