How to remove non-valid Unicode characters from strings in Java

感情败类  2021-02-15 17:18

I am using the CoreNLP Neural Network Dependency Parser to parse some social media content. Unfortunately, the file contains characters which are, according to fileformat.info, not valid Unicode characters.

4 Answers
  •  渐次进展
    2021-02-15 17:21

    Remove specific unwanted chars with:

    document.replaceAll("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010]", "");
    

    If you find other unwanted chars, simply add them to the character class using the same scheme.
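
    One thing worth spelling out for the snippet above: String.replaceAll does not change the string in place (Java strings are immutable), so the cleaned text has to be assigned back. A minimal, self-contained sketch of this first approach, using a made-up sample string, might look like:

    public class StripSpecificChars {
        public static void main(String[] args) {
            // Made-up sample containing a double exclamation mark (U+203C),
            // a variation selector (U+FE0F) and two lenticular brackets (U+3010),
            // all of which appear in the character class below.
            String document = "Great news\u203C\uFE0F \u3010update\u3010 coming soon";

            // replaceAll returns a new String, so reassign the result.
            document = document.replaceAll("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010]", "");

            System.out.println(document); // -> "Great news update coming soon"
        }
    }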

    UPDATE:

    The Unicode chars are split by the regex engine into 7 major categories (and several sub-categories), each identified by one letter (major category) or two letters (sub-category).
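
    As a small illustration of my own (not from the original answer), the one-letter class \p{L} covers every letter sub-category, while the two-letter class \p{Lu} matches only the uppercase-letter sub-group:

    public class CategoryDemo {
        public static void main(String[] args) {
            System.out.println("a".matches("\\p{L}"));  // true  (any letter)
            System.out.println("a".matches("\\p{Lu}")); // false (not an uppercase letter)
            System.out.println("A".matches("\\p{Lu}")); // true

            // Character.getType reports the sub-category of a code point.
            System.out.println(Character.getType('a') == Character.LOWERCASE_LETTER); // true
        }
    }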

    Basing my reasoning on your examples and the Unicode classes described on the always useful Regular Expressions site, I think you can try a single keep-only-the-good-chars pass such as this:

    document.replaceAll("[^\\p{L}\\p{N}\\p{Z}\\p{Sm}\\p{Sc}\\p{Sk}\\p{Pi}\\p{Pf}\\p{Pc}\\p{Mc}]", "");
    

    This regex removes anything that is not:

    • \p{L}: a letter in any language
    • \p{N}: a number
    • \p{Z}: any kind of whitespace or invisible separator
    • \p{Sm}\p{Sc}\p{Sk}: a math symbol, currency sign, or modifier symbol, each a single character
    • \p{Mc}*: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
    • \p{Pi}\p{Pf}\p{Pc}*: an opening quote, a closing quote, or a connector punctuation mark (e.g. the underscore)

    *: I think these groups are also eligible to be removed for the purposes of CoreNLP.

    This way you only need a single regex filter, and you can handle whole groups of chars (with the same purpose) instead of single cases.
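
    Putting the whitelist pass together, here is a self-contained sketch; the class name, the clean helper and the sample tweet are invented for illustration and are not part of the original answer:

    public class UnicodeWhitelist {
        // Keep only letters, digits, separators and the symbol/punctuation
        // categories listed above; everything else (emoji, control characters,
        // stray surrogate halves, ...) is dropped.
        private static final String KEEP_ONLY =
                "[^\\p{L}\\p{N}\\p{Z}\\p{Sm}\\p{Sc}\\p{Sk}\\p{Pi}\\p{Pf}\\p{Pc}\\p{Mc}]";

        public static String clean(String document) {
            return document.replaceAll(KEEP_ONLY, "");
        }

        public static void main(String[] args) {
            // Invented sample: an emoji (U+1F600) plus a double exclamation mark.
            String tweet = "Nice work \uD83D\uDE00 team\u203C\uFE0F cost was $120";
            // -> "Nice work  team cost was $120" (a double space remains where
            //    the emoji was; digits and the currency sign survive)
            System.out.println(clean(tweet));
        }
    }

    One thing to check against your own data: ordinary punctuation such as '.', ',' or '#' belongs to \p{Po}, which is not in the whitelist, so it is stripped as well; add \p{Po} (or whichever classes you need) to the character class if the parser should keep it.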
