How to remove invalid Unicode characters from strings in Java

Asked by 感情败类, 2021-02-15 17:18

I am using the CoreNLP Neural Network Dependency Parser to parse some social media content. Unfortunately, the file contains characters which are, according to fileformat.info, not valid Unicode characters or Unicode replacement characters (for example U+D83D or U+FFFD). When those characters are present, CoreNLP prints a warning for each of them. How can I remove these characters from my strings before parsing?

4 Answers

  •  Answered by 陌清茗, 2021-02-15 17:21

    In a way, both answers provided by Mukesh Kumar and GsusRecovery are helpful, but not fully correct.

    document.replaceAll("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]", "");
    

    seems to replace all invalid characters. But CoreNLP turned out to choke on even more characters than those. I identified them manually by running the parser on my whole corpus, which led to this:

    document.replaceAll("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010\\u3011\\u300A\\u166D\\u200C\\u202A\\u202C\\u2049\\u20E3\\u300B\\u300C\\u3030\\u065F\\u0099\\u0F3A\\u0F3B\\uF610\\uFFFC]", "");
    

    So right now I am running two replaceAll() commands before handing the document to the parser. The complete code snippet is:

    // remove invalid unicode characters
    String tmpDoc1 = document.replaceAll("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]", "");
    // remove other unicode characters CoreNLP can't handle
    String tmpDoc2 = tmpDoc1.replaceAll("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010\\u3011\\u300A\\u166D\\u200C\\u202A\\u202C\\u2049\\u20E3\\u300B\\u300C\\u3030\\u065F\\u0099\\u0F3A\\u0F3B\\uF610\\uFFFC]", "");
    DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(tmpDoc2));
    for (List<HasWord> sentence : tokenizer) {
        List<TaggedWord> tagged = tagger.tagSentence(sentence);
        GrammaticalStructure gs = parser.predict(tagged);
        System.err.println(gs);
    }
    

    This is not necessarily a complete list of unsupported characters, though, which is why I opened an issue on GitHub.

    Please note that CoreNLP automatically removes those unsupported characters. The only reason I want to preprocess my corpus is to avoid all those error messages.
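
    As a more general alternative to enumerating offending characters in a regex, you can filter the string's code-point stream and drop everything Java itself classifies as unassigned, a surrogate, or private-use. This is only a sketch of the general technique; which categories to drop is my assumption, not a rule CoreNLP defines:

    // keep only code points that Java considers assigned and that are
    // neither surrogates nor private-use characters
    String cleaned = document.codePoints()
            .filter(cp -> Character.isDefined(cp)
                    && Character.getType(cp) != Character.SURROGATE
                    && Character.getType(cp) != Character.PRIVATE_USE)
            .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
            .toString();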

    UPDATE Nov 27th

    Christopher Manning just answered the GitHub issue I opened. There are several ways to handle those characters using the class edu.stanford.nlp.process.TokenizerFactory. Take this code example to tokenize a document:

    DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(document));
    TokenizerFactory<Word> factory = PTBTokenizer.factory();
    factory.setOptions("untokenizable=noneDelete");
    tokenizer.setTokenizerFactory(factory);

    for (List<HasWord> sentence : tokenizer) {
        // do something with the sentence
    }
    

    You can replace noneDelete in the setOptions() call with other options. I am citing Manning:

    "(...) the complete set of six options combining whether to log a warning for none, the first, or all, and whether to delete them or to include them as single character tokens in the output: noneDelete, firstDelete, allDelete, noneKeep, firstKeep, allKeep."

    That means, to keep the characters without getting all those error messages, the best option is noneKeep. This approach is far more elegant than any attempt to remove those characters yourself.
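
    A minimal variant of the tokenizer setup above using that option (only the option string changes):

    TokenizerFactory<Word> factory = PTBTokenizer.factory();
    // per Manning's quote: keep untokenizable characters as single-character
    // tokens in the output and do not log any warnings
    factory.setOptions("untokenizable=noneKeep");
    tokenizer.setTokenizerFactory(factory);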
