I am using the CoreNLP Neural Network Dependency Parser to parse some social media content. Unfortunately, the file contains characters which are, according to fileformat.info, not valid Unicode characters.
In a way, both answers provided by Mukesh Kumar and GsusRecovery are helpful, but not fully correct.
document.replaceAll("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]", "");
seems to replace all invalid characters, but there are even more characters that CoreNLP cannot handle. I identified them manually by running the parser on my whole corpus, which led to this:
document.replaceAll("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010\\u3011\\u300A\\u166D\\u200C\\u202A\\u202C\\u2049\\u20E3\\u300B\\u300C\\u3030\\u065F\\u0099\\u0F3A\\u0F3B\\uF610\\uFFFC]", "");
So right now I am running two replaceAll() commands before handing the document to the parser. The complete code snippet is:
// remove invalid unicode characters
String tmpDoc1 = document.replaceAll("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]", "");
// remove other unicode characters CoreNLP can't handle
String tmpDoc2 = tmpDoc1.replaceAll("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010\\u3011\\u300A\\u166D\\u200C\\u202A\\u202C\\u2049\\u20E3\\u300B\\u300C\\u3030\\u065F\\u0099\\u0F3A\\u0F3B\\uF610\\uFFFC]", "");

// tagger is a MaxentTagger and parser a DependencyParser, both initialized elsewhere
DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(tmpDoc2));
for (List<HasWord> sentence : tokenizer) {
    List<TaggedWord> tagged = tagger.tagSentence(sentence);
    GrammaticalStructure gs = parser.predict(tagged);
    System.err.println(gs);
}
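As a side note, the two passes can also be folded into a single replaceAll() by joining the two character classes with an alternation. This is just a sketch (the variable name is illustrative) that should be equivalent to the two calls above:
// sketch: one-pass variant that strips the invalid characters and the extra ones CoreNLP can't handle
String cleaned = document.replaceAll(
        "[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]"
        + "|[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010\\u3011\\u300A\\u166D\\u200C\\u202A\\u202C\\u2049\\u20E3\\u300B\\u300C\\u3030\\u065F\\u0099\\u0F3A\\u0F3B\\uF610\\uFFFC]",
        "");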
This is not necessarily a complete list of unsupported characters, though, which is why I opened an issue on GitHub.
Please note that CoreNLP automatically removes those unsupported characters. The only reason I want to preprocess my corpus is to avoid all those error messages.
UPDATE Nov 27th
Christopher Manning just answered the GitHub issue I opened. There are several ways to handle those characters using the class edu.stanford.nlp.process.TokenizerFactory. Take this code example to tokenize a document:
DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(document));
TokenizerFactory<? extends HasWord> factory = PTBTokenizer.factory();
factory.setOptions("untokenizable=noneDelete");
tokenizer.setTokenizerFactory(factory);

for (List<HasWord> sentence : tokenizer) {
    // do something with the sentence
}
You can replace noneDelete in the setOptions() call with other options. Citing Manning:
"(...) the complete set of six options combining whether to log a warning for none, the first, or all, and whether to delete them or to include them as single character tokens in the output: noneDelete, firstDelete, allDelete, noneKeep, firstKeep, allKeep."
That means the best way to keep the characters without getting all those error messages is to use the option noneKeep. This approach is far more elegant than any attempt to remove those characters.
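For completeness, here is a sketch of how I would combine that with the tagging and parsing loop from above, using noneKeep. Imports are listed for reference, and tagger / parser are assumed to be the same MaxentTagger and DependencyParser instances as in the first snippet:
import java.io.StringReader;
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.TokenizerFactory;
import edu.stanford.nlp.trees.GrammaticalStructure;

// ...

DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(document));
// keep untokenizable characters as single-character tokens, without logging warnings
TokenizerFactory<? extends HasWord> factory = PTBTokenizer.factory();
factory.setOptions("untokenizable=noneKeep");
tokenizer.setTokenizerFactory(factory);

for (List<HasWord> sentence : tokenizer) {
    List<TaggedWord> tagged = tagger.tagSentence(sentence); // tagger: MaxentTagger
    GrammaticalStructure gs = parser.predict(tagged);        // parser: nndep DependencyParser
    System.err.println(gs);
}
With this setup, the untokenizable characters simply show up as single-character tokens instead of being dropped, and no warnings are logged for them.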