Question
I use Stanford NLP for string tokenization in my classification tool. I want to keep only meaningful words, but I also get non-word tokens (like ---, >, ., etc.) and unimportant words like am, is, to (stop words). Does anybody know a way to solve this problem?
Answer 1:
This is a very domain-specific task that we don't perform for you in CoreNLP. You should be able to make this work with a regular expression filter and a stopword filter on top of the CoreNLP tokenizer.
Here's an example list of English stopwords.
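As a rough illustration of the regex-plus-stopword approach, here is a minimal sketch that uses only the Java standard library. It assumes the tokens have already been produced by the CoreNLP tokenizer and extracted as plain strings (for example via CoreLabel.word()). The class name TokenFilter, the tiny stopword set, and the word pattern are hypothetical placeholders, not part of CoreNLP; in practice you would load a full stopword list and tune the pattern to your domain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper: filters token strings that came out of a tokenizer.
final class TokenFilter {
    // Deliberately tiny example list; replace it with a full stopword list.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("i", "am", "is", "to", "a", "the"));

    // A "word" here starts with a letter and continues with letters,
    // apostrophes, or hyphens. Tokens such as "---", ">" and "." do not match.
    private static final Pattern WORD = Pattern.compile("\\p{L}[\\p{L}'-]*");

    static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ROOT);
            // Keep the token only if it looks like a word and is not a stopword.
            if (WORD.matcher(token).matches() && !STOPWORDS.contains(lower)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens =
                Arrays.asList("I", "am", "happy", "---", ">", "to", "help", ".");
        System.out.println(filter(tokens)); // prints [happy, help]
    }
}
```

The stopword check is case-insensitive while the returned tokens keep their original casing, which may or may not be what a classifier wants; lowercase the kept tokens as well if the features should be case-folded.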
Answer 2:
For Stanford CoreNLP, there is a stopword removal annotator that removes the standard stopwords. It is not one of CoreNLP's built-in annotators, so it has to be registered as a custom annotator class. You can also define custom stopwords to suit your needs (e.g. ---, <, .).
You can see the example here:
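import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, stopword"); // "stopword" is the custom annotator registered below
props.setProperty("customAnnotatorClass.stopword", "intoxicant.analytics.coreNlp.StopwordAnnotator");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation(example); // example: the input text as a String
pipeline.annotate(document);
List<CoreLabel> tokens = document.get(CoreAnnotations.TokensAnnotation.class);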
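In the example above, "tokenize, ssplit, stopword" is the list of annotators to run; the stopword entry is the custom annotator registered through the customAnnotatorClass.stopword property.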
Hope this helps!
Source: https://stackoverflow.com/questions/30019054/text-tokenization-with-stanford-nlp-filter-unrequired-words-and-characters