Question
Very helpfully, Stanford CoreNLP 3.9.2 used to split rolled-together Spanish verbs and pronouns.
This is the 4.0.0 output:
The previous version had more .tagger files. These have not been included with the 4.0.0 distribution.
Is that the cause? Will they be added back?
Answer 1:
There are some documentation updates that still need to be made for Stanford CoreNLP 4.0.0.
A major change is that a new multi-word-token annotator has been added that makes tokenization conform to the UD standard. So the new default Spanish pipeline should run tokenize,ssplit,mwt,pos,depparse,ner. It may not be possible to run such a pipeline from the server demo at this time, as some modifications will need to be made. I can try to send you what those modifications would be soon. We will try to make a new release in early summer to handle issues like this that we missed.
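As a rough illustration of the annotator list above, here is a minimal sketch that builds the pipeline properties using only the Java standard library. The class name SpanishMwtProps and its build() helper are hypothetical, and the resource name StanfordCoreNLP-spanish.properties is an assumption about what the Spanish models jar provides on the classpath; if it is not found, only the annotator list is set.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper class, not part of CoreNLP.
class SpanishMwtProps {
    // Annotator list quoted in the answer above.
    static final String ANNOTATORS = "tokenize,ssplit,mwt,pos,depparse,ner";

    // Assumption: the Spanish models jar exposes its defaults under this classpath name.
    static final String SPANISH_DEFAULTS = "StanfordCoreNLP-spanish.properties";

    static Properties build() {
        Properties props = new Properties();
        // Load the Spanish defaults (model paths etc.) if they are on the classpath.
        try (InputStream in = SpanishMwtProps.class.getClassLoader()
                .getResourceAsStream(SPANISH_DEFAULTS)) {
            if (in != null) {
                props.load(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Override the annotator list so that mwt runs right after sentence splitting.
        props.setProperty("annotators", ANNOTATORS);
        return props;
    }

    public static void main(String[] args) {
        Properties props = build();
        System.out.println("annotators = " + props.getProperty("annotators"));
        // With the CoreNLP jars on the classpath, the next step would be:
        //   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    }
}
```

Layering the annotator list on top of the Spanish defaults, rather than starting from an empty Properties object, keeps the Spanish model paths in place; without them the pipeline would fall back to its English defaults.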
Unfortunately, it won't split the word in your example, but I think in many cases it will do the correct thing. The Spanish mwt model is just based on a large dictionary of terms, and was tuned to optimize performance on the Spanish training data.
Source: https://stackoverflow.com/questions/61540771/stanford-nlp-core-4-0-0-no-longer-splitting-verbs-and-pronouns-in-spanish