Question
Is there a way to provide the PTBTokenizer with a set of delimiter characters to split tokens on?
I was testing the behaviour of this tokenizer, and I've realized that for some characters, like the vertical bar '|', the tokenizer divides a substring into two tokens, while for others, like the slash or the hyphen, it returns a single token.
Answer 1:
There's not any simple way to do this with the PTBTokenizer, no. You can do some pre-processing and post-processing to get what you want, though there are two concerns worth mentioning:
- All models distributed with CoreNLP are trained on the standard tokenizer behavior. If you change how the input to these later components is tokenized, there's no guarantee that those components will work predictably.
- If you do enough pre- and post-processing (and aren't using any later components as mentioned in #1), it may be simpler to just steal the PTBTokenizer implementation and write your own.
(There is a similar question on customizing apostrophe tokenization behavior: Stanford coreNLP - split words ignoring apostrophe.)
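The post-processing route mentioned above can be sketched as follows: run the standard tokenizer first, then further split each resulting token on your extra delimiter characters, keeping the delimiters as tokens of their own (mirroring how PTBTokenizer already treats '|'). This is a minimal, hypothetical sketch in Python; the function name and the choice of '/' as an extra delimiter are illustrative assumptions, not part of CoreNLP.

```python
import re

def split_on_extra_delims(tokens, delims):
    """Post-process a token list: split each token on the given delimiter
    characters, emitting each delimiter as its own token.

    tokens -- list of strings, e.g. output of a standard tokenizer
    delims -- string of characters to additionally treat as delimiters
    """
    # Capturing group keeps the delimiters in the result of re.split.
    pattern = re.compile("([" + re.escape(delims) + "])")
    out = []
    for tok in tokens:
        # re.split produces empty strings at the edges; drop them.
        out.extend(piece for piece in pattern.split(tok) if piece)
    return out

# Hypothetical example: the tokenizer returned "high/low" as one token,
# but we also want to split on '/':
print(split_on_extra_delims(["high/low", "a|b"], delims="/|"))
# → ['high', '/', 'low', 'a', '|', 'b']
```

Note that the same caveat from #1 applies: tokens produced this way may confuse downstream CoreNLP components trained on the standard tokenization.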
Source: https://stackoverflow.com/questions/29229780/stanford-ptbtokenizer-tokens-split-delimiter