Question
Is there a way to provide the PTBTokenizer with a set of delimiter characters to split tokens on?
I was testing the behaviour of this tokenizer, and I've realized that for some characters, like the vertical bar '|', the tokenizer divides a substring into two tokens, while for others, like the slash or the hyphen, it returns a single token.
Answer 1:
There's not any simple way to do this with the PTBTokenizer, no. You can do some pre-processing and post-processing to get what you want, though there are two concerns worth mentioning:
- All models distributed with CoreNLP are trained on the standard tokenizer behavior. If you change how the input to these later components is tokenized, there's no guarantee that those components will work predictably.
- If you do enough pre- and post-processing (and aren't using any later components as mentioned in #1), it may be simpler to just steal the PTBTokenizer implementation and write your own.
(There is a similar question on customizing apostrophe tokenization behavior: Stanford coreNLP - split words ignoring apostrophe.)
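The post-processing route mentioned above can be sketched as follows: run the standard tokenizer first, then further split each resulting token on your extra delimiter characters, keeping the delimiters as tokens of their own (mirroring how PTBTokenizer already treats '|'). This is a minimal, hypothetical sketch in Python; the function name and the choice of '/' as an extra delimiter are illustrative assumptions, not part of CoreNLP.

```python
import re

def split_on_extra_delims(tokens, delims):
    """Post-process a token list: split each token on the given delimiter
    characters, emitting each delimiter as its own token.

    tokens -- list of strings, e.g. output of a standard tokenizer
    delims -- string of characters to additionally treat as delimiters
    """
    # Capturing group keeps the delimiters in the result of re.split.
    pattern = re.compile("([" + re.escape(delims) + "])")
    out = []
    for tok in tokens:
        # re.split produces empty strings at the edges; drop them.
        out.extend(piece for piece in pattern.split(tok) if piece)
    return out

# Hypothetical example: the tokenizer returned "high/low" as one token,
# but we also want to split on '/':
print(split_on_extra_delims(["high/low", "a|b"], delims="/|"))
# → ['high', '/', 'low', 'a', '|', 'b']
```

Note that the same caveat from #1 applies: tokens produced this way may confuse downstream CoreNLP components trained on the standard tokenization.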
Source: https://stackoverflow.com/questions/29229780/stanford-ptbtokenizer-tokens-split-delimiter