Preventing tokens from containing a space in Stanford CoreNLP

Question

Is there an option in Stanford CoreNLP's tokenizer to prevent tokens from containing a space?

E.g. if the sentence is "my phone is 617 1555-6644", the substring "617 1555" should be split into two different tokens.

I am aware of the option normalizeSpace:

normalizeSpace: Whether any spaces in tokens (phone numbers, fractions) get turned into U+00A0 (non-breaking space). It's dangerous to turn this off for most of our Stanford NLP software, which assumes no spaces in tokens.

but I don't want tokens to contain any space, including non-breaking space.

Answer 1:

You can try setting the tokenize.whitespace option to true, but the tokenizer will then split on whitespace and on nothing else. For example, "it's" will no longer be tokenized as "it 's".
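As a rough sketch of what this looks like in practice, the Java snippet below builds the properties one would presumably pass to new StanfordCoreNLP(props). It uses only the JDK so that it runs without CoreNLP on the classpath. The class name WhitespaceTokenizeSketch and the helper splitOnWhitespace are hypothetical; the helper is a plain-Java stand-in that mimics whitespace-only splitting to show the trade-off, not CoreNLP's own tokenizer. Treating U+00A0 as a separator in the stand-in is an assumption made to match the question, not confirmed CoreNLP behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class WhitespaceTokenizeSketch {

    // Properties intended for a CoreNLP pipeline, e.g. new StanfordCoreNLP(props).
    // Only the JDK is used here; the pipeline itself is not constructed.
    static Properties whitespaceTokenizerProps() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize");
        props.setProperty("tokenize.whitespace", "true");
        return props;
    }

    // Hypothetical stand-in (NOT CoreNLP): splits on runs of whitespace,
    // also treating U+00A0 (non-breaking space) as a separator (assumption).
    static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[\\s\\u00A0]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(whitespaceTokenizerProps().getProperty("tokenize.whitespace"));
        // "617" and "1555-6644" come out as separate tokens,
        // but "it's" stays in one piece instead of becoming "it" + "'s".
        System.out.println(splitOnWhitespace("my phone is 617 1555-6644 and it's new"));
    }
}
```

The stand-in makes the cost visible: no token contains a space, but clitics and punctuation stay attached to neighboring words, which may matter for downstream annotators that expect Penn Treebank style tokens.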

Source: https://stackoverflow.com/questions/36440495/preventing-tokens-from-containing-a-space-in-stanford-corenlp

Tags

nlp

stanford-nlp

tokenize
