How to set delimiters for PTB tokenizer?

Question

I'm using the Stanford CoreNLP library for my project. It uses the PTB tokenizer for tokenization. For a statement like this:

go to room no. #2145

or

go to room no. *2145

the tokenizer splits #2145 into two tokens: # and 2145. Is there any way to configure the tokenizer so that it doesn't treat # and * as delimiters?

Answer 1:

A quick solution is to use this option:

(command-line) -tokenize.whitespace
(in Java code) props.setProperty("tokenize.whitespace", "true");

This will cause the tokenizer to just tokenize on white space. Do you need it to do anything other than tokenize on white space?
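To make the effect concrete, here is a minimal sketch in plain Java. It builds the same Properties object shown above and, as a stand-in for the real pipeline, splits the example sentence on runs of white space. The class and method names (WhitespaceTokenizeSketch, buildProps, whitespaceTokens) are hypothetical and exist only for illustration. The split is an approximation of what the option does, not CoreNLP's own code. In a real project you would pass the properties to new StanfordCoreNLP(props) instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical helper class for illustration; not part of Stanford CoreNLP.
class WhitespaceTokenizeSketch {

    // Builds the properties from the answer. With CoreNLP on the classpath,
    // these would be handed to: new StanfordCoreNLP(props)
    static Properties buildProps() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize");
        props.setProperty("tokenize.whitespace", "true");
        return props;
    }

    // Approximates whitespace-only tokenization using the standard library:
    // tokens are separated by runs of white space and nothing else.
    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(buildProps().getProperty("tokenize.whitespace"));
        // "#2145" and "*2145" each stay in one piece.
        System.out.println(whitespaceTokens("go to room no. #2145"));
        System.out.println(whitespaceTokens("go to room no. *2145"));
    }
}
```

One trade-off to keep in mind: when splitting only on white space, punctuation is no longer separated from the adjacent word, so "no." stays a single token and a sentence-final period stays attached to the last word.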

Source: https://stackoverflow.com/questions/32688640/how-to-set-delimiters-for-ptb-tokenizer

Tags

nlp

tokenize

stanford-nlp

stringtokenizer
