Question
I'm using the Stanford CoreNLP library for my project. It uses the PTB tokenizer for tokenization. For a statement like this:
go to room no. #2145
or
go to room no. *2145
the tokenizer splits #2145 into two tokens: # and 2145. Is there any way to configure the tokenizer so that it doesn't treat # and * as delimiters?
Answer 1:
A quick solution is to use this option:
(command-line) -tokenize.whitespace
(in Java code) props.setProperty("tokenize.whitespace", "true");
This makes the tokenizer split on whitespace only, so #2145 and *2145 each stay a single token. Do you need it to do anything other than tokenize on whitespace?
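To get a feel for what whitespace-only tokenization means for the example sentences, here is a minimal sketch in plain Java. It does not call CoreNLP at all: the class WhitespaceTokenizeSketch and its tokenize helper are hypothetical names that simply mimic splitting on runs of whitespace, on the assumption that this is roughly what the tokenize.whitespace option yields.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustration only: mimics whitespace-only tokenization with String.split.
// This is NOT the CoreNLP tokenizer; names here are hypothetical.
class WhitespaceTokenizeSketch {

    // Split the text on runs of whitespace; nothing else acts as a delimiter.
    static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // '#' and '*' stay attached to the number.
        System.out.println(tokenize("go to room no. #2145")); // [go, to, room, no., #2145]
        System.out.println(tokenize("go to room no. *2145")); // [go, to, room, no., *2145]
        // The trade-off: other punctuation stays attached as well.
        System.out.println(tokenize("Is it room #2145?"));    // [Is, it, room, #2145?]
    }
}
```

The last line shows why the follow-up question matters: with whitespace-only splitting, trailing punctuation such as the question mark is no longer separated from the word it follows.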
Source: https://stackoverflow.com/questions/32688640/how-to-set-delimiters-for-ptb-tokenizer