Question
Is there an option in Stanford CoreNLP's tokenizer to prevent tokens from containing a space?
E.g. if the sentence is "my phone is 617 1555-6644", the substring "617 1555" should be split into two different tokens.
I am aware of the option normalizeSpace:
normalizeSpace: Whether any spaces in tokens (phone numbers, fractions) get turned into U+00A0 (non-breaking space). It's dangerous to turn this off for most of our Stanford NLP software, which assumes no spaces in tokens.
but I don't want tokens to contain any space, including non-breaking space.
Answer 1:
You can try setting the tokenize.whitespace option to true, but the tokenizer will then split always and only on whitespace. For example, "it's" will no longer be tokenized as "it 's".
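As a rough illustration, the sketch below shows two things using only the Java standard library. The first method builds the properties the answer describes. The second is an alternative that is not part of the original answer: keep the default tokenizer and split any resulting token on spaces afterwards, including U+00A0. The class name TokenSplitter, both method names, and the sample token list are made up for this example, and the token list is a hand-written stand-in rather than real CoreNLP output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class TokenSplitter {

    // Properties matching the answer: tokenize on whitespace only.
    // Building the pipeline itself needs CoreNLP on the classpath and is not shown here.
    static Properties whitespaceTokenizerProps() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize");
        props.setProperty("tokenize.whitespace", "true");
        return props;
    }

    // Assumed alternative (not from the answer): post-process tokens so that none
    // contains whitespace or a non-breaking space. Java's \s does not match U+00A0
    // by default, so it is listed explicitly in the character class.
    static List<String> splitOnSpaces(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            for (String piece : token.split("[\\s\\u00A0]+")) {
                if (!piece.isEmpty()) {
                    out.add(piece);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical tokenizer output in which one token holds a non-breaking space.
        List<String> tokens = List.of("my", "phone", "is", "617\u00A01555-6644");
        System.out.println(splitOnSpaces(tokens)); // prints [my, phone, is, 617, 1555-6644]
    }
}
```

With CoreNLP available, the properties would typically be passed to new StanfordCoreNLP(props). The post-processing route keeps the default handling of contractions such as "it's", but character offsets attached to the original tokens would have to be recomputed separately.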
Source: https://stackoverflow.com/questions/36440495/preventing-tokens-from-containing-a-space-in-stanford-corenlp