NLTK tokenizer and Stanford CoreNLP tokenizer cannot distinguish 2 sentences without a space after the period (.)

Question

I have 2 sentences in my dataset:

w1 = "I am Pusheen the cat.I am so cute."   # no space after period
w2 = "I am Pusheen the cat. I am so cute."  # with space after period

When I use the NLTK tokenizers (both word and sentence), NLTK does not split cat.I at the period.

Here is the word tokenizer output:

>>> nltk.word_tokenize(w1, 'english')
['I', 'am', 'Pusheen', 'the', 'cat.I', 'am', 'so', 'cute']
>>> nltk.word_tokenize(w2, 'english')
['I', 'am', 'Pusheen', 'the', 'cat', '.', 'I', 'am', 'so', 'cute']

and the sentence tokenizer output:

>>> nltk.sent_tokenize(w1, 'english')
['I am Pusheen the cat.I am so cute']
>>> nltk.sent_tokenize(w2, 'english')
['I am Pusheen the cat.', 'I am so cute']

How can I fix this, i.e. make NLTK tokenize w1 the same way as w2? In my dataset, words and punctuation are sometimes stuck together.

Update: I tried Stanford CoreNLP 3.7.0, and it also does not split 'cat.I' into 'cat', '.', 'I':

meow@meow-server:~/projects/stanfordcorenlp$ java edu.stanford.nlp.process.PTBTokenizer sample.txt
I
am
Pusheen
the
cat.I
am
so
cute
.
PTBTokenizer tokenized 9 tokens at 111.21 tokens per second.

Answer 1:

It's implemented this way on purpose: a period with no space after it usually doesn't signify the end of a sentence (think about the periods in phrases such as "version 4.3", "i.e.", "A.M.", etc.). If sentence ends with no space after the full stop are a common occurrence in your corpus, you'll have to preprocess the text, for example with a regular expression, before sending it to NLTK.

A good rule of thumb is that a lowercase letter followed by a period followed by an uppercase letter usually signifies the end of a sentence. To insert a space after the period in such cases, you could use a regular expression, e.g.

import re
w1 = re.sub(r'([a-z])\.([A-Z])', r'\1. \2', w1)
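To see how this rule of thumb behaves on the cases mentioned above, here is a minimal sketch that wraps the same regular expression in a helper and runs it over a few samples. The function name add_space_after_period and the third sample sentence are made up for illustration; only the pattern itself comes from the answer. It uses nothing but the standard re module.

```python
import re

# Lowercase letter, period, uppercase letter: treated here as a missing
# space at a sentence boundary. This is a heuristic, not a guarantee.
_MISSING_SPACE = re.compile(r'([a-z])\.([A-Z])')


def add_space_after_period(text):
    """Insert a space after a period squeezed between a lowercase and an uppercase letter."""
    return _MISSING_SPACE.sub(r'\1. \2', text)


samples = [
    "I am Pusheen the cat.I am so cute.",                # no space: gets fixed
    "I am Pusheen the cat. I am so cute.",               # already fine: unchanged
    "See version 4.3, i.e. the one released at 9 A.M.",  # hypothetical sample: unchanged
]

for s in samples:
    print(add_space_after_period(s))
```

The cleaned string can then be passed to nltk.word_tokenize or nltk.sent_tokenize as in the question. Note that the pattern will also split things that merely look like a boundary, such as a lowercase token followed by ".Name" inside an identifier or a URL-like string, so it is worth checking it against a sample of your own data first.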

Source: https://stackoverflow.com/questions/44858741/nltk-tokenizer-and-stanford-corenlp-tokenizer-cannot-distinct-2-sentences-withou

Tags

python

nlp

nltk

stanford-nlp

tokenize