Question
I have 2 sentences in my dataset:
w1 = "I am Pusheen the cat.I am so cute."   # no space after period
w2 = "I am Pusheen the cat. I am so cute."  # with space after period
When I use the NLTK tokenizers (both word and sentence), NLTK cannot split cat.I at the period.
Here is the word tokenizer:
>>> nltk.word_tokenize(w1, 'english')
['I', 'am', 'Pusheen', 'the', 'cat.I', 'am', 'so', 'cute']
>>> nltk.word_tokenize(w2, 'english')
['I', 'am', 'Pusheen', 'the', 'cat', '.', 'I', 'am', 'so', 'cute']
and the sentence tokenizer:
>>> nltk.sent_tokenize(w1, 'english')
['I am Pusheen the cat.I am so cute']
>>> nltk.sent_tokenize(w2, 'english')
['I am Pusheen the cat.', 'I am so cute']
How can I fix this, i.e. make NLTK tokenize w1 the same way as w2? In my dataset, words and punctuation are sometimes stuck together.
Update: I tried Stanford CoreNLP 3.7.0, and it also fails to split 'cat.I' into 'cat', '.', 'I':
meow@meow-server:~/projects/stanfordcorenlp$ java edu.stanford.nlp.process.PTBTokenizer sample.txt
I
am
Pusheen
the
cat.I
am
so
cute
.
PTBTokenizer tokenized 9 tokens at 111.21 tokens per second.
Answer 1:
It's implemented this way on purpose: a period with no space after it usually doesn't signify the end of a sentence (think about the periods in phrases such as "version 4.3", "i.e.", "A.M.", etc.). If sentence endings with no space after the full stop are common in your corpus, you'll have to preprocess the text with a regular expression or some such before sending it to NLTK.
A good rule of thumb is that a lowercase letter followed by a period followed by an uppercase letter usually signifies the end of a sentence. To insert a space after the period in such cases, you could use a regular expression, e.g.
import re
w1 = re.sub(r'([a-z])\.([A-Z])', r'\1. \2', w1)
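To see what this rule does and does not touch, here is a minimal sketch that wraps the same regular expression in a helper and runs it on a few strings. The function name add_space_after_period and the third sample sentence are made up for illustration; only the pattern itself comes from the answer above.

```python
import re

# Lowercase letter, period, uppercase letter: the heuristic for a sentence
# boundary that is missing its space.
MISSING_SPACE = re.compile(r'([a-z])\.([A-Z])')

def add_space_after_period(text):
    """Insert a space after a period wedged between a lowercase and an uppercase letter."""
    return MISSING_SPACE.sub(r'\1. \2', text)

samples = [
    "I am Pusheen the cat.I am so cute.",            # missing space: gets fixed
    "I am Pusheen the cat. I am so cute.",           # already fine: unchanged
    "Upgrade to version 4.3 before 10 A.M. today.",  # hypothetical sample: unchanged
]
for s in samples:
    print(add_space_after_period(s))
```

The cleaned string can then be passed to nltk.word_tokenize or nltk.sent_tokenize as in the question. Note that the heuristic will also split things that are not sentence boundaries, such as a dotted identifier like obj.Method, so it is worth checking against a sample of the real corpus first.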
Source: https://stackoverflow.com/questions/44858741/nltk-tokenizer-and-stanford-corenlp-tokenizer-cannot-distinct-2-sentences-withou