I am using nltk to split a text into sentence units. However, I need the sentences that contain quotes to be extracted as a single unit. Right now each sentence, even if it is w
Just change your print statement to this:
print(' '.join(tokenizer.tokenize(text, realign_boundaries=True)))
This will join the sentences with a space instead of \n-----\n.
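To see the effect of the separator in isolation, here is a minimal sketch. The `sentences` list is a hard-coded stand-in for whatever the tokenizer returns (an assumption, so the snippet runs without NLTK), and the `\n-----\n` separator is assumed to be the one used in the original print statement.

```python
# Stand-in for the list returned by tokenizer.tokenize(...); hard-coded here
# so the sketch runs without NLTK or its Punkt model installed.
sentences = ['"This is a sentence.', 'This is also a sentence," said the cat.']

# Separator assumed to be the one the original print statement used.
separated = '\n-----\n'.join(sentences)

# Joining with a single space puts the pieces back on one line.
joined = ' '.join(sentences)

print(separated)
print(joined)
```

Note that joining with a space only changes how the pieces are displayed; the tokenizer still returns them as separate list items.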
If I understand the problem correctly, then this regex should do it:
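import re
text = '"This is a sentence. This is also a sentence," said the cat.'
# finditer + group(0) yields the whole match, whichever alternative matched;
# findall would return only the capture groups, which are empty when the
# first alternative matches.
for m in re.finditer(r'"[^"]*\."|("[^"]*")*([^".]*\.)', text):
    print(m.group(0))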
The pattern is two alternatives joined with |. The first matches a sentence that is entirely quoted and ends with a period just inside the closing quote. The second matches an ordinary sentence, optionally preceded by one or more quotations, up to the next period. More complex sentences may need further adjustment.
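As a rough sketch of how the pattern behaves on a few inputs, the snippet below wraps it in a helper. The name `split_sentences` and the sample strings other than the first are made up for illustration. It uses `re.finditer` with `group(0)` so that a match from either alternative is returned whole, and strips the whitespace that the second alternative picks up between sentences.

```python
import re

# Same pattern as in the answer: a quoted sentence ending in '."', or any
# number of quotations followed by text that runs up to the next period.
SENTENCE_RE = re.compile(r'"[^"]*\."|("[^"]*")*([^".]*\.)')

def split_sentences(text):
    """Return each full match, stripped of surrounding whitespace."""
    return [m.group(0).strip() for m in SENTENCE_RE.finditer(text)]

examples = [
    '"This is a sentence. This is also a sentence," said the cat.',
    'Plain one. Plain two.',
    '"Fully quoted."',
]
for example in examples:
    print(split_sentences(example))
```

This only handles periods as sentence terminators and straight double quotes; question marks, exclamation marks, abbreviations, and nested or curly quotes are outside what the pattern covers.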