Splitting sentences with nltk while preserving quotes

囚心锁ツ 2021-02-15 13:50

I am using nltk to split a text into sentence units. However, I need the sentences that contain quotes to be extracted as a single unit. Right now each sentence, even if it is w

2 Answers
  • 2021-02-15 14:34

    Just change your print statement to this:

    print(' '.join(tokenizer.tokenize(text, realign_boundaries=True)))
    

    This will join the sentences with a space instead of \n-----\n.

  • 2021-02-15 14:42

    If I understand the problem correctly, then this regex should do it:

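    import re
    
    text = '"This is a sentence. This is also a sentence," said the cat.'
    
    # finditer + group(0) yields the whole match for either alternative;
    # findall would return only the capture groups, which are empty
    # whenever the first alternative is the one that matches.
    for m in re.finditer(r'"[^"]*\."|("[^"]*")*([^".]*\.)', text):
        print(m.group(0))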
    

    The pattern is two alternatives joined with |. The first matches a quoted passage whose final period sits just inside the closing quote. The second matches an ordinary sentence, optionally preceded by a quotation, up to the next period. More complex sentences (other end punctuation, nested quotes, abbreviations) may need further adjustment.
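As a rough check of how the regex in this answer behaves, here is a minimal sketch that runs it over a few sample strings. The helper name `split_keeping_quotes` and the sample texts are made up for illustration and are not part of the original answer. It uses `re.finditer` with `m.group(0)` so that the full match is returned whichever alternative matched, since `re.findall` returns only the capture groups.

```python
import re

# Pattern taken from the answer: a quoted passage ending in '."',
# or optional quoted spans followed by text up to the next period.
PATTERN = re.compile(r'"[^"]*\."|("[^"]*")*([^".]*\.)')

def split_keeping_quotes(text):
    """Return each matched unit with surrounding whitespace trimmed.

    Hypothetical helper for illustration only.
    """
    return [m.group(0).strip() for m in PATTERN.finditer(text)]

samples = [
    '"This is a sentence. This is also a sentence," said the cat.',
    'He left. "I will be back. Wait here."',
    'It rained. We stayed in.',
]

for sample in samples:
    print(split_keeping_quotes(sample))
```

The first sample should come back as a single unit, while the other two should each split into two units. The pattern only treats a period as a sentence end, so text ending in `?` or `!`, or containing abbreviations such as "Dr.", would probably need a different approach.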
