Splitting sentences with nltk while preserving quotes

囚心锁ツ 2021-02-15 13:50

I am using nltk to split a text into sentence units. However, I need the sentences that contain quotes to be extracted as a single unit. Right now each sentence, even if it is w

2 answers
  •  终归单人心
    2021-02-15 14:42

    If I understand the problem correctly, then this regex should do it:

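    import re
    
    text = '"This is a sentence. This is also a sentence," said the cat.'
    
    # finditer + group(0) yields the whole match. findall would return only the
    # capture groups, which are empty whenever the first alternative matches.
    for match in re.finditer(r'"[^"]*\."|("[^"]*")*([^".]*\.)', text):
        print(match.group(0))
    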

    The pattern is two alternatives joined with |. The first matches a quotation whose final period sits just inside the closing quote mark. The second matches an ordinary sentence, optionally preceded by one or more quotations, up to the next period. More complex sentences may need further adjustment.
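
    To see how the two alternatives behave on more than one sentence, here is a minimal sketch that wraps the same pattern in a helper. The names split_sentences, SENTENCE_RE and sample are made up for illustration and are not part of nltk or re. It reads the whole match with group(0) instead of joining capture groups, because the first alternative has no groups and would otherwise come back as an empty string.

```python
import re

# Same pattern as in the answer above:
#   alternative 1: a quotation whose last period is right before the closing quote
#   alternative 2: zero or more quotations, then text up to the next period
SENTENCE_RE = re.compile(r'"[^"]*\."|("[^"]*")*([^".]*\.)')


def split_sentences(text):
    """Return sentence units, keeping each quotation inside a single unit.

    Hypothetical helper for illustration only.
    """
    # group(0) is the full match regardless of which alternative fired;
    # strip() drops the space left over between consecutive sentences.
    return [m.group(0).strip() for m in SENTENCE_RE.finditer(text)]


sample = ('"This is a sentence. This is also a sentence," said the cat. '
          'The dog left. "Stay here."')

for sentence in split_sentences(sample):
    print(sentence)
```

    This only treats a period as a sentence end, so text with abbreviations such as "Mr." or sentences ending in "?" or "!" would still need a different pattern.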
