How can I split a text into sentences?

傲寒 2020-11-22 06:33

I have a text file. I need to get a list of sentences.

How can this be implemented? There are a lot of subtleties, such as a dot being used in abbreviations.

13 answers
  •  长情又很酷
    2020-11-22 06:38

    For simple cases (where sentences are terminated normally), this should work:

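    import re

    # Read the whole file; the with block closes it afterwards.
    with open('somefile.txt') as f:
        text = f.read()
    # Split on ., ? or !, any closing quotes/brackets, then whitespace.
    sentences = re.split(r' *[.?!][\'")\]]*\s+', text)

    The regex matches a sentence terminator (., ? or !), optionally followed by closing quotes or brackets, with 0 or more spaces to the left and 1 or more whitespace characters to the right. Requiring whitespace on the right prevents something like the period in re.split from being counted as a sentence boundary. Note that re.split drops the matched punctuation from the resulting sentences.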
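To see the effect of requiring whitespace on the right, here is a minimal sketch that runs the split on an in-memory string instead of a file. The names SENTENCE_END and sample are made up for illustration, and the pattern uses \s+ on the right so that line breaks also count as a boundary.

```python
import re

# Terminator (., ? or !), optional closing quotes/brackets, then required whitespace.
# SENTENCE_END and sample are illustrative names, not part of the original answer.
SENTENCE_END = re.compile(r' *[.?!][\'")\]]*\s+')

sample = 'Call re.split on the text. Does it work? "Yes!" It does.'
sentences = SENTENCE_END.split(sample)

print(sentences)
# ['Call re.split on the text', 'Does it work', '"Yes', 'It does.']
```

The period inside re.split is left alone because no whitespace follows it. The matched punctuation is consumed by the split, and the final period survives only because nothing comes after it.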

    This is obviously not the most robust solution, but it will do fine in most cases. The main case it won't cover is abbreviations, where a period is followed by a space without ending the sentence (perhaps run through the list of sentences and check that each string starts with a capital letter?).
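The capital-letter check could be sketched roughly as follows. This is only one possible reading of the idea: split_sentences and BOUNDARY are hypothetical names, and the split here uses a lookbehind so the punctuation is kept, which differs from the regex in the answer.

```python
import re

# Split on whitespace that follows ., ? or !; the lookbehind keeps the punctuation.
# BOUNDARY and split_sentences are hypothetical names used for this sketch.
BOUNDARY = re.compile(r'(?<=[.?!])\s+')

def split_sentences(text):
    """Split text, then rejoin fragments that do not start with a capital letter."""
    merged = []
    for fragment in BOUNDARY.split(text.strip()):
        if not fragment:
            continue
        if merged and not fragment[0].isupper():
            # Probably a false split after an abbreviation; glue it back on.
            merged[-1] = merged[-1] + ' ' + fragment
        else:
            merged.append(fragment)
    return merged

text = 'We sell fruit, e.g. apples and pears. Prices vary. Dr. Smith agrees.'
print(split_sentences(text))
# ['We sell fruit, e.g. apples and pears.', 'Prices vary.', 'Dr.', 'Smith agrees.']
```

The heuristic repairs the split after "e.g." because the next fragment starts in lowercase, but "Dr. Smith" is still cut in two since a capitalized word follows the abbreviation. Handling that case would need something more than a capital-letter check, for example a list of known abbreviations.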
