How can I split a text into sentences?

傲寒 2020-11-22 06:33

I have a text file. I need to get a list of sentences.

How can this be implemented? There are a lot of subtleties, such as a dot being used in abbreviations.

13 answers

长情又很酷 (original poster)

2020-11-22 06:38
For simple cases (where sentences are terminated normally), this should work:
```
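import re
with open('somefile.txt') as f:
    text = f.read()  # whole file as one string
# Split on ., ? or ! (plus closing quotes/brackets) followed by whitespace.
sentences = re.split(r' *[.?!][\'")\]]*\s+', text)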
```
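The core of the regex is ` *\. +`, which matches a period with 0 or more spaces to its left and 1 or more to its right (requiring whitespace on the right prevents something like the period in `re.split` from being counted as a sentence boundary). The full pattern also accepts `?` and `!` as terminators and allows closing quotes or brackets after them.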
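To see how a pattern of this shape behaves, here is a small sketch on an in-memory string. The sample text and the name `SENTENCE_END` are made up for illustration, and `\s+` is assumed on the right-hand side so that line breaks count as whitespace too.

```python
import re

# Same shape as the pattern above: optional spaces, a terminator, optional
# closing quotes/brackets, then at least one whitespace character.
# (Assumption: \s+ instead of a literal space, so newlines also count.)
SENTENCE_END = re.compile(r' *[.?!][\'")\]]*\s+')

sample = 'Use re.split for this. Does it work? Mostly! It breaks on e.g. abbreviations.'
parts = SENTENCE_END.split(sample)
print(parts)
# ['Use re.split for this', 'Does it work', 'Mostly', 'It breaks on e.g', 'abbreviations.']
```

Note that the terminators are consumed by the split, the period in `re.split` is left alone, and the abbreviation `e.g.` still causes a wrong split.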

This is obviously not the most robust solution, but it will do fine in most cases. The main case it does not cover is abbreviations. One possible workaround is to run through the list and check that each string in `sentences` starts with a capital letter.
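The capital-letter check could be sketched roughly as follows. This is only one possible reading of the idea, not a tested solution: the helper name `split_sentences` is hypothetical, and it uses a lookbehind so the terminators stay attached to each fragment, which makes re-joining straightforward.

```python
import re

def split_sentences(text):
    """Split after ., ? or ! followed by whitespace, then glue back any
    fragment that does not start with a capital letter (a likely sign
    that the split happened after an abbreviation)."""
    # The lookbehind keeps the terminator attached to its sentence.
    fragments = re.split(r'(?<=[.?!])\s+', text.strip())
    sentences = []
    for fragment in fragments:
        if sentences and fragment and not fragment[0].isupper():
            # Looks like a continuation: merge into the previous sentence.
            sentences[-1] += ' ' + fragment
        else:
            sentences.append(fragment)
    return sentences

print(split_sentences('It fails on e.g. abbreviations. This part is fine.'))
# ['It fails on e.g. abbreviations.', 'This part is fine.']
```

The heuristic has clear limits: an abbreviation followed by a capitalized word (as in `Dr. Smith`) is still split, and a sentence that starts with a digit or a quotation mark gets merged into the previous one.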