How can I split a text into sentences?

傲寒 2020-11-22 06:33

I have a text file. I need to get a list of sentences.

How can this be implemented? There are a lot of subtleties, such as a dot being used in abbreviations.

13 answers

北恋 (original poster)

2020-11-22 06:46
Also, be wary of additional top-level domains that aren't included in some of the answers above.

For example, .info, .biz, .ru, and .online will trip up some sentence parsers but aren't included above.

Here's some info on the frequency of top-level domains: https://www.westhost.com/blog/the-most-popular-top-level-domains-in-2017/

That could be addressed by editing the code above to read:
```
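# Regular-expression fragments; raw strings keep "\s" from being read as a string escape.
alphabets = r"([A-Za-z])"
prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
# Extended with ru, info, biz and online; the dot in co.uk is escaped so it matches only a literal dot.
websites = r"[.](com|net|org|io|gov|ai|edu|co[.]uk|ru|info|biz|online)"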
```
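To show why the domain list matters, here is a minimal sketch of how a pattern like `websites` might be used. It is not the code from the other answers. The helper names `protect_domain_dots` and `split_sentences`, the `<prd>` and `<stop>` placeholder tokens, and the trailing `\b` are all assumptions made for illustration. The idea is to hide the dots that belong to a domain, treat every remaining `.`, `!` or `?` as a sentence end, and then restore the hidden dots.

```
import re

# Hypothetical placeholder tokens; any strings that cannot occur in the text would do.
PRD = "<prd>"
STOP = "<stop>"

# TLD list from the post. The trailing \b is an added assumption so that
# ".com" does not match inside a longer word such as ".complete".
websites = r"[.](com|net|org|io|gov|ai|edu|co[.]uk|ru|info|biz|online)\b"

def protect_domain_dots(text):
    """Replace the dots of a known TLD with a placeholder so they are not split on."""
    return re.sub(websites, lambda m: PRD + m.group(1).replace(".", PRD), text)

def split_sentences(text):
    """Very simplified splitter: every unprotected '.', '!' or '?' ends a sentence."""
    protected = protect_domain_dots(text)
    marked = re.sub(r"([.!?])", r"\1" + STOP, protected)
    parts = [p.strip().replace(PRD, ".") for p in marked.split(STOP)]
    return [p for p in parts if p]

print(split_sentences("Visit example.info today. Then try shop.co.uk!"))
```

With `info` missing from the list, this sketch would cut the first sentence in two at `example.`, which is the failure described above. Abbreviations, acronyms and the other cases from the question are deliberately left out.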