How can I split a text into sentences?

傲寒 2020-11-22 06:33

I have a text file. I need to get a list of sentences.

How can this be implemented? There are a lot of subtleties, such as a dot being used in abbreviations.

13 answers
  •  北恋 (original poster)
    2020-11-22 06:46

    Also, be wary of additional top-level domains that aren't included in some of the answers above.

    For example, .info, .biz, .ru and .online will trip up some sentence parsers but aren't included above.

    Here's some info on the frequency of top-level domains: https://www.westhost.com/blog/the-most-popular-top-level-domains-in-2017/

    That could be addressed by editing the code above to read:

    # Raw strings keep \s intact for the regex engine; the dot in co.uk is escaped.
    alphabets = r"([A-Za-z])"
    prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]"
    suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
    starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
    acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
    websites = r"[.](com|net|org|io|gov|ai|edu|co[.]uk|ru|info|biz|online)"
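
To show why the extra domains matter, here is a minimal sketch of how a `websites` pattern like this one might be used. It assumes a placeholder-based splitter: dots that belong to a known domain are hidden first, every remaining `.`, `!` or `?` is treated as a sentence end, and the hidden dots are restored afterwards. The names `split_sentences`, `PRD` and `STOP` are made up for this illustration, and the trailing `\b` is an addition so that `.co` inside a longer word is not mistaken for a domain. The dot in `co.uk` is escaped so it matches only a literal dot.

```python
import re

# Known top-level domains; the \b keeps ".com" from matching inside ".company".
WEBSITES = r"[.](com|net|org|io|gov|ai|edu|co[.]uk|ru|info|biz|online)\b"

PRD = "<prd>"    # hypothetical placeholder for a dot that must not end a sentence
STOP = "<stop>"  # hypothetical marker for a sentence boundary


def split_sentences(text):
    """Split text into sentences, keeping dots inside known domains intact."""
    # 1. Hide every dot that belongs to a known domain suffix (both dots in co.uk).
    protected = re.sub(
        WEBSITES,
        lambda m: PRD + m.group(1).replace(".", PRD),
        text,
    )
    # 2. Treat every remaining terminator as a sentence boundary.
    marked = re.sub(r"([.!?])", r"\1" + STOP, protected)
    # 3. Restore the hidden dots and cut at the boundary markers.
    restored = marked.replace(PRD, ".")
    return [s.strip() for s in restored.split(STOP) if s.strip()]


if __name__ == "__main__":
    sample = "Visit example.info today. Mail goes to site.co.uk. Done!"
    for sentence in split_sentences(sample):
        print(sentence)
```

With `.info` in the list, `example.info` stays in one piece. A domain that is not listed (say `example.xyz`) would still be cut at the dot, which is exactly the failure described above. This sketch handles only domains; abbreviations, acronyms and the other patterns need the same kind of protection.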
    
