How can I split a text into sentences?

傲寒 2020-11-22 06:33

I have a text file. I need to get a list of sentences.

How can this be implemented? There are a lot of subtleties, such as a dot being used in abbreviations.

13 Answers
  • 2020-11-22 06:46

    Also, be wary of additional top-level domains that aren't covered by some of the answers above.

    For example, .info, .biz, .ru, and .online will trip up some sentence parsers but aren't handled above.

    Here's some info on the frequency of top-level domains: https://www.westhost.com/blog/the-most-popular-top-level-domains-in-2017/

    That could be addressed by editing the code above to read:

    alphabets = r"([A-Za-z])"
    prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]"
    suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
    starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
    acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
    websites = r"[.](com|net|org|io|gov|ai|edu|co\.uk|ru|info|biz|online)"
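
    As a quick check (a minimal sketch; example.ru is a made-up hostname), the extended pattern now protects the dot in such domains before the splitting pass:

    import re
    websites = r"[.](com|net|org|io|gov|ai|edu|co\.uk|ru|info|biz|online)"
    text = "He worked at example.ru as an analyst. She left."
    print(re.sub(websites, "<prd>\\1", text))
    # → He worked at example<prd>ru as an analyst. She left.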
    
  • 2020-11-22 06:51

    This function can split the entire text of Huckleberry Finn into sentences in about 0.1 seconds, and it handles many of the more painful edge cases that make sentence parsing non-trivial, e.g. "Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer. He also worked at craigslist.org as a business analyst."

    # -*- coding: utf-8 -*-
    import re

    alphabets = r"([A-Za-z])"
    prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]"
    suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
    starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
    acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
    websites = r"[.](com|net|org|io|gov)"

    def split_into_sentences(text):
        # pad the text so the leading/trailing patterns can match
        text = " " + text + "  "
        text = text.replace("\n", " ")
        # protect dots that do not end a sentence by rewriting them as <prd>
        text = re.sub(prefixes, "\\1<prd>", text)
        text = re.sub(websites, "<prd>\\1", text)
        if "Ph.D" in text: text = text.replace("Ph.D.", "Ph<prd>D<prd>")
        text = re.sub(r"\s" + alphabets + "[.] ", " \\1<prd> ", text)
        text = re.sub(acronyms + " " + starters, "\\1<stop> \\2", text)
        text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>\\3<prd>", text)
        text = re.sub(alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>", text)
        text = re.sub(" " + suffixes + "[.] " + starters, " \\1<stop> \\2", text)
        text = re.sub(" " + suffixes + "[.]", " \\1<prd>", text)
        text = re.sub(" " + alphabets + "[.]", " \\1<prd>", text)
        # move terminators inside closing quotes so the quote stays with its sentence
        if "”" in text: text = text.replace(".”", "”.")
        if "\"" in text: text = text.replace(".\"", "\".")
        if "!" in text: text = text.replace("!\"", "\"!")
        if "?" in text: text = text.replace("?\"", "\"?")
        # mark the real sentence boundaries, then restore the protected dots
        text = text.replace(".", ".<stop>")
        text = text.replace("?", "?<stop>")
        text = text.replace("!", "!<stop>")
        text = text.replace("<prd>", ".")
        sentences = text.split("<stop>")
        # drop only the empty trailing chunk; unconditionally discarding the last
        # element would lose text that lacks a final terminator
        if sentences and not sentences[-1].strip():
            sentences = sentences[:-1]
        sentences = [s.strip() for s in sentences]
        return sentences
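
    For the example sentence quoted above, the function returns the two expected sentences:

    split_into_sentences("Mr. John Johnson Jr. was born in the U.S.A but earned his "
                         "Ph.D. in Israel before joining Nike Inc. as an engineer. "
                         "He also worked at craigslist.org as a business analyst.")
    # → ['Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D.
    #     in Israel before joining Nike Inc. as an engineer.',
    #    'He also worked at craigslist.org as a business analyst.']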
    
  • 2020-11-22 06:51

    No doubt NLTK is the most suitable tool for the purpose, but getting started with it can be painful (once you have it installed, though, you reap the rewards).

    So here is simple re-based code, available at http://pythonicprose.blogspot.com/2009/09/python-split-paragraph-into-sentences.html

    # split up a paragraph into sentences
    # using regular expressions

    import re


    def splitParagraphIntoSentences(paragraph):
        ''' break a paragraph into sentences
            and return a list '''
        # to split by multiple characters,
        # regular expressions are easiest (and fastest)
        sentenceEnders = re.compile('[.!?]')
        sentenceList = sentenceEnders.split(paragraph)
        return sentenceList


    if __name__ == '__main__':
        p = """This is a sentence.  This is an excited sentence! And do you think this is a question?"""

        sentences = splitParagraphIntoSentences(p)
        for s in sentences:
            print(s.strip())
    
    #output:
    #   This is a sentence
    #   This is an excited sentence
    #   And do you think this is a question
    #   (plus one empty line: re.split leaves an empty string after the final '?')
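
    If you'd rather keep the terminators attached to their sentences, one option (a minimal sketch, not part of the original post) is to split on the whitespace that follows a terminator instead, reusing the same `p`:

    import re

    # split on whitespace that follows ., ! or ? so the punctuation is kept
    print(re.split(r'(?<=[.!?])\s+', p))
    # → ['This is a sentence.', 'This is an excited sentence!',
    #    'And do you think this is a question?']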
    
  • 2020-11-22 06:56

    Here is a middle-of-the-road approach that doesn't rely on any external libraries. I use a list comprehension to exclude overlaps between abbreviations and terminators, as well as overlaps between variations on terminators, for example: '.' vs. '."'

    abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior',
                     'i.e.': 'that is', 'e.g.': 'for example', 'vs.': 'versus'}
    terminators = ['.', '!', '?']
    wrappers = ['"', "'", ')', ']', '}']
    
    
    def find_sentences(paragraph):
        end = True
        sentences = []
        while end > -1:
            end = find_sentence_end(paragraph)
            if end > -1:
                sentences.append(paragraph[end:].strip())
                paragraph = paragraph[:end]
        sentences.append(paragraph)
        sentences.reverse()
        return sentences
    
    
    def find_sentence_end(paragraph):
        [possible_endings, contraction_locations] = [[], []]
        contractions = abbreviations.keys()
        sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
        # collect every [index, length] at which a (possibly wrapped) terminator occurs
        for sentence_terminator in sentence_terminators:
            t_indices = list(find_all(paragraph, sentence_terminator))
            possible_endings.extend([[i, len(sentence_terminator)] for i in t_indices])
        # record where known abbreviations end, so their dots don't count as endings
        for contraction in contractions:
            c_indices = list(find_all(paragraph, contraction))
            contraction_locations.extend([i + len(contraction) for i in c_indices])
        possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
        # ignore the terminator that closes the paragraph itself
        if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
            max_end_start = max([pe[0] for pe in possible_endings])
            possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
        # a valid ending must be followed by a space (or extend past the end)
        possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
        end = (-1 if not len(possible_endings) else max(possible_endings))
        return end
    
    
    def find_all(a_str, sub):
        start = 0
        while True:
            start = a_str.find(sub, start)
            if start == -1:
                return
            yield start
            start += len(sub)
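
    A quick usage sketch (note that matching is case-sensitive, so abbreviations must appear exactly as the lowercase dictionary keys):

    find_sentences("The car cost $5 vs. the bike. It was worth it!")
    # → ['The car cost $5 vs. the bike.', 'It was worth it!']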
    

    I used Karl's find_all function from this entry: Find all occurrences of a substring in Python

  • 2020-11-22 07:00

    You can also use the sentence tokenization function in NLTK:

    from nltk.tokenize import sent_tokenize
    sentence = "As the most quoted English writer Shakespeare has more than his share of famous quotes.  Some Shakespare famous quotes are known for their beauty, some for their everyday truths and some for their wisdom. We often talk about Shakespeare’s quotes as things the wise Bard is saying to us but, we should remember that some of his wisest words are spoken by his biggest fools. For example, both ‘neither a borrower nor a lender be,’ and ‘to thine own self be true’ are from the foolish, garrulous and quite disreputable Polonius in Hamlet."
    
    sent_tokenize(sentence)
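
    If this raises a LookupError, the Punkt sentence model needs a one-time download first:

    import nltk
    nltk.download('punkt')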
    
  • 2020-11-22 07:00

    @Artyom,

    Hi! You could make a new tokenizer for Russian (and some other languages) using this function:

    def russianTokenizer(text):
        result = text
        # pad punctuation with spaces so it splits off as separate tokens
        result = result.replace('.', ' . ')
        result = result.replace(' .  .  . ', ' ... ')  # re-join ellipses
        result = result.replace(',', ' , ')
        result = result.replace(':', ' : ')
        result = result.replace(';', ' ; ')
        result = result.replace('!', ' ! ')
        result = result.replace('?', ' ? ')
        result = result.replace('\"', ' \" ')
        result = result.replace('\'', ' \' ')
        result = result.replace('(', ' ( ')
        result = result.replace(')', ' ) ')
        # collapse the double spaces introduced above (repeated to handle longer runs)
        result = result.replace('  ', ' ')
        result = result.replace('  ', ' ')
        result = result.replace('  ', ' ')
        result = result.replace('  ', ' ')
        result = result.strip()
        result = result.split(' ')
        return result
    

    and then call it in this way:

    text = 'вы выполняете поиск, используя Google SSL;'
    tokens = russianTokenizer(text)
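    # tokens → ['вы', 'выполняете', 'поиск', ',', 'используя', 'Google', 'SSL', ';']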
    

    Good luck, Marilena.
