How can I split a text into sentences?

傲寒 2020-11-22 06:33

I have a text file. I need to get a list of sentences.

How can this be implemented? There are a lot of subtleties, such as a dot being used in abbreviations.

13 Answers
  •  花落未央
    2020-11-22 07:00

    @Artyom,

    Hi! You could make a new tokenizer for Russian (and some other languages) using this function:

    def russianTokenizer(text):
        """Split text into word and punctuation tokens."""
        result = text
        # Pad each dot with spaces so it becomes its own token.
        result = result.replace('.', ' . ')
        # Re-join the three padded dots of an ellipsis into a single token.
        result = result.replace(' .  .  . ', ' ... ')
        # Pad the remaining punctuation marks the same way.
        for mark in (',', ':', ';', '!', '?', '"', "'", '(', ')'):
            result = result.replace(mark, ' ' + mark + ' ')
        # split() with no argument collapses any run of whitespace.
        return result.split()
    

    and then call it in this way:

    text = 'вы выполняете поиск, используя Google SSL;'
    tokens = russianTokenizer(text)
    print(tokens)
    # ['вы', 'выполняете', 'поиск', ',', 'используя', 'Google', 'SSL', ';']
    

    Good luck, Marilena.
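
The tokenizer above returns a flat list of words and punctuation marks, while the question asks for sentences. One possible way to bridge that gap is to group the tokens at sentence-ending marks and skip a dot that directly follows a known abbreviation. This is only a minimal sketch and not part of the original answer: `group_sentences`, `SENTENCE_END`, and `ABBREVIATIONS` are hypothetical names, and the abbreviation list is a small illustrative assumption that would need extending for real text.

```python
# Tokens that may end a sentence in the tokenizer's output.
SENTENCE_END = {'.', '!', '?', '...'}

# Hypothetical abbreviation list (an assumption, not from the answer above).
ABBREVIATIONS = {'г', 'ул', 'др', 'Mr', 'Dr'}


def group_sentences(tokens):
    """Group a flat token list into sentences, one token list per sentence."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token not in SENTENCE_END:
            continue
        # A dot right after a known abbreviation does not end the sentence.
        if token == '.' and len(current) >= 2 and current[-2] in ABBREVIATIONS:
            continue
        sentences.append(current)
        current = []
    if current:  # trailing tokens without a terminator
        sentences.append(current)
    return sentences


tokens = ['Dr', '.', 'Smith', 'came', '.', 'Why', '?']
print(group_sentences(tokens))
# [['Dr', '.', 'Smith', 'came', '.'], ['Why', '?']]
```

A fixed list cannot tell an abbreviation in the middle of a sentence from one that happens to end it, so this approach will still merge some sentences; it only covers the simplest case of the abbreviation problem raised in the question.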
