How can I split a text into sentences?

傲寒 2020-11-22 06:33

I have a text file. I need to get a list of sentences.

How can this be implemented? There are a lot of subtleties, such as a dot being used in abbreviations.

13 answers
  •  北海茫月
    2020-11-22 06:40

    I had to read subtitle files and split them into sentences. After pre-processing (removing the timing information, etc., from the .srt files), the variable fullFile contained the full text of the subtitle file. The crude approach below split it into sentences neatly. I was probably lucky that the sentences always ended (correctly) with a space. Try this first, and if you hit exceptions, add more checks.

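    import re

    # Very approximate way to split the text into sentences: break after "?", "." or "!" followed by a space.
    # fullFile holds the pre-processed text. "<BR>" is only a marker; any string that never occurs in the text works.
    fullFile = re.sub(r"([!?.]) ", r"\1<BR>", fullFile)
    sentences = fullFile.split("<BR>")
    # Write one sentence per line; the with-block closes the file even on errors.
    with open("./sentences.out", "w") as sentFile:
        for line in sentences:
            sentFile.write(line)
            sentFile.write("\n")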
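
    For reference, the pre-processing step mentioned above (dropping the timing information from an .srt file) could look something like the sketch below. It assumes the usual .srt layout of a cue number, a "start --> end" timing line and then the subtitle text. The name srt_to_text is a hypothetical helper and not part of the original script.

```python
import re

# Matches an .srt timing line such as "00:00:01,000 --> 00:00:03,000".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}")


def srt_to_text(srt_content):
    """Hypothetical helper: return only the spoken text of an .srt file.

    Blank lines, cue numbers and timing lines are dropped; the remaining
    lines are joined with single spaces. Caveat: a subtitle line made up
    only of digits is mistaken for a cue number and dropped as well.
    """
    kept = []
    for raw in srt_content.splitlines():
        line = raw.strip()
        if not line or line.isdigit() or TIMING.match(line):
            continue
        kept.append(line)
    return " ".join(kept)
```

    The returned string can then play the role of fullFile in the splitting step.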
    

    Oh well! I now realize that, since my content was Spanish, I did not have to deal with cases like "Mr. Smith". Still, if someone wants a quick and dirty parser...
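
    If abbreviations such as "Mr." do show up, one of the extra checks could be a small abbreviation list. This is only a sketch, not the original approach: the ABBREVIATIONS set and the split_sentences name are assumptions made for illustration, and the list is deliberately tiny.

```python
# Assumed, deliberately small abbreviation list (lowercase); extend it for real text.
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr.", "prof.", "sr.", "sra."}


def split_sentences(text):
    """Split on whitespace-separated tokens ending in '.', '!' or '?',
    unless the token is a known abbreviation.

    Runs of whitespace are collapsed to single spaces in the output.
    """
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        ends_sentence = token.endswith((".", "!", "?"))
        if ends_sentence and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    # Keep any trailing text that has no closing punctuation.
    if current:
        sentences.append(" ".join(current))
    return sentences
```

    This still misses plenty of cases (initials, decimals, quotes after the full stop), so for anything serious a trained sentence tokenizer is likely the better choice.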
