How can I split a text into sentences?

傲寒 2020-11-22 06:33

I have a text file. I need to get a list of sentences.

How can this be implemented? There are a lot of subtleties, such as a dot being used in abbreviations.

13 Answers
  • 2020-11-22 06:37

    I hope this helps with Latin, Chinese, and Arabic text:

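    import re
    
    # Sentence terminators for Latin, Chinese, and Arabic text.
    # Full-width 。！？； are restored here; \u3000 (ideographic space) is an
    # assumption, since a plain space would split after every word.
    # The first group stops a terminator from matching right after a digit or "+".
    punctuation = re.compile(r"([^\d+])([.!?;\n。！？；…\u3000؟؛]+)")
    
    with open('myData.txt', 'r', encoding="utf-8") as myFile:
        # Mark every sentence boundary with a placeholder, then split on it.
        text = punctuation.sub(r"\1\2<pad>", myFile.read())
    lines = [line.strip() for line in text.split("<pad>") if line.strip()]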
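
    To see how a pattern of this kind behaves without a file, it can be run on an in-memory string. This is only a sketch: the function name split_multilingual and the sample text are made up for illustration, and the character class assumes full-width Chinese marks.

```python
import re

# Terminators for Latin, Chinese, and Arabic text (assumed set, adjust as needed).
TERMINATORS = re.compile(r"([^\d+])([.!?;\n。！？；…؟؛]+)")

def split_multilingual(text):
    """Insert a placeholder after each run of terminators, then split on it."""
    marked = TERMINATORS.sub(r"\1\2<pad>", text)
    return [part.strip() for part in marked.split("<pad>") if part.strip()]

# Hypothetical sample; the "." in "3.5" follows a digit, so it is not a boundary.
sample = "Version 3.5 is out. 你好。今天天气很好！ هل أنت بخير؟ نعم."
result = split_multilingual(sample)
for sentence in result:
    print(sentence)
```

    Note that a sentence ending in a digit ("It happened in 1999.") is not split either, which is the price of protecting decimal numbers this way.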
    
  • 2020-11-22 06:37

    I was working on a similar task and came across this question. After following a few links and working through some NLTK exercises, the code below worked for me like magic.

    from nltk.tokenize import sent_tokenize 
      
    text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article"
    sent_tokenize(text) 
    

    output:

    ['Hello everyone.',
     'Welcome to GeeksforGeeks.',
     'You are studying NLP article']
    

    Source: https://www.geeksforgeeks.org/nlp-how-tokenizing-text-sentence-words-works/

  • 2020-11-22 06:38

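    For simple cases (where sentences are terminated normally), this should work:

    import re

    # Read the whole file into one string.
    with open('somefile.txt') as f:
        text = f.read()
    # Split on a terminator, any closing quotes or brackets, and the whitespace after them.
    sentences = re.split(r'\s*[.?!][\'")\]]*\s+', text)
    

    The regex is \s*[.?!][\'")\]]*\s+, which matches a period, question mark, or exclamation mark (optionally followed by closing quotes or brackets) with 0 or more whitespace characters to the left and 1 or more to the right (to prevent something like the period in re.split being counted as a change in sentence). Note that the terminators themselves are dropped from the resulting strings.

    Obviously, this is not the most robust solution, but it will do fine in most cases. The one case it does not cover is abbreviations (perhaps run through the list of sentences and check that each string in sentences starts with a capital letter?)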
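
    The capital-letter check suggested above could look roughly like this. It is a minimal sketch, not a tested solution: the function name split_sentences and the sample text are made up, and the split uses a lookbehind so that each sentence keeps its terminator.

```python
import re

def split_sentences(text):
    """Split after ., ? or ! plus whitespace, then re-join any fragment
    that starts with a lowercase letter onto the previous sentence."""
    fragments = [f for f in re.split(r'(?<=[.?!])\s+', text.strip()) if f]
    sentences = []
    for fragment in fragments:
        if sentences and fragment[0].islower():
            # Probably a false break after an abbreviation such as "approx."
            sentences[-1] += ' ' + fragment
        else:
            sentences.append(fragment)
    return sentences

# Hypothetical sample text.
sample = "We met at approx. five o'clock. It rained. Mr. Smith left early."
result = split_sentences(sample)
for sentence in result:
    print(sentence)
```

    This repairs the break after "approx." but still splits "Mr." from "Smith", because the word after a title is capitalised. The check only helps with abbreviations that are followed by a lowercase word.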

  • 2020-11-22 06:39

    The Natural Language Toolkit (nltk.org) has what you need. According to this group posting, the following does it:

    import nltk.data
    
    # Assumes the Punkt tokenizer data has already been downloaded.
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    with open("test.txt") as fp:
        data = fp.read()
    # print is a function in Python 3.
    print('\n-----\n'.join(tokenizer.tokenize(data)))
    

    (I haven't tried it!)

  • 2020-11-22 06:39

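    You can try using spaCy instead of a regex. I use it and it does the job.

    import spacy

    # spaCy 3 removed the 'en' shortcut; load an installed model by its full name.
    nlp = spacy.load('en_core_web_sm')
    
    text = '''Your text here'''
    doc = nlp(text)
    
    for sent in doc.sents:
        # Span.string was removed in spaCy 3; use Span.text instead.
        print(sent.text.strip())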
    
  • 2020-11-22 06:40

    I had to read subtitle files and split them into sentences. After pre-processing (removing the time information etc. from the .srt files), the variable fullFile contained the full text of the subtitle file. The crude approach below split it neatly into sentences. I was probably lucky that the sentences always ended (correctly) with a space. Try this first and, if it runs into exceptions, add more checks.

    import re

    # Very approximate way to split the text into sentences: break after ?, . and !
    # fullFile is assumed to already hold the pre-processed subtitle text.
    fullFile = re.sub(r"([!?.]) ", r"\1<BRK>", fullFile)
    sentences = fullFile.split("<BRK>")
    # Write one sentence per line; the with block closes the file.
    with open("./sentences.out", "w") as sentFile:
        for line in sentences:
            sentFile.write(line + "\n")
    

    Oh well. I now realize that, since my content was Spanish, I did not have to deal with cases like "Mr. Smith". Still, if someone wants a quick and dirty parser...
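
    For English text, one possible extension of the same <BRK> trick is to re-join a piece whenever the previous piece ends in a known abbreviation. The sketch below rests on assumptions: the ABBREVIATIONS set is a hypothetical, deliberately short list, and the function name and sample text are made up for illustration.

```python
import re

# Hypothetical, deliberately short list; extend it for your own texts.
ABBREVIATIONS = {"Mr.", "Mrs.", "Ms.", "Dr.", "Prof."}

def split_keeping_abbreviations(text):
    """Break after ., ? or ! plus whitespace, then undo any break
    that directly follows a listed abbreviation."""
    marked = re.sub(r"([!?.])\s+", r"\1<BRK>", text.strip())
    sentences = []
    for fragment in marked.split("<BRK>"):
        if not fragment:
            continue
        # The last word of the previous piece tells us whether the break was real.
        if sentences and sentences[-1].split()[-1] in ABBREVIATIONS:
            sentences[-1] += " " + fragment
        else:
            sentences.append(fragment)
    return sentences

# Hypothetical sample text.
sample = "Mr. Smith met Dr. Jones. They talked! Did it help?"
result = split_keeping_abbreviations(sample)
for sentence in result:
    print(sentence)
```

    The obvious weakness is a sentence that really does end in one of the listed abbreviations: it gets glued to the next one. A trained tokenizer handles that better than any fixed list.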
