I have a text file. I need to get a list of sentences.
How can this be implemented? There are a lot of subtleties, such as a dot being used in abbreviations.
I hope this will help you with Latin, Chinese, and Arabic text:
import re

# Insert a <pad> marker after sentence-ending punctuation (Latin, CJK and
# Arabic marks) that does not immediately follow a digit, then split on it.
punctuation = re.compile(r"([^\d+])(\.|!|\?|;|\n|。|！|？|；|…|　|!|؟|؛)+")
lines = []
with open('myData.txt', 'r', encoding="utf-8") as myFile:
    lines = punctuation.sub(r"\1\2<pad>", myFile.read())
    lines = [line.strip() for line in lines.split("<pad>") if line.strip()]
I was working on a similar task and came across this question. After following a few links and working through a few NLTK exercises, I found that the code below worked for me like magic.
from nltk.tokenize import sent_tokenize
text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article"
sent_tokenize(text)
output:
['Hello everyone.',
'Welcome to GeeksforGeeks.',
'You are studying NLP article']
Source: https://www.geeksforgeeks.org/nlp-how-tokenizing-text-sentence-words-works/
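If you are starting from a text file as in the question, a minimal sketch of the same approach (the file name is just a placeholder) could look like this; note that sent_tokenize needs the punkt models, which you download once:
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # one-time download of the sentence tokenizer models

with open('myData.txt', encoding='utf-8') as f:  # placeholder file name
    text = f.read()

sentences = sent_tokenize(text)
print(sentences)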
For simple cases (where sentences are terminated normally), this should work:
import re
text = open('somefile.txt').read()
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)
The core of the regex is ' *\. +', which matches a period surrounded by 0 or more spaces to the left and 1 or more to the right (to prevent something like the period in re.split being counted as a change in sentence); the version above also splits on ? and ! and skips any closing quotes, parentheses or brackets that follow the punctuation.
Obviously, not the most robust solution, but it'll do fine in most cases. The only case this won't cover is abbreviations (perhaps run through the list of sentences and check that each string in sentences starts with a capital letter?)
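As a rough, purely illustrative sketch of that capital-letter check (not part of the original code): fragments that start with a lowercase letter are glued back onto the previous sentence, which catches splits caused by abbreviations like "approx." (note that re.split drops the punctuation itself, just as in the snippet above).
import re

text = "The approx. value is 3. It works."
fragments = re.split(r' *[\.\?!][\'"\)\]]* *', text)
fragments = [f for f in fragments if f]

sentences = []
for frag in fragments:
    # If a fragment doesn't start with a capital letter, assume the previous
    # split was a false positive (e.g. after an abbreviation) and merge it back.
    if sentences and not frag[:1].isupper():
        sentences[-1] += ' ' + frag
    else:
        sentences.append(frag)

print(sentences)  # ['The approx value is 3', 'It works']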
The Natural Language Toolkit (nltk.org) has what you need. This group posting indicates this does it:
import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
fp = open("test.txt")
data = fp.read()
print('\n-----\n'.join(tokenizer.tokenize(data)))
(I haven't tried it!)
You can try using Spacy instead of regex. I use it and it does the job.
import spacy

nlp = spacy.load('en_core_web_sm')  # in spaCy 3 the 'en' shortcut is gone; load a model by name
text = '''Your text here'''
tokens = nlp(text)
for sent in tokens.sents:
    print(sent.text.strip())  # sent.text instead of the removed sent.string
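If the model isn't installed yet, it can be fetched once with: python -m spacy download en_core_web_sm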
I had to read subtitle files and split them into sentences. After pre-processing (like removing time information etc. in the .srt files), the variable fullFile contained the full text of the subtitle file. The crude way below neatly split them into sentences. Probably I was lucky that the sentences always ended (correctly) with a space. Try this first, and if it has any exceptions, add more checks and balances.
import re

# Very approximate way to split the text into sentences - break after ?, . and !
fullFile = re.sub(r"(!|\?|\.) ", r"\1<BRK>", fullFile)
sentences = fullFile.split("<BRK>")
with open("./sentences.out", "w+") as sentFile:
    for line in sentences:
        sentFile.write(line)
        sentFile.write("\n")
Oh well. I now realize that since my content was Spanish, I did not have the issue of dealing with "Mr. Smith" etc. Still, if someone wants a quick and dirty parser...
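For English text with titles like "Mr. Smith", one quick-and-dirty extension of the same idea (just a sketch; the abbreviation list is illustrative, not exhaustive) is to hide the dot in known abbreviations before inserting the break marker:
import re

fullFile = "Mr. Smith arrived late. Dr. Jones was not amused! Everyone waited."

# Temporarily hide the dot in a few known abbreviations so it is not
# treated as a sentence boundary (illustrative list only).
abbreviations = ["Mr.", "Mrs.", "Dr.", "Prof.", "St."]
for abbr in abbreviations:
    fullFile = fullFile.replace(abbr, abbr[:-1] + "<DOT>")

# Same crude split as above: break after ?, . and ! followed by a space.
fullFile = re.sub(r"(!|\?|\.) ", r"\1<BRK>", fullFile)
sentences = [s.replace("<DOT>", ".") for s in fullFile.split("<BRK>")]

print(sentences)
# ['Mr. Smith arrived late.', 'Dr. Jones was not amused!', 'Everyone waited.']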