I have a text file. I need to get a list of sentences.
How can this be implemented? There are a lot of subtleties, such as a dot being used in abbreviations.
I hope this will help you with Latin, Chinese, and Arabic text:
import re

# Insert a <pad> marker after sentence-ending punctuation (Latin, CJK and
# Arabic marks) that does not immediately follow a digit, then split on it.
punctuation = re.compile(r"([^\d+])(\.|!|\?|;|\n|。|！|？|；|…|　|!|؟|؛)+")
lines = []
with open('myData.txt', 'r', encoding="utf-8") as myFile:
    lines = punctuation.sub(r"\1\2<pad>", myFile.read())
    lines = [line.strip() for line in lines.split("<pad>") if line.strip()]
I was working on a similar task and came across this question. After following a few links and working through a few NLTK exercises, I found that the code below worked for me like magic.
from nltk.tokenize import sent_tokenize
text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article"
sent_tokenize(text)
output:
['Hello everyone.',
'Welcome to GeeksforGeeks.',
'You are studying NLP article']
Source: https://www.geeksforgeeks.org/nlp-how-tokenizing-text-sentence-words-works/
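If you are starting from a text file as in the question, a minimal sketch of the same approach (the file name is just a placeholder) could look like this; note that sent_tokenize needs the punkt models, which you download once:
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # one-time download of the sentence tokenizer models

with open('myData.txt', encoding='utf-8') as f:  # placeholder file name
    text = f.read()

sentences = sent_tokenize(text)
print(sentences)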
For simple cases (where sentences are terminated normally), this should work:
import re
text = open('somefile.txt').read()
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)
The core of the regex is ' *\. +', which matches a period surrounded by 0 or more spaces to the left and 1 or more to the right (to prevent something like the period in re.split being counted as a change in sentence); the version above also splits on ? and ! and skips any closing quotes, parentheses or brackets that follow the punctuation.
Obviously, not the most robust solution, but it'll do fine in most cases. The only case this won't cover is abbreviations (perhaps run through the list of sentences and check that each string in sentences starts with a capital letter?)
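As a rough, purely illustrative sketch of that capital-letter check (not part of the original code): fragments that start with a lowercase letter are glued back onto the previous sentence, which catches splits caused by abbreviations like "approx." (note that re.split drops the punctuation itself, just as in the snippet above).
import re

text = "The approx. value is 3. It works."
fragments = re.split(r' *[\.\?!][\'"\)\]]* *', text)
fragments = [f for f in fragments if f]

sentences = []
for frag in fragments:
    # If a fragment doesn't start with a capital letter, assume the previous
    # split was a false positive (e.g. after an abbreviation) and merge it back.
    if sentences and not frag[:1].isupper():
        sentences[-1] += ' ' + frag
    else:
        sentences.append(frag)

print(sentences)  # ['The approx value is 3', 'It works']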
The Natural Language Toolkit (nltk.org) has what you need. This group posting indicates this does it:
import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
fp = open("test.txt")
data = fp.read()
print('\n-----\n'.join(tokenizer.tokenize(data)))
(I haven't tried it!)
You can try using Spacy instead of regex. I use it and it does the job.
import spacy

nlp = spacy.load('en_core_web_sm')  # in spaCy 3 the 'en' shortcut is gone; load a model by name
text = '''Your text here'''
tokens = nlp(text)
for sent in tokens.sents:
    print(sent.text.strip())  # sent.text instead of the removed sent.string
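If the model isn't installed yet, it can be fetched once with: python -m spacy download en_core_web_sm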
I had to read subtitle files and split them into sentences. After pre-processing (like removing time information etc. in the .srt files), the variable fullFile contained the full text of the subtitle file. The crude way below neatly split them into sentences. Probably I was lucky that the sentences always ended (correctly) with a space. Try this first, and if it has any exceptions, add more checks and balances.
import re

# Very approximate way to split the text into sentences - break after ?, . and !
fullFile = re.sub(r"(!|\?|\.) ", r"\1<BRK>", fullFile)
sentences = fullFile.split("<BRK>")
with open("./sentences.out", "w+") as sentFile:
    for line in sentences:
        sentFile.write(line)
        sentFile.write("\n")
Oh well. I now realize that since my content was Spanish, I did not have the issue of dealing with "Mr. Smith" etc. Still, if someone wants a quick and dirty parser...
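For English text with titles like "Mr. Smith", one quick-and-dirty extension of the same idea (just a sketch; the abbreviation list is illustrative, not exhaustive) is to hide the dot in known abbreviations before inserting the break marker:
import re

fullFile = "Mr. Smith arrived late. Dr. Jones was not amused! Everyone waited."

# Temporarily hide the dot in a few known abbreviations so it is not
# treated as a sentence boundary (illustrative list only).
abbreviations = ["Mr.", "Mrs.", "Dr.", "Prof.", "St."]
for abbr in abbreviations:
    fullFile = fullFile.replace(abbr, abbr[:-1] + "<DOT>")

# Same crude split as above: break after ?, . and ! followed by a space.
fullFile = re.sub(r"(!|\?|\.) ", r"\1<BRK>", fullFile)
sentences = [s.replace("<DOT>", ".") for s in fullFile.split("<BRK>")]

print(sentences)
# ['Mr. Smith arrived late.', 'Dr. Jones was not amused!', 'Everyone waited.']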