I have a text file in the following format:
DELIMITER1
extract me
extract me
extract me
DELIMITER2
I'd like to extract every block of lines between DELIMITER1 and DELIMITER2.
If the delimiters are within a line:
def get_sentences(filename):
    d1, d2 = '.', ','  # just example delimiters
    with open(filename) as file_contents:
        for line in file_contents:
            i1 = line.find(d1)
            # look for the closing delimiter only after the opening one
            i2 = line.find(d2, i1 + len(d1)) if i1 != -1 else -1
            if i2 != -1:
                yield line[i1 + len(d1):i2]

sentences = list(get_sentences('path/to/my/file'))
If they are on their own lines:
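def get_sentences(filename):
    d1, d2 = 'DELIMITER1', 'DELIMITER2'  # just example delimiters
    with open(filename) as file_contents:
        results = None  # None while we are outside a block
        for line in file_contents:
            if d1 in line:
                results = []  # start a fresh block
            elif d2 in line:
                if results is not None:
                    yield results
                results = None  # ignore lines until the next d1
            elif results is not None:
                results.append(line)

sentences = list(get_sentences('path/to/my/file'))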
This should do what you want:
import re

def GetTheSentences(file):
    start_rx = re.compile('DELIMITER1')
    end_rx = re.compile('DELIMITER2')
    start = False
    output = []
    with open(file) as datafile:  # text mode, so lines are str like the patterns
        for line in datafile:
            if start_rx.match(line):
                start = True
                continue  # skip the delimiter line itself
            elif end_rx.match(line):
                start = False
            if start:
                output.append(line)
    return output
Your previous version looks like it's supposed to be an iterator function. Do you want your output returned one item at a time? That's slightly different.
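If one block at a time is what you're after, a generator along these lines might work. This is only a sketch: `iter_blocks` and its parameter names are made up for illustration, and it assumes each delimiter sits alone on its own line.

```python
def iter_blocks(lines, start="DELIMITER1", end="DELIMITER2"):
    """Yield each block of lines found between a start and an end marker."""
    block = None  # None means we are currently outside a block
    for line in lines:
        text = line.rstrip("\n")
        if text == start:
            block = []          # open a new block
        elif text == end:
            if block is not None:
                yield block     # close the block and hand it to the caller
            block = None
        elif block is not None:
            block.append(text)  # inside a block: keep the line


# Hypothetical sample input standing in for the lines of a file.
sample = ["skip", "DELIMITER1", "a", "b", "DELIMITER2",
          "skip", "DELIMITER1", "c", "DELIMITER2"]
blocks = list(iter_blocks(sample))
```

Passing an open file object instead of the list should work as well, since iterating over a file yields its lines.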
This is a good job for list comprehensions, no regex required. The first list comprehension strips the trailing \n from each line read from the text file. The second uses the in operator to filter out the delimiter lines. Note that this only drops the delimiter lines themselves, so it assumes every other line in the file should be kept.
def extract_lines(file):
    with open(file, 'r') as f:  # make sure the file gets closed
        scrubbed = [x.strip('\n') for x in f]
    return [x for x in scrubbed if x not in ('DELIMITER1', 'DELIMITER2')]
You can simplify this to one regular expression using re.S, the DOTALL flag.
import re

def GetTheSentences(infile):
    with open(infile) as fp:
        for result in re.findall('DELIMITER1(.*?)DELIMITER2', fp.read(), re.S):
            print(result)

# extract me
# extract me
# extract me
This also makes use of the non-greedy operator .*?, so multiple non-overlapping blocks of DELIMITER1-DELIMITER2 pairs will all be found.
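To see why the non-greedy form matters, here is a small comparison on an in-memory string. The sample text is made up for illustration; only `re.findall` and `re.S` come from the answer above.

```python
import re

# Hypothetical input with two delimited blocks and a stray line between them.
text = "DELIMITER1\na\nDELIMITER2\nskip\nDELIMITER1\nb\nDELIMITER2\n"

# Non-greedy: each match stops at the nearest DELIMITER2.
lazy = re.findall(r"DELIMITER1(.*?)DELIMITER2", text, re.S)

# Greedy: a single match runs on to the last DELIMITER2 in the text.
greedy = re.findall(r"DELIMITER1(.*)DELIMITER2", text, re.S)

print(lazy)    # ['\na\n', '\nb\n']
print(greedy)  # ['\na\nDELIMITER2\nskip\nDELIMITER1\nb\n']
```

Each captured block keeps the newlines next to the delimiters, so calling `.strip()` on a result may be handy if you only want the lines themselves.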