Let's say I have a text file with the content below:
fdsjhgjhg
fdshkjhk
Start
Good Morning
Hello World
End
dashjkhjk
dsfjkhk
Start
hgjkkl
dfghjjk
You can do this with regular expressions. This will exclude rogue Start and End lines. Here is an example:
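import re
# Read the whole file so the pattern can span several lines
with open('test.txt', 'r') as f:
    txt = f.read()
# Capture the lines between a Start line and the next End line,
# skipping any block that contains another Start line
matches = re.findall(r'^\s*Start\s*$\n((?:^\s*(?!Start).*$\n)*?)^\s*End\s*$', txt, flags=re.M)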
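To see why a rogue Start is skipped, here is a small sketch that applies the same pattern to an in-memory string. The sample text and the name `PATTERN` are made up for illustration; `re.VERBOSE` is only used so each part of the pattern can carry a comment.

```python
import re

# Same pattern as above, split up with re.VERBOSE so each part can be annotated
PATTERN = re.compile(
    r"""
    ^\s*Start\s*$\n              # a line holding only Start
    (                            # capture the body of the block
      (?:^\s*(?!Start).*$\n)*?   # body lines that do not begin with Start
    )
    ^\s*End\s*$                  # a line holding only End
    """,
    re.MULTILINE | re.VERBOSE,
)

# Hypothetical sample: the second Start has no End before the third Start
sample = (
    "fdsjhgjhg\n"
    "Start\nGood Morning\nHello World\nEnd\n"
    "dashjkhjk\n"
    "Start\nhgjkkl\n"
    "Start\nGood Evening\nGood\nEnd\n"
)

matches = PATTERN.findall(sample)
print(matches)
# ['Good Morning\nHello World\n', 'Good Evening\nGood\n']
```

The block that begins at the second Start is dropped because the body is not allowed to contain a line beginning with Start, so the match attempt fails there and the search resumes at the third Start.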
If you don't expect to get nested structures, you could do this:
import re
# load the file contents into text (file name assumed to be test.txt)
with open("test.txt", "r") as f:
    text = f.read()
# match everything between "Start" and "End"
occurences = re.findall(r"Start(.*?)End", text, re.DOTALL)
# discard text before duplicated occurrences of "Start"
occurences = [oc.rsplit("Start", 1)[-1] for oc in occurences]
# optionally trim surrounding newlines
occurences = [oc.strip("\n") for oc in occurences]
Which prints
>>> for oc in occurences: print(oc)
Good Morning
Hello World
Good Evening
Good
You can add the \n as part of Start and End if you want.
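A minimal sketch of what that could look like, assuming the markers sit on their own lines. The sample string is hypothetical, and this variant only shows the newline anchoring, not the rsplit step for duplicated Start lines.

```python
import re

# Hypothetical sample text; in practice this would be the file contents
sample = "noise\nStart\nGood Morning\nHello World\nEnd\nStarting over\nStart\nrogue\n"

# "Start\n" requires a newline right after the marker, so "Starting over" is ignored;
# "\nEnd" requires End to begin a line
blocks = re.findall(r"Start\n(.*?)\nEnd", sample, re.DOTALL)
print(blocks)
# ['Good Morning\nHello World']
```

With the newlines inside the pattern the captured text no longer needs to be stripped. A line that merely begins with End (for example "Ending") would still close a block, so tighten the pattern further if that can occur in your data.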
Great problem! This is a bucket problem where each Start needs an End.
The reason you got that result is that there are two consecutive 'Start' lines.
It's best to store the lines somewhere until 'End' is triggered.
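with open('scores.txt', 'r') as infile, open('testt.txt', 'w') as outfile:
    copy = False
    bucket = []
    for line in infile:
        if line.strip() == "Start":
            # a new Start throws away whatever an unfinished block collected
            bucket = []
            copy = True
        elif line.strip() == "End":
            # write the bucket only once its End has been reached
            for strings in bucket:
                outfile.write(strings + '\n')
            bucket = []
            copy = False
        elif copy:
            bucket.append(line.strip())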
You could keep a temporary list of lines, and only commit them after you know that a section meets your criteria. Maybe try something like the following:
with open('test.txt', 'r') as infile, open('testt.txt', 'w') as outfile:
    copy = False
    tmpLines = []
    for line in infile:
        if line.strip() == "Start":
            # start over: drop lines from any block that never reached End
            copy = True
            tmpLines = []
        elif line.strip() == "End":
            # the section is complete, so commit its lines
            copy = False
            for tmpLine in tmpLines:
                outfile.write(tmpLine)
            tmpLines = []
        elif copy:
            tmpLines.append(line)
This gives the output
Good Morning
Hello World
Good Evening
Good
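If you want to try the buffering idea without touching any files, it can be wrapped in a generator that works on any iterable of lines. This is only a sketch: the function name `blocks_between` and the sample list are invented for illustration.

```python
def blocks_between(lines, start="Start", end="End"):
    """Yield one list of lines per Start...End block; a repeated Start resets the buffer."""
    buffer = None  # None means we are currently outside a block
    for line in lines:
        stripped = line.strip()
        if stripped == start:
            buffer = []          # discard anything collected after an earlier Start
        elif stripped == end:
            if buffer is not None:
                yield buffer     # commit the block only once End is seen
            buffer = None        # a stray End with no Start yields nothing
        elif buffer is not None:
            buffer.append(stripped)


# Hypothetical input: one rogue Start and one stray End at the bottom
sample = [
    "fdsjhgjhg", "Start", "Good Morning", "Hello World", "End",
    "dashjkhjk", "Start", "hgjkkl", "Start", "Good Evening", "Good", "End", "End",
]
result = list(blocks_between(sample))
print(result)
# [['Good Morning', 'Hello World'], ['Good Evening', 'Good']]
```

Since a file object is itself an iterable of lines, passing an open file to the function should work the same way.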
Here's a hacky but perhaps more intuitive way using regex. It finds all text that exists between "Start" and "End" pairs, and the print statement trims them off.
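import re
with open('test.txt', 'r') as infile:
    text = infile.read()
# re.DOTALL lets "." match newlines, so a match can span several lines
matches = re.findall('Start.*?End', text, re.DOTALL)
for m in matches:
    # slice off the markers; strip('Start ') would remove a set of characters, not the word
    print(m[len('Start'):-len('End')].strip())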