I have a large log file, and I want to extract a multi-line string between two strings: start and end.
The following is sample from the
This regex should match what you want:
(start((?!start).)*?end)
Use the re.findall method with the single-line (DOTALL) modifier re.S to get all the occurrences in a multi-line string:
re.findall('(start((?!start).)*?end)', text, re.S)
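One caveat: because the pattern above contains two capturing groups, re.findall returns a list of tuples rather than plain strings. A minimal sketch of a variant with non-capturing groups follows; the sample text is made up for illustration and is not taken from the original log file.

```python
import re

# Hypothetical sample; only the markers "start" and "end" come from the question.
text = "start a\nstart b\nc end\nnoise\nstart d end"

# (?:...) groups do not capture, so findall returns each whole match as a string.
# (?!start). consumes one character at a time, refusing to cross another "start".
pattern = re.compile(r"start(?:(?!start).)*?end", re.S)
blocks = pattern.findall(text)

for block in blocks:
    print(repr(block))
```

On this sample the first, unterminated "start a" is skipped, and only the two innermost start/end blocks are returned.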
You could do (?s)start.*?(?=end|start)(?:end)?, then filter out everything not ending in "end".
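As a rough sketch of that two-step idea (match, then filter), assuming the same made-up sample text as an illustration:

```python
import re

# Hypothetical sample; only the markers "start" and "end" come from the question.
text = "start a\nstart b\nc end\nnoise\nstart d end"

# Each candidate stops at the next "end" or the next "start", whichever comes first.
candidates = re.findall(r"(?s)start.*?(?=end|start)(?:end)?", text)

# Candidates cut short by another "start" do not end in "end", so drop them.
blocks = [c for c in candidates if c.endswith("end")]
```

Here the candidate "start a\n" is discarded because it was interrupted by the second "start".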
This is tricky to do because the re module does not return overlapping matches. The third-party regex module, which is installed separately from PyPI and is not part of the standard library, supports overlapping matches.
https://pypi.python.org/pypi/regex
You'd want to use something like
regex.findall(pattern, string, overlapped=True)
If you're stuck with an environment where you can't install regex, it's still possible with some trickery. One brilliant person solved it here:
Python regex find all overlapping matches?
Once you have all possible overlapping (non-greedy, I imagine) matches, just determine which one is shortest, which should be easy.
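A minimal sketch of that approach using only the standard re module: a capturing group inside a zero-width lookahead is a common way to get overlapping matches, though it may not be exactly what the linked answer does. Since a file can hold several blocks, this sketch picks the shortest candidate per "end" position rather than one global shortest; the sample text is hypothetical.

```python
import re

# Hypothetical sample; only the markers "start" and "end" come from the question.
text = "start a\nstart b\nc end\nnoise\nstart d end"

# The lookahead itself consumes nothing, so a candidate is reported at every "start".
overlapping = [
    (m.start(1), m.end(1))
    for m in re.finditer(r"(?=(start.*?end))", text, re.S)
]

# Several candidates can share the same "end"; keep the one that begins latest,
# which is the shortest match for that end position.
latest_start = {}
for begin, stop in overlapping:
    if stop not in latest_start or begin > latest_start[stop]:
        latest_start[stop] = begin

blocks = [text[begin:stop] for stop, begin in sorted(latest_start.items())]
```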
Do it with code - basic state machine:
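# fi is an open file object (or any iterable of lines)
is_open = False  # renamed from "open" to avoid shadowing the builtin
tmp = []
for ln in fi:
    if 'start' in ln:
        if is_open:
            # a second 'start' before any 'end': discard the partial block
            tmp = []
        else:
            is_open = True
    if is_open:
        tmp.append(ln)
    if 'end' in ln:
        is_open = False
        for x in tmp:
            print(x.rstrip('\n'))
        tmp = []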
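If you need the blocks as values rather than printed output, the same state machine can be wrapped in a generator. This is only a sketch: the function name iter_blocks and its parameters are made up for illustration, and io.StringIO stands in for the real log file.

```python
import io


def iter_blocks(lines, start="start", end="end"):
    """Yield each block of lines from the last 'start' line up to the next 'end' line."""
    block = []
    is_open = False
    for line in lines:
        if start in line:
            # A new 'start' always begins a fresh block, discarding any partial one.
            block = []
            is_open = True
        if is_open:
            block.append(line)
            if end in line:
                yield "".join(block)
                block = []
                is_open = False


# Hypothetical in-memory sample standing in for the log file.
sample = io.StringIO("start a\nstart b\nc end\nnoise\nstart d end\n")
blocks = list(iter_blocks(sample))
```

Note that this works at line granularity, so each block contains whole lines, including any text before "start" or after "end" on the marker lines.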