Extracting the text between pattern1 and pattern2 from a string (already read from a file).
Using a regex:
>>> print(a)
aaa aa a
bbb bb b
ccc cc c
ffffd dd d
eee ee e
fff ff f
The expected result:
>>> print(re.search(r'^.*bb b$\n((?:.+\n)+)^.*dd d$', a, re.M).group())
bbb bb b
ccc cc c
ffffd dd d
Or just the enclosed text:
>>> print(re.search(r'^.*bb b$\n((?:.+\n)+)^.*dd d$', a, re.M).group(1))
ccc cc c
In awk, the /start/, /end/ range pattern prints every line from the one where /start/ is found up to and including the one where /end/ is found. It is a useful construct, and similar range operators exist in Perl, sed, Ruby and others.
To do a range operator in Python, write a class that keeps track of the state of the previous call to the start operator until the end operator. We can use a regex (as awk does) or this can be trivially modified to anything returning a True or False status for a line of data.
Given your example file, you can do:
import re

class FlipFlop:
    '''Imitate the behavior of the /start/, /end/ flip-flop in awk.'''
    def __init__(self, start_pattern, end_pattern):
        # Index 0 is the start pattern, index 1 is the end pattern
        self.patterns = start_pattern, end_pattern
        self.state = False   # True while inside a start..end range

    def __call__(self, st):
        ms = [e.search(st) for e in self.patterns]
        if all(ms):
            # Both patterns on one line: keep it and end up outside the range
            self.state = False
            return True
        rtr = self.state     # was this line already inside a range?
        # A bool indexes as 0/1: look for start when outside, end when inside
        if ms[self.state]:
            self.state = not self.state
        return self.state or rtr

with open('/tmp/file') as f:
    ff = FlipFlop(re.compile('b bb'), re.compile('d dd'))
    print(''.join(line for line in f if ff(line)), end='')
Prints:
bbb bb b
ccc cc c
ffffd dd d
That retains a line-by-line file read with the flexibility of the /start/,/end/ regex seen in other languages. Of course, you can use the same approach for a multiline string (assumed to be named s):
''.join(line+"\n" if ff(line) else "" for line in s.splitlines())
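As noted earlier, the same idea works with anything that returns True or False for a line, not only a regex. The following is a minimal sketch of that variation written as a generator; the name flip_flop and the two lambda predicates are illustrative assumptions, not part of the class above.

```python
def flip_flop(lines, is_start, is_end):
    """Yield each line from one satisfying is_start through the next one
    satisfying is_end (inclusive), then wait for the next start."""
    active = False
    for line in lines:
        if not active:
            if is_start(line):
                # A line matching both predicates is a one-line range.
                active = not is_end(line)
                yield line
        else:
            yield line
            if is_end(line):
                active = False


# Sample data mirroring the example file
text = "aaa aa a\nbbb bb b\nccc cc c\nffffd dd d\neee ee e\nfff ff f"
selected = list(flip_flop(text.splitlines(),
                          lambda l: "b bb" in l,   # hypothetical start test
                          lambda l: "d dd" in l))  # hypothetical end test
print("\n".join(selected))
```

Because the predicates are plain callables, a compiled regex can still be plugged in by passing its search method, for example re.compile('b bb').search.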
Idiomatically, in awk, you can get the same result as a flipflop using a flag:
$ awk '/b bb/{flag=1} flag{print $0} /d dd/{flag=0}' file
You can replicate that in Python as well (with more words):
import re

flag = False
with open('file') as f:
    for line in f:
        if re.search(r'b bb', line):
            flag = True     # start pattern seen: begin printing
        if flag:
            print(line.rstrip())
        if re.search(r'd dd', line):
            flag = False    # end pattern seen: stop after this line
The same loop also works on an in-memory string; iterate over s.splitlines() instead of the file.
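One way to reuse the flag loop unchanged for both files and strings is to wrap the string in io.StringIO, which iterates line by line like an open file. This is only a sketch: the helper name flag_filter and its default patterns are assumptions chosen to match the example data.

```python
import io
import re

def flag_filter(lines, start=r'b bb', end=r'd dd'):
    """Mimic: awk '/start/{flag=1} flag{print $0} /end/{flag=0}'."""
    flag = False
    for line in lines:
        if re.search(start, line):
            flag = True
        if flag:
            yield line.rstrip('\n')
        if re.search(end, line):
            flag = False

s = "aaa aa a\nbbb bb b\nccc cc c\nffffd dd d\neee ee e\nfff ff f\n"

# io.StringIO yields lines (with their newlines) just like a file object
kept = list(flag_filter(io.StringIO(s)))
print("\n".join(kept))
```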
Or, you can use a multi-line regex:
with open('/tmp/file') as f:
    print(''.join(re.findall(r'^.*b bb[\s\S]*d dd.*$', f.read(), re.M)))
But that requires reading the entire file into memory. Since you state the string has been read into memory, that is probably easiest in this case:
''.join(re.findall(r'^.*b bb[\s\S]*d dd.*$', s, re.M))
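One caveat with this pattern: [\s\S]* is greedy, so if the end pattern occurs more than once, the match runs to the last occurrence. A small sketch of the difference, using sample data that assumes an extra "d dd" line (not present in the original file):

```python
import re

# Sample text with a second line containing the end pattern
s = ("aaa aa a\nbbb bb b\nccc cc c\nffffd dd d\n"
     "eee ee e\nxxx d dd x\nfff ff f\n")

# Greedy: extends to the LAST line containing 'd dd'
greedy = re.findall(r'^.*b bb[\s\S]*d dd.*$', s, re.M)

# Non-greedy (*?): stops at the FIRST line containing 'd dd'
lazy = re.findall(r'^.*b bb[\s\S]*?d dd.*$', s, re.M)

print(greedy[0])
print('---')
print(lazy[0])
```

The non-greedy form behaves like the awk range and the flip-flop class, which both stop at the first end line.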
Use the re.DOTALL flag so that . matches any character, including newlines. Then plug in the beginning pattern and end pattern:
re.search(r'[\w ]*b bb.*?d dd[ \w]*', string, re.DOTALL).group(0)
Note: (1) string here is the text you wish to search through, such as the contents of a file. (2) You'll need to import re. If you really want to be concise, perhaps to a fault, you can combine reading the file and extracting the pattern:
re.search(r'[\w ]*b bb.*?d dd[ \w]*', open('file').read(), re.DOTALL).group(0)
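The one-liner has two rough edges: the file is never explicitly closed, and re.search returns None when nothing matches, so .group(0) would raise AttributeError. A slightly longer sketch that handles both follows; the helper names extract_block and extract_from_file are hypothetical, and the start and end strings are treated as literal text via re.escape.

```python
import re

def extract_block(text, start='b bb', end='d dd'):
    """Return the text from the line fragment containing `start` through the
    first following `end`, or None if there is no match."""
    pattern = r'[\w ]*' + re.escape(start) + r'.*?' + re.escape(end) + r'[ \w]*'
    m = re.search(pattern, text, re.DOTALL)
    return m.group(0) if m else None

def extract_from_file(path, start='b bb', end='d dd'):
    # The with block closes the file even if reading fails
    with open(path) as f:
        return extract_block(f.read(), start, end)

sample = "aaa aa a\nbbb bb b\nccc cc c\nffffd dd d\neee ee e\nfff ff f"
print(extract_block(sample))
print(extract_block("nothing to see here"))
```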