Python: consecutive lines between matches similar to awk


Given:

A multiline string, string (already read from a file, file)
Two patterns, pattern1 and pattern2


                      
3 answers

你的背包
2021-01-15 06:56

Using a regex:

>>> print(a)

aaa aa a
bbb bb b
ccc cc c
ffffd dd d
eee ee e
fff ff f


The expected result:

>>> print(re.search(r'^.*bb b$\n((?:.+\n)+)^.*dd d$', a, re.M).group())
bbb bb b
ccc cc c
ffffd dd d


Or just the enclosed text:

>>> print(re.search(r'^.*bb b$\n((?:.+\n)+)^.*dd d$', a, re.M).group(1))
ccc cc c
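
The same idea can be wrapped in a small helper that takes the two patterns as arguments. This is only a sketch: the name lines_between is made up for illustration, pattern1 and pattern2 are assumed to be regex fragments that match within a single line, and the start and end are assumed to sit on different lines.

```python
import re

def lines_between(text, pattern1, pattern2):
    """Return (whole_block, inner_lines) for the first range, or None."""
    # Hypothetical helper: builds one multiline regex from the two fragments.
    # The lazy (?:.*\n)*? stops at the first line that contains pattern2.
    regex = re.compile(
        rf'^.*(?:{pattern1}).*$\n((?:.*\n)*?)^.*(?:{pattern2}).*$',
        re.M,
    )
    m = regex.search(text)
    return (m.group(), m.group(1)) if m else None

a = "aaa aa a\nbbb bb b\nccc cc c\nffffd dd d\neee ee e\nfff ff f\n"
block, inner = lines_between(a, r'bb b', r'dd d')
print(block)           # start line, enclosed lines, end line
print(inner, end='')   # only the enclosed lines
```

Returning None instead of calling .group() directly avoids an AttributeError when either pattern is missing from the text.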

                                                                        
                                                        
            
            
              
                
醉话见心
2021-01-15 07:04

In awk, the /start/,/end/ range pattern prints every line from the one where /start/ matches up to and including the one where /end/ matches. It is a useful construct, and sed, Perl, Ruby and others offer an equivalent.

To get a range operator in Python, write a class that remembers between calls whether a start line has been seen without a matching end line yet. The tests can be regexes (as in awk), or the class can be trivially modified to take anything that returns a True or False status for a line of data.

Given your example file, you can do:

import re

class FlipFlop:
    '''Imitate the behavior of the /start/,/end/ flip-flop in awk.'''
    def __init__(self, start_pattern, end_pattern):
        self.patterns = start_pattern, end_pattern
        self.state = False  # True while inside a start..end range
    def __call__(self, st):
        ms = [e.search(st) for e in self.patterns]
        # Start and end on the same line: keep the line and close the range.
        if all(ms):
            self.state = False
            return True
        was_inside = self.state
        # A bool indexes the list: test start when outside, end when inside.
        if ms[self.state]:
            self.state = not self.state
        # True on the start line, inside the range, and on the end line.
        return self.state or was_inside

with open('/tmp/file') as f:
    ff = FlipFlop(re.compile('b bb'), re.compile('d dd'))
    print(''.join(line for line in f if ff(line)), end='')


Prints:

bbb bb b
ccc cc c
ffffd dd d


That retains a line-by-line file read with the flexibility of the /start/,/end/ range seen in other languages. Of course, you can take the same approach with a multiline string (assumed to be named s):

''.join(line+"\n" if ff(line) else "" for line in s.splitlines())
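
If the start and end tests are not regexes, the same state machine can be written as a generator that accepts any two callables. This is a sketch rather than a drop-in replacement: the name awk_range and the sample predicates are made up for illustration.

```python
import re

def awk_range(lines, is_start, is_end):
    """Yield each line from an is_start hit through the next is_end hit, inclusive."""
    inside = False
    for line in lines:
        if not inside and is_start(line):
            inside = True
        if inside:
            yield line
            # Checked on the start line too, so a one-line range closes at once.
            if is_end(line):
                inside = False

s = "aaa aa a\nbbb bb b\nccc cc c\nffffd dd d\neee ee e\nfff ff f\n"
start = re.compile(r'b bb').search       # a compiled regex search works...
end = lambda line: 'd dd' in line        # ...and so does any other callable
result = list(awk_range(s.splitlines(), start, end))
print('\n'.join(result))
```

Because it is a generator, it works equally well on an open file object, reading one line at a time.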


Idiomatically, in awk, you can get the same result as a flipflop using a flag:

$ awk '/b bb/{flag=1} flag{print $0} /d dd/{flag=0}' file


You can replicate that in Python as well (with more words):

import re

flag = False
with open('file') as f:
    for line in f:
        if re.search(r'b bb', line):
            flag = True
        if flag:
            print(line.rstrip())
        if re.search(r'd dd', line):
            flag = False


The same loop also works on an in-memory string if you iterate over s.splitlines() instead of the file.

Or, you can use a multi-line regex:

with open('/tmp/file') as f:
    print(''.join(re.findall(r'^.*b bb[\s\S]*d dd.*$', f.read(), re.M)))



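But that requires reading the entire file into memory. Note also that [\s\S]* is greedy, so the match runs to the last line containing d dd; use [\s\S]*? to stop at the first one, as awk does. Since you state that the string has already been read into memory, the regex is probably the easiest option in this case: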

''.join(re.findall(r'^.*b bb[\s\S]*d dd.*$', s, re.M))
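
The difference between a greedy and a lazy [\s\S] run only shows up when the text holds more than one range. A small sketch, using a made-up sample string with two ranges separated by one unrelated line:

```python
import re

# Hypothetical sample: two start..end ranges with 'eee ee e' between them.
s = ("bbb bb b\nccc cc c\nffffd dd d\n"
     "eee ee e\n"
     "bbb bb b\nggg gg g\nffffd dd d\n")

# Greedy: one match from the first start line to the last end line.
greedy = re.findall(r'^.*b bb[\s\S]*d dd.*$', s, re.M)
# Lazy: one match per range, each ending at the nearest end line.
lazy = re.findall(r'^.*b bb[\s\S]*?d dd.*$', s, re.M)

print(len(greedy), len(lazy))
print('\n'.join(lazy))   # matches carry no trailing newline, so join on '\n'
```

The lazy form mirrors what awk's /b bb/,/d dd/ prints, because the line between the two ranges is left out.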

                                                                        
                                                        
            
            
              
                
傲寒
2021-01-15 07:05

Use the re.DOTALL flag so that . matches any character, including newlines. Then plug in the beginning and end patterns:

re.search(r'[\w ]*b bb.*?d dd[ \w]*', string, re.DOTALL).group(0)


Note: (1) string here is the file contents or string you wish to search through. (2) You'll need to import re. If you really want to be concise, perhaps to a fault, you can combine reading the file and extracting the pattern:

re.search(r'[\w ]*b bb.*?d dd[ \w]*', open('file').read(), re.DOTALL).group(0)
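
re.search returns None when nothing matches, so calling .group(0) on the result directly raises AttributeError in that case. A slightly more defensive sketch, with a made-up function name (extract_range) and an inline sample string standing in for the file contents:

```python
import re

def extract_range(text):
    """Return the first 'b bb' .. 'd dd' block in text, or None if absent."""
    m = re.search(r'[\w ]*b bb.*?d dd[ \w]*', text, re.DOTALL)
    return m.group(0) if m else None

sample = "aaa aa a\nbbb bb b\nccc cc c\nffffd dd d\neee ee e\n"
found = extract_range(sample)
missing = extract_range("no markers here")
print(found)
print(missing)
```

For a file, pass in the contents read inside a with open('file') as f: block so that the file is closed promptly.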

                                                                        
                                                        
            
            
              
                