Print the start of HTML tags


I want to print out the start (opening) HTML tags that have attributes.


      
      
        
2 Answers

说谎
2021-01-28 21:42

This is fairly complicated. You can try the expression below, but it will fail in some cases. It first consumes the undesired instances in its leading alternatives; the capturing group at the end picks up the desired ones.

Regular expressions may not be the best tool for this job.

Test

import re

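# Alternatives 1 and 2 consume the lines to skip (tags without attributes, and
# tags whose opening tag mentions "test"); only alternative 3 has a capturing group.
regex = r"^\s*<\S+>\s*$|^\s*<\S+\s.*test.*?>.*?<\/\S+>$|^\s*(<.*>)\s*$"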

test_str = """

<h1>test</h1>
    <h2>test2</h2>
    <div id="content"></div>
    <p>test3</p>
    <div class="test"></div>
    <div id="nav"></div>
    <p>test3</p>

"""

print(re.findall(regex, test_str, re.M))


Output

['', '', '<div id="content"></div>', '', '', '<div id="nav"></div>', '']
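The empty strings in the output come from the alternatives that have no capturing group: `re.findall` reports group 1 for every match, and that group is empty whenever a line was only consumed. Assuming just the captured tags are wanted, a minimal sketch that drops the empty entries (the `wanted` name and the shortened sample are illustrative, not part of the answer):

```python
import re

regex = r"^\s*<\S+>\s*$|^\s*<\S+\s.*test.*?>.*?<\/\S+>$|^\s*(<.*>)\s*$"

test_str = """
<h1>test</h1>
    <div id="content"></div>
    <div class="test"></div>
    <div id="nav"></div>
"""

# Keep only the matches where the capturing group (alternative 3) took part.
wanted = [m for m in re.findall(regex, test_str, re.M) if m]
print(wanted)  # ['<div id="content"></div>', '<div id="nav"></div>']
```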


The expression is explained in the top-right panel of regex101.com if you wish to explore, simplify, or modify it. There you can also watch how it matches against sample inputs.
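Since the answer itself doubts that a regular expression is the right tool, here is a minimal sketch of a parser-based alternative using the standard library's `html.parser`. The class name `AttrTagCollector` and the helper `start_tags_with_attributes` are hypothetical names chosen for illustration. Note one assumption that differs from the expression above: this sketch keeps every start tag that has attributes, including `<div class="test">`.

```python
from html.parser import HTMLParser


class AttrTagCollector(HTMLParser):
    """Collect the raw text of start tags that carry at least one attribute."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; it is empty for bare tags like <h1>.
        if attrs:
            # get_starttag_text() returns the source text of the most recent start tag.
            self.found.append(self.get_starttag_text())


def start_tags_with_attributes(html):
    parser = AttrTagCollector()
    parser.feed(html)
    parser.close()
    return parser.found


sample = """
<h1>test</h1>
    <h2>test2</h2>
    <div id="content"></div>
    <p>test3</p>
    <div class="test"></div>
    <div id="nav"></div>
"""

print(start_tags_with_attributes(sample))
# ['<div id="content">', '<div class="test">', '<div id="nav">']
```

Filtering out particular attribute values (such as the `test` class) could be done inside `handle_starttag` by inspecting the `attrs` pairs before appending.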
                                                                        
                                                        
            
            
              
                
花落未央
2021-01-28 22:01

Use a non-greedy match for any number of characters to the left of the =:

r'<.*?=.*?>'


That matches a <, then as few characters as possible up to a =, then as few characters as possible up to the next >.

What you had:

r'<?=.*?>'


That means an optional <, followed by a =, followed by the shortest string up to the next >. Because the optional < can only match when it sits immediately before the =, it never matches in tags like these, and each result starts at the = instead of at the <.
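The difference between the two patterns can be checked directly. The sketch below runs both on a sample line; the third pattern, `STRICT`, is not from the answer but a hypothetical variant added here as an assumption, to show one way of keeping a match from spanning several tags that share a line.

```python
import re

LAZY = re.compile(r'<.*?=.*?>')          # pattern suggested in this answer
ORIGINAL = re.compile(r'<?=.*?>')        # pattern from the question
STRICT = re.compile(r'<[^<>]*=[^<>]*>')  # hypothetical variant: never crosses < or >

one_tag = '<div id="nav"></div>'
print(LAZY.findall(one_tag))      # ['<div id="nav">']
print(ORIGINAL.findall(one_tag))  # ['="nav">']

# When an attribute-free tag precedes the target on the same line,
# the lazy pattern starts at the first "<" and runs through to the "=".
same_line = '<h1>test</h1><div id="nav"></div>'
print(LAZY.findall(same_line))    # ['<h1>test</h1><div id="nav">']
print(STRICT.findall(same_line))  # ['<div id="nav">']
```

With one tag per line, as in the sample input, the lazy pattern is enough because `.` does not match a newline by default.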
                                                                        
                                                        
            
            
              
                