OK, so I'm working on a regular expression to search out all of the header information on a site.
I\'ve compiled the regular expression:
regex = re.compile
Because of the parentheses around the anchor tag, that part is interpreted as a capture group. When a pattern contains groups, re.findall returns only the captured groups rather than the whole match.
Wrap the entire regex in parentheses as well and you'll see the full matches show up as the first element of the returned tuples.
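Here's a minimal sketch of the difference (the pattern is a stand-in, since the original regex isn't shown):
import re

html = '<h1><a href="dog.com">Dog thing</a></h1>'

# With an inner group, findall returns only what the group captured:
inner = re.compile(r'<h1>(<a href=".*?">.*?</a>)</h1>')
print inner.findall(html)   # ['<a href="dog.com">Dog thing</a>']

# Parenthesize the whole pattern and the full match becomes the
# first element of each returned tuple:
whole = re.compile(r'(<h1>(<a href=".*?">.*?</a>)</h1>)')
print whole.findall(html)   # full match first, then the inner group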
But indeed, you should use a real parser.
Building on the answers so far:
It's best to use a parsing engine. It covers a lot of cases, and elegantly. I've tried BeautifulSoup and I like it very much. It's also easy to use, with a great tutorial.
If a parser sometimes feels like shooting flies with a cannon, you can use a regular expression for quick parsing. If that's what you need, here is the modified code, which will catch all the headers (even those spanning multiple lines):
import re

p = re.compile(r'<(h[0-9])>(.+?)</\1>', re.IGNORECASE | re.DOTALL)
stories = p.findall(html)
for i in stories:
    print i
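re.DOTALL is what lets the dot cross newlines here, so a header broken across lines still matches; a small sketch with made-up input:
import re

html = '<h2>\nCat\nstory\n</h2>'
p = re.compile(r'<(h[0-9])>(.+?)</\1>', re.IGNORECASE | re.DOTALL)
print p.findall(html)   # [('h2', '\nCat\nstory\n')]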
This question has been asked in several forms over the last few days, so I'm going to say this very clearly.
Use BeautifulSoup, html5lib or lxml.html. Please.
As has been mentioned, you should use a parser instead of a regex.
This is how you could do it with a regex though:
import re
html = '''
<body>
<h1>Dog </h1>
<h2>Cat </h2>
<h3>Fancy </h3>
<h1>Tall cup of lemons</h1>
<h1><a href="dog.com">Dog thing</a></h1>
</body>
'''
p = re.compile(r'''
    <(?P<header>h[0-9])>           # store the header tag for later use
    \s*                            # zero or more whitespace
    (<a\shref="(?P<href>.*?)">)?   # optional link tag; store the href portion
    \s*
    (?P<title>.*?)                 # the title text
    \s*
    (</a>)?                        # optional closing link tag
    \s*
    </(?P=header)>                 # must match the opening header tag
''', re.IGNORECASE | re.VERBOSE)

stories = p.finditer(html)
for match in stories:
    print '%(title)s [%(href)s]' % match.groupdict()
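For the sample html above, this should print something like:
Dog [None]
Cat [None]
Fancy [None]
Tall cup of lemons [None]
Dog thing [dog.com]
(Headers without a link report None for the href, since the optional group never participated in the match.)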
Parsing with regular expressions works for regular languages. HTML is not a regular language, and the markup you find on web pages these days is absolute crap. BeautifulSoup deals with tag-soup HTML using browser-like heuristics, so you get parsed HTML that resembles what a browser would display.
The downside is that it's not very fast. There's lxml for parsing well-formed HTML, but you should really use BeautifulSoup if you're not 100% certain that your input will always be well-formed.
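If your input is well-formed, the lxml version is only a few lines (a minimal sketch, assuming lxml is installed and html holds the page source):
import lxml.html

doc = lxml.html.fromstring(html)
for el in doc.xpath('//h1 | //h2 | //h3 | //h4 | //h5 | //h6'):
    print el.text_content()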
I've used BeautifulSoup to parse your desired HTML. I saved the HTML above in a file called foo.html and read it back as a file object.
from BeautifulSoup import BeautifulSoup
H_TAGS = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']
def extract_data():
    """Extract the data from all headers in an HTML page."""
    f = open('foo.html', 'r')
    html = f.read()
    f.close()
    soup = BeautifulSoup(html)
    headers = [soup.findAll(h) for h in H_TAGS if soup.findAll(h)]
    lst = []
    for x in headers:
        for y in x:
            if y.string:
                lst.append(y.string)
            else:
                # e.g. a header wrapping a link: take the link's text
                lst.append(y.contents[0].string)
    return lst
The above function returns:
>>> [u'Dog ', u'Tall cup of lemons', u'Dog thing', u'Cat ', u'Fancy ']
You can add any number of header tags to the H_TAGS list; I've assumed all six here. If you can solve things easily using BeautifulSoup, then it's better to use it. :)
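As a side note, findAll also accepts a list of tag names, so the same extraction can be done in one pass, in document order (a sketch against the same foo.html):
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(open('foo.html').read())
# one call over all header tags; fall back to the first child's
# string when the header wraps another tag (like a link)
print [h.string or h.contents[0].string
       for h in soup.findAll(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])]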