Regex within html tags

前端未结

关注

 5  529

I would like to parse the HD price from the following snipper of HTML. I am only have fragments of the html code, so I cannot use an HTML parser for this.


                      
              相关标签:


      
      
        
          5条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  清酒与你        
                
              
                            
                2021-01-24 10:25
              
            
            
                                                                       
You've asked for a regular expression here, but it's not the right tool for parsing HTML. Use BeautifulSoup for this.

>>> from bs4 import BeautifulSoup
>>> html = '''
<div id="left-stack">        
  <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>'''
>>> soup = BeautifulSoup(html)
>>> val  = soup.find('span', {'class':'price'}).text
>>> print val[1:]
19.99

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  無奈伤痛        
                
              
                            
                2021-01-24 10:33
              
            
            
                                                                       
You can still parse using BeautifulSoup, you don't need the full html:

from bs4 import BeautifulSoup
html="""
<div id="left-stack">
  <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>
"""

soup = BeautifulSoup(html)
sp = soup.find(attrs={"class":"price"}) 
print sp.text[1:]
19.99

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  青春惊慌失措        
                
              
                            
                2021-01-24 10:33
              
            
            
                                                                       
The current BeautifulSoup answers only show how to grab all <span class="price"> tags. This is better:

from bs4 import BeautifulSoup

soup = """<div id="left-stack">        
 <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>"""

for HD_Version in (tag for tag in soup('li') if tag.text.lower() == 'hd version'):
    price = HD_Version.parent.findPreviousSibling('span', attrs={'class':'price'}).text


In general, using regular expressions to parse an irregular language like HTML is asking for trouble. Stick with an established parser.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  Happy的楠姐        
                
              
                            
                2021-01-24 10:38
              
            
            
                                                                       
BeautifulSoup is very lenient to the HTML it parses, you can use it for the chunks/parts of HTML too:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

data = u"""
<div id="left-stack">
  <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>
"""

soup = BeautifulSoup(data)
print soup.find('span', class_='price').text[1:]


Prints:

19.99

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  余生分开走        
                
              
                            
                2021-01-24 10:38
              
            
            
                                                                       
You can use this regex:

\d+(?:\.\d+)?(?=\D+HD Version)



\D+ skips ahead of non-digits in a lookahead, effectively asserting that our match (19.99) is the last digit ahead of HD Version.


Here is a regex demo.

Use the i modifier in the regex to make the matching case-insensitive and change + to* if the number can be directly before HD Version.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复