Regex within html tags

孤独总比滥情好 2021-01-24 10:19

I would like to parse the HD price from the following snippet of HTML. I only have fragments of the HTML code, so I cannot use an HTML parser for this.



        
5 Answers
  • 2021-01-24 10:25

    You've asked for a regular expression here, but it's not the right tool for parsing HTML. Use BeautifulSoup for this.

    >>> from bs4 import BeautifulSoup
    >>> html = '''
    <div id="left-stack">        
      <span>View In iTunes</span></a>
     <span class="price">£19.99</span>
     <ul class="list">
        <li>HD Version</li>'''
    >>> soup = BeautifulSoup(html, 'html.parser')
    >>> val = soup.find('span', {'class': 'price'}).text
    >>> print(val[1:])
    19.99
    
  • 2021-01-24 10:33

    You can still parse this with BeautifulSoup; you don't need the full HTML:

    from bs4 import BeautifulSoup
    html="""
    <div id="left-stack">
      <span>View In iTunes</span></a>
     <span class="price">£19.99</span>
     <ul class="list">
        <li>HD Version</li>
    """
    
    soup = BeautifulSoup(html, "html.parser")
    sp = soup.find(attrs={"class": "price"})
    print(sp.text[1:])  # 19.99
    
  • 2021-01-24 10:33

    The current BeautifulSoup answers only show how to grab the first <span class="price"> tag, without checking that it belongs to the HD version. This ties the price to the HD Version item:

    from bs4 import BeautifulSoup
    
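    html = """<div id="left-stack">
     <span>View In iTunes</span></a>
     <span class="price">£19.99</span>
     <ul class="list">
        <li>HD Version</li>"""

    # Parse the fragment so that tags can be searched.
    soup = BeautifulSoup(html, 'html.parser')

    # For each <li> labelled "HD Version", take the price <span> that precedes its parent list.
    for hd_version in (tag for tag in soup('li') if tag.text.strip().lower() == 'hd version'):
        price = hd_version.parent.find_previous_sibling('span', class_='price').text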
    

    In general, using regular expressions to parse a non-regular language like HTML is asking for trouble. Stick with an established parser.
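
    As a rough illustration of how fragile the regex route can be, here is a small sketch using only the standard `re` module and the pattern from another answer in this thread. The `changed` markup is a hypothetical edit (a digit added to a class name), not something taken from the question.

```python
import re

# Pattern taken from another answer in this thread.
pattern = re.compile(r'\d+(?:\.\d+)?(?=\D+HD Version)')

# Shortened version of the fragment from the question.
original = '<span class="price">£19.99</span>\n<ul class="list">\n<li>HD Version</li>'

# Hypothetical change: the class name now contains a digit.
changed = '<span class="price">£19.99</span>\n<ul class="list2">\n<li>HD Version</li>'

original_hit = pattern.search(original).group(0)
changed_hit = pattern.search(changed).group(0)

print(original_hit)  # 19.99
print(changed_hit)   # 2  (the digit from "list2", not the price)
```

    A parser-based lookup keyed on the tag and class is not affected by a change like this.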

  • 2021-01-24 10:38

    BeautifulSoup is very lenient about the HTML it parses; you can use it on fragments of HTML too:

    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    
    data = u"""
    <div id="left-stack">
      <span>View In iTunes</span></a>
     <span class="price">£19.99</span>
     <ul class="list">
        <li>HD Version</li>
    """
    
    soup = BeautifulSoup(data, 'html.parser')
    print(soup.find('span', class_='price').text[1:])
    

    Prints:

    19.99
    
  • 2021-01-24 10:38

    You can use this regex:

    \d+(?:\.\d+)?(?=\D+HD Version)
    
    • \D+ in the lookahead skips over non-digit characters, effectively asserting that our match (19.99) is the last number before HD Version.

    Here is a regex demo.

    Use the i modifier in the regex to make the matching case-insensitive, and change + to * if the number can be directly before HD Version.
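
    A minimal sketch of applying this in Python, assuming the standard `re` module and the fragment from the question. The variable names and the `'19.99hd version'` test string are made up for illustration.

```python
import re

# Fragment from the question; only part of the page is available.
html = '''<div id="left-stack">
  <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>'''

# A number followed only by non-digits up to the text "HD Version".
pattern = re.compile(r'\d+(?:\.\d+)?(?=\D+HD Version)')

match = pattern.search(html)
hd_price = match.group(0) if match else None
print(hd_price)  # 19.99

# Variant: case-insensitive, with \D* so the number may sit directly before the label.
loose = re.compile(r'\d+(?:\.\d+)?(?=\D*HD Version)', re.IGNORECASE)
loose_match = loose.search('19.99hd version')
print(loose_match.group(0))  # 19.99
```

    Checking `match` for `None` before calling `group` avoids an `AttributeError` when a fragment contains no HD price.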
