Python regular expression for HTML parsing (BeautifulSoup)

感情败类 2020-11-27 19:21

I want to grab the value of a hidden input field in HTML.


I

7 answers
  • 2020-11-27 19:38
    /<input\s+type="hidden"\s+name="([A-Za-z0-9_]+)"\s+value="([A-Za-z0-9_\-]*)"\s*\/>/
    
    >>> import re
    >>> s = '<input type="hidden" name="fooId" value="12-3456789-1111111111" />'
    >>> re.match(r'<input\s+type="hidden"\s+name="([A-Za-z0-9_]+)"\s+value="([A-Za-z0-9_\-]*)"\s*/>', s).groups()
    ('fooId', '12-3456789-1111111111')
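
    One caveat with the session above: re.match only matches at the start of the string, so it works there because the tag is the first thing in s. A minimal sketch, assuming the tag sits somewhere inside a larger page (the page string and the find_hidden helper are made up for illustration), might use re.search instead:

```python
import re

# Same pattern as above, written as a raw string so the backslashes
# reach the regex engine unchanged.
HIDDEN_INPUT = re.compile(
    r'<input\s+type="hidden"\s+name="([A-Za-z0-9_]+)"\s+value="([A-Za-z0-9_\-]*)"\s*/>'
)

def find_hidden(page):
    """Return (name, value) for the first matching hidden input, or None."""
    m = HIDDEN_INPUT.search(page)  # search scans the whole string; match would not
    return m.groups() if m else None

# Hypothetical page fragment, for illustration only.
page = '<form><p>hi</p><input type="hidden" name="fooId" value="12-3456789-1111111111" /></form>'
result = find_hidden(page)
print(result)
```

    Returning None when nothing matches avoids the AttributeError that calling .groups() on a failed match would raise.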
    
  • 2020-11-27 19:40

    For this particular case, BeautifulSoup is harder to write than a regex, but it is much more robust. I'm just contributing the BeautifulSoup example, given that you already know which regexp to use :-)

    from bs4 import BeautifulSoup  # BeautifulSoup 4; "from BeautifulSoup import ..." was version 3
    
    # Or retrieve it from the web, etc.
    with open('/yourwebsite/page.html', 'r') as f:
        html_data = f.read()
    
    # Create the soup object from the HTML data
    soup = BeautifulSoup(html_data, 'html.parser')
    # Find the proper tag; attrs= is needed because find() already uses "name" for the tag name
    fooId = soup.find('input', attrs={'name': 'fooId', 'type': 'hidden'})
    value = fooId['value']  # The value attribute of the desired tag
    
  • 2020-11-27 19:41
    import re
    # inputHTML is assumed to hold the page source as a string
    reg = re.compile(r'<input type="hidden" name="[^"]*" value="([^"]*)" />')
    value = reg.search(inputHTML).group(1)  # group 1 is the value attribute
    print('Value is', value)
    
  • 2020-11-27 19:45

    Parsing is one of those areas where you really don't want to roll your own if you can avoid it, as you'll be chasing down the edge cases and bugs for years to come.

    I'd recommend using BeautifulSoup. It has a very good reputation and looks from the docs like it's pretty easy to use.
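
    To illustrate the kind of edge cases a real parser absorbs, here is a minimal sketch. It uses the standard library's html.parser rather than BeautifulSoup, only so that it runs without installing anything. The HiddenInputCollector class and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser

class HiddenInputCollector(HTMLParser):
    """Collect name -> value for every <input type="hidden"> tag."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us;
        # attrs arrives as a list of (name, value) pairs in any order.
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.hidden[attr["name"]] = attr.get("value")

# Hypothetical markup: mixed case, shuffled attributes, a commented-out tag.
sample = '''<html><body>
<INPUT NAME="fooId" type="hidden" value="12-345" />
<input value="abc" name="token" type="hidden">
<!-- <input type="hidden" name="ignored" value="x" /> -->
<input type="text" name="visible" value="y">
</body></html>'''

collector = HiddenInputCollector()
collector.feed(sample)
collector.close()
hidden = collector.hidden
print(hidden)
```

    The commented-out tag is never reported because the parser hands comments to a separate callback, and attribute order or case does not matter. A fixed regex would need extra work to get either of those right.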

  • 2020-11-27 19:54

    Pyparsing is a good middle ground between BeautifulSoup and regex. It is more robust than plain regexes, since its HTML tag parsing handles variations in case, whitespace, and attribute presence, absence, and order, yet it is simpler than BeautifulSoup for this kind of basic tag extraction.

    Your example is especially simple, since everything you are looking for is in the attributes of the opening "input" tag. Here is a pyparsing example showing several variations on your input tag that would give regexes fits, and also shows how NOT to match a tag if it is within a comment:

    html = """<html><body>
    <input type="hidden" name="fooId" value="**[id is here]**" />
    <blah>
    <input name="fooId" type="hidden" value="**[id is here too]**" />
    <input NAME="fooId" type="hidden" value="**[id is HERE too]**" />
    <INPUT NAME="fooId" type="hidden" value="**[and id is even here TOO]**" />
    <!--
    <input type="hidden" name="fooId" value="**[don't report this id]**" />
    -->
    <foo>
    </body></html>"""
    
    from pyparsing import makeHTMLTags, withAttribute, htmlComment
    
    # use makeHTMLTags to create tag expression - makeHTMLTags returns expressions for
    # opening and closing tags, we're only interested in the opening tag
    inputTag = makeHTMLTags("input")[0]
    
    # only want input tags with special attributes
    inputTag.setParseAction(withAttribute(type="hidden", name="fooId"))
    
    # don't report tags that are commented out
    inputTag.ignore(htmlComment)
    
    # use searchString to skip through the input 
    foundTags = inputTag.searchString(html)
    
    # dump out first result to show all returned tags and attributes
    print(foundTags[0].dump())
    print()
    
    # print out the value attribute for all matched tags
    for inpTag in foundTags:
        print(inpTag.value)
    

    Prints:

    ['input', ['type', 'hidden'], ['name', 'fooId'], ['value', '**[id is here]**'], True]
    - empty: True
    - name: fooId
    - startInput: ['input', ['type', 'hidden'], ['name', 'fooId'], ['value', '**[id is here]**'], True]
      - empty: True
      - name: fooId
      - type: hidden
      - value: **[id is here]**
    - type: hidden
    - value: **[id is here]**
    
    **[id is here]**
    **[id is here too]**
    **[id is HERE too]**
    **[and id is even here TOO]**
    

    You can see that not only does pyparsing match these unpredictable variations, it returns the data in an object that makes it easy to read out the individual tag attributes and their values.

  • 2020-11-27 19:55
    /<input type="hidden" name="fooId" value="([\d-]+)" \/>/
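
    The pattern above is in slash-delimited form. A rough Python equivalent, as a sketch (the FOO_ID constant and foo_id helper are names invented here), drops the delimiters and the escaped slash:

```python
import re

# The slash-delimited pattern above as a Python raw string:
# no delimiters, and "/" needs no escaping.
FOO_ID = re.compile(r'<input type="hidden" name="fooId" value="([\d-]+)" />')

def foo_id(html):
    """Return the fooId value if the tag appears in exactly this form, else None."""
    m = FOO_ID.search(html)
    return m.group(1) if m else None

print(foo_id('<p><input type="hidden" name="fooId" value="12-3456789-1111111111" /></p>'))
# Letters fall outside [\d-], so this value is not matched.
print(foo_id('<input type="hidden" name="fooId" value="abc" />'))
# A different attribute order is not matched either.
print(foo_id('<input name="fooId" type="hidden" value="1-2" />'))
```

    This pattern is strict by design: it accepts only digits and hyphens in the value, and only this exact spacing and attribute order.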
    