Replace SRC of all IMG elements using Parser

伪装坚强ぢ 2020-12-06 12:41

I am looking for a way to replace the SRC attribute in all IMG tags without using regular expressions. (I would like to use any out-of-the-box HTML parser included with default Py

2 Answers
  • 2020-12-06 13:24

    Here is a pyparsing approach to your problem. You'll need to write your own code to transform the src attribute.

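    from urllib.request import urlopen

    from pyparsing import makeHTMLTags

    # makeHTMLTags returns (opening tag, closing tag); <img> needs only the first.
    imgtag = makeHTMLTags("img")[0]

    # Fetch the page and decode it to text.
    with urlopen("http://www.yahoo.com") as page:
        html = page.read().decode("utf-8", errors="replace")

    # print(html)

    def modifySrcRef(tokens):
        """Rebuild the <img> tag, transforming only the src attribute."""
        ret = "<img"
        for k, i in tokens.items():
            # Skip the names pyparsing adds itself; they are not HTML attributes.
            if k in ("startImg", "empty", "tag"):
                continue
            if k.lower() == "src":
                # or do whatever with this
                i = i.upper()
            ret += ' %s="%s"' % (k, i)
        return ret + " />"

    imgtag.setParseAction(modifySrcRef)

    print(imgtag.transformString(html))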
    

    The tags convert to:

    <img src="HTTP://L.YIMG.COM/A/I/WW/BETA/Y3.GIF" title="Yahoo" height="44" width="232" alt="Yahoo!" />
    <a href="r/xy"><img src="HTTP://L.YIMG.COM/A/I/WW/TBL/ALLYS.GIF" height="20" width="138" alt="All Yahoo! Services" border="0" /></a>
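
The `i.upper()` line in `modifySrcRef` is only a placeholder for whatever transformation is needed. As one possible illustration, here is a minimal sketch that assumes the goal is to turn relative image paths into absolute URLs. `make_absolute` and `BASE_URL` are hypothetical names chosen for this example and are not part of pyparsing; only `urllib.parse.urljoin` from the standard library is used.

```python
from urllib.parse import urljoin

# Hypothetical base URL: assumed to be the address the page was fetched from.
BASE_URL = "http://www.yahoo.com/"


def make_absolute(src, base=BASE_URL):
    """Resolve a possibly relative src against the page's base URL."""
    return urljoin(base, src)


# A relative path is resolved against the base; an absolute URL is left alone.
print(make_absolute("r/xy.gif"))
print(make_absolute("http://l.yimg.com/a/i/ww/beta/y3.gif"))
```

Inside `modifySrcRef`, `i = make_absolute(i)` would then take the place of `i = i.upper()`.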
    
  • 2020-12-06 13:35

    There is an HTML parser in the Python standard library (the htmllib module), but it’s not very useful and it has been deprecated since Python 2.6. Doing this kind of thing with BeautifulSoup is really easy:

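    from os.path import basename, splitext

    from bs4 import BeautifulSoup  # BeautifulSoup 4 (the beautifulsoup4 package)

    # my_html_string holds the HTML document to rewrite.
    soup = BeautifulSoup(my_html_string, "html.parser")
    # src=True skips <img> tags that have no src attribute.
    for img in soup.find_all("img", src=True):
        # e.g. "images/logo.png" becomes "cid:logo"
        img["src"] = "cid:" + splitext(basename(img["src"]))[0]
    my_html_string = str(soup)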
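
Since the question asks for a parser that ships with Python, the same rewrite can also be sketched with `html.parser.HTMLParser` from the Python 3 standard library. This is a rough sketch rather than a drop-in replacement for BeautifulSoup: the names `ImgSrcRewriter`, `replace_img_src` and `to_cid` are made up for this example, and the output is re-serialized (tag and attribute names come out lowercased, attribute values double-quoted), so it will not be byte-identical to the input.

```python
from html import escape
from html.parser import HTMLParser
from os.path import basename, splitext


class ImgSrcRewriter(HTMLParser):
    """Re-emit the parsed HTML, rewriting the src of every <img> tag."""

    def __init__(self, rewrite):
        # convert_charrefs=False leaves entities in text (e.g. &amp;) as
        # written, so they can be passed through unchanged.
        super().__init__(convert_charrefs=False)
        self.rewrite = rewrite  # function: old src -> new src
        self.out = []

    def _emit_tag(self, tag, attrs, closer):
        parts = [tag]
        for name, value in attrs:
            if tag == "img" and name == "src" and value is not None:
                value = self.rewrite(value)
            if value is None:
                parts.append(name)  # attribute without a value
            else:
                # The parser unescapes attribute values, so escape them again.
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + closer)

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def replace_img_src(html_text, rewrite):
    """Return html_text with every <img> src passed through rewrite()."""
    parser = ImgSrcRewriter(rewrite)
    parser.feed(html_text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


def to_cid(src):
    # Same transformation as the BeautifulSoup example: file name without extension.
    return "cid:" + splitext(basename(src))[0]


sample = '<p>Logo: <a href="r/xy"><img src="images/logo.png" alt="Logo"></a></p>'
result = replace_img_src(sample, to_cid)
print(result)
```

Any function from the old src string to the new one can be passed as `rewrite`. For messy real-world markup, BeautifulSoup is likely to be more forgiving than this hand-rolled serializer.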
    