Decode HTML entities in Python string?

名媛妹妹 2020-11-21 06:18

I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn't automatically decode for me:

>>> from Be         


        
6 answers
  •  既然无缘
    2020-11-21 06:56

    This probably isn't relevant here, but to eliminate these HTML entities from an entire document, you can do something like this. (Assume document = page, and please forgive the sloppy code; if you have ideas on how to make it better, I'm all ears, as I'm new to this.)

    import re
    try:
        from html import unescape  # Python 3.4+
    except ImportError:  # Python 2 fallback
        from HTMLParser import HTMLParser
        unescape = HTMLParser().unescape

    # Match only well-formed entities: named (&amp;), decimal (&#38;), hex (&#x26;)
    regexp = r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);"
    # Substitute every match in one pass so already-decoded text is never decoded twice
    page = re.sub(regexp, lambda m: unescape(m.group(0)), page)
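
For comparison, here is a minimal sketch of the same idea on Python 3, assuming Python 3.4 or later, where the standard library's html.unescape decodes a whole string in one call and no regular expression is needed. The sample string and the name decoded are made up for illustration and do not come from the original page.

```python
import html

# Hypothetical sample input mixing named, decimal and hexadecimal references.
page = "&pound;682m &amp; &#163;5 &#xA3;1 &lt;b&gt;"

# html.unescape converts every character reference in the string in a single pass.
decoded = html.unescape(page)

# A reference that is itself escaped is only decoded one level.
once = html.unescape("&amp;lt;")

print(decoded)
print(once)
```

Because the string is decoded in a single pass, escaped text such as `&amp;lt;` comes out as `&lt;` and is not decoded a second time, which a find-then-replace loop over the document can do by accident.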
    
