Decoding HTML entities with Python

野趣味 2020-12-04 16:46

I'm trying to decode HTML entities from NYTimes.com, and I cannot figure out what I am doing wrong.

Take for example:

"U.S. Adviser&#8217;s Blunt Memo on Iraq: Time &#8216;to Go Home&#8217;"


        
4 answers
  • 2020-12-04 17:34

    Actually what you have are not HTML entities. There are THREE varieties of those &.....; thingies -- for example &#160; &#xa0; &nbsp; all mean U+00A0 NO-BREAK SPACE.

    &#160; (the type you have) is a "numeric character reference" (decimal).
    &#xa0; is a "numeric character reference" (hexadecimal).
    &nbsp; is an entity.

    Further reading: http://htmlhelp.com/reference/html40/entities/

    Here you will find code for Python2.x that does all three in one scan through the input: http://effbot.org/zone/re-sub.htm#unescape-html
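
    For Python 3, the single-scan idea can be sketched roughly as follows. This is not the effbot code; the names `unescape_references` and `_REFERENCE` are made up for illustration, and the sketch assumes every reference ends with a semicolon.

```python
import re
from html.entities import name2codepoint

# One pattern covers all three varieties: hexadecimal, decimal, and named.
_REFERENCE = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def unescape_references(text):
    """Replace decimal, hexadecimal and named references in a single pass."""
    def replace(match):
        body = match.group(1)
        try:
            if body[:2] in ("#x", "#X"):
                return chr(int(body[2:], 16))  # hexadecimal reference
            if body[0] == "#":
                return chr(int(body[1:]))      # decimal reference
            return chr(name2codepoint[body])   # named entity
        except (KeyError, ValueError, OverflowError):
            return match.group(0)              # leave unknown references as-is
    return _REFERENCE.sub(replace, text)


# All three spellings decode to U+00A0 NO-BREAK SPACE.
print(unescape_references("&#160;|&#xa0;|&nbsp;") == "\xa0|\xa0|\xa0")
```

    Unknown names and out-of-range code points are left untouched rather than dropped, so malformed input survives the pass unchanged.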

  • 2020-12-04 17:36

    This does work:

    from BeautifulSoup import BeautifulStoneSoup
    s = "U.S. Adviser&#8217;s Blunt Memo on Iraq: Time &#8216;to Go Home&#8217;"
    decoded = BeautifulStoneSoup(s, convertEntities=BeautifulStoneSoup.HTML_ENTITIES)
    

    If you want a byte string instead of a Unicode object, you'll need to encode it with an encoding that supports the characters being used; ISO-8859-1 doesn't:

    result = decoded.encode("UTF-8")
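
    The encoding point can be checked on its own, without BeautifulSoup. A small Python 3 sketch (the variable names are illustrative only): the curly apostrophe U+2019 encodes fine as UTF-8 but has no representation in ISO-8859-1.

```python
text = "U.S. Adviser\u2019s Blunt Memo"

# UTF-8 can represent every Unicode code point, including U+2019.
utf8_bytes = text.encode("utf-8")

# ISO-8859-1 only covers U+0000..U+00FF, so U+2019 cannot be encoded.
try:
    text.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(utf8_bytes)
print(latin1_ok)
```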
    

    It's unfortunate that you need an external module for something like this; simple HTML/XML entity decoding should be in the standard library, and not require me to use a library with meaningless class names like "BeautifulStoneSoup". (Class and function names should not be "creative", they should be meaningful.)

  • 2020-12-04 17:47

    Try this:

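    import re

    def _callback(matches):
        # Convert the captured decimal code point to its character.
        # Python 2 code: on Python 3, use chr() instead of unichr().
        code = matches.group(1)
        try:
            return unichr(int(code))
        except (ValueError, OverflowError):
            # Out-of-range code point: leave the reference unchanged.
            return matches.group(0)

    def decode_unicode_references(data):
        # Matches decimal references such as &#8217; (the semicolon may be
        # missing when whitespace follows).
        return re.sub(r"&#(\d+)(;|(?=\s))", _callback, data)

    data = "U.S. Adviser&#8217;s Blunt Memo on Iraq: Time &#8216;to Go Home&#8217;"
    print decode_unicode_references(data)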
    
  • 2020-12-04 17:49
    >>> from HTMLParser import HTMLParser
    >>> print HTMLParser().unescape('U.S. Adviser&#8217;s Blunt Memo on Iraq: '
    ...                             'Time &#8216;to Go Home&#8217;')
    U.S. Adviser’s Blunt Memo on Iraq: Time ‘to Go Home’
    

    The function is undocumented in Python 2. It is fixed in Python 3.4+: it is exposed as html.unescape() there.
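
    On Python 3.4+, the equivalent call would look something like this (the variable names are just for illustration):

```python
import html

headline = "U.S. Adviser&#8217;s Blunt Memo on Iraq: Time &#8216;to Go Home&#8217;"

# html.unescape() handles decimal, hexadecimal and named references.
decoded = html.unescape(headline)
print(decoded)
```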
