BeautifulSoup gives me unicode+html symbols, rather than straight up unicode. Is this a bug or misunderstanding?

滥情空心 2021-01-13 02:53

I'm using BeautifulSoup to scrape a website. The website's page renders fine in my browser:

Oxfam International’s report entitled “Offside! http:

2 Answers
  • 2021-01-13 03:08

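    It's actually UTF-8 that has been mis-decoded as CP1252. Encoding the text back to CP1252 recovers the original bytes, which then decode correctly as UTF-8: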

    >>> print u'Oxfam International\xe2€™s report entitled \xe2€œOffside!'.encode('cp1252').decode('utf8')
    Oxfam International’s report entitled “Offside!
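
The session above uses Python 2 syntax. A minimal Python 3 sketch of the same round trip might look like the following; the helper name `fix_mojibake` is made up for illustration, and the fallback of returning the input unchanged is an assumption about what a caller would want.

```python
def fix_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as CP1252.

    fix_mojibake is a hypothetical helper name, not part of any library.
    """
    try:
        # Re-encode to recover the original bytes, then decode them properly.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not CP1252-decoded UTF-8, so leave it alone.
        return text


# The garbled string from the question, written with escapes so the
# example does not depend on the source file's own encoding.
garbled = ("Oxfam International\u00e2\u20ac\u2122s report entitled "
           "\u00e2\u20ac\u0153Offside!")
print(fix_mojibake(garbled))
# Oxfam International’s report entitled “Offside!
```

This only repairs text after the fact. Decoding the page with the right charset in the first place avoids the problem.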
    
  • 2021-01-13 03:26

    That's one seriously messed up page, encoding-wise :-)

    There's nothing really wrong with your approach at all. I would probably tend to do the conversion before passing it to BeautifulSoup, just because I'm persnickety:

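    import urllib
    # Assumes BeautifulSoup 3 on Python 2; with bs4 use: from bs4 import BeautifulSoup
    from BeautifulSoup import BeautifulSoup

    # Fetch the raw bytes and decode them explicitly before parsing
    html = urllib.urlopen('http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271').read()
    h = html.decode('iso-8859-1')
    soup = BeautifulSoup(h)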
    

    In this case, the page's meta tag is lying about the encoding. The page is actually in utf-8... Firefox's page info reveals the real encoding, and you can actually see this charset in the response headers returned by the server:

    curl -i http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271
    HTTP/1.1 200 OK
    Connection: close
    Date: Tue, 10 Mar 2009 13:14:29 GMT
    Server: Microsoft-IIS/6.0
    X-Powered-By: ASP.NET
    Set-Cookie: COMPANYID=271;path=/
    Content-Language: en-US
    Content-Type: text/html; charset=UTF-8
    

    If you do the decode using 'utf-8', it will work for you (or, at least, it did for me):

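    import urllib
    # Assumes BeautifulSoup 3 on Python 2; with bs4 use: from bs4 import BeautifulSoup
    from BeautifulSoup import BeautifulSoup

    html = urllib.urlopen('http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271').read()
    # Decode with the charset from the response header, not the meta tag
    h = html.decode('utf-8')
    soup = BeautifulSoup(h)
    # Calling a tag is shorthand for findAll: collect every <p> in the body
    ps = soup.body("p")
    p = ps[52]
    print p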
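
The header check done with curl above can also be done in code. The following is a rough Python 3 sketch, not the answerer's method: it reads the charset out of a `Content-Type` value with the standard library's `email.message.Message` and decodes the body with it. The function names `charset_from_content_type` and `decode_body`, and the UTF-8 default, are assumptions made for illustration.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset named in a Content-Type value (hypothetical helper)."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset lowercases the name and returns `default`
    # when the header carries no charset parameter.
    return msg.get_content_charset(default)


def decode_body(raw_bytes, header_value):
    """Decode a response body using the charset the server declared."""
    charset = charset_from_content_type(header_value)
    return raw_bytes.decode(charset, errors="replace")


header = "text/html; charset=UTF-8"   # value taken from the curl output
body = "\u201cOffside!".encode("utf-8")  # stand-in for the downloaded bytes

print(charset_from_content_type(header))
print(decode_body(body, header))
```

In Python 3, the response object returned by `urllib.request.urlopen` exposes its headers as a `Message` subclass, so `response.headers.get_content_charset()` should give the same answer without building the header by hand.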
    