How to download any(!) webpage with the correct charset in Python?

醉酒成梦 2020-11-30 20:16

Problem

When screen-scraping a webpage using python one has to know the character encoding of the page. If you get the character encoding wrong th

7 answers
  • 2020-11-30 20:47

    When you download a file with urllib or urllib2, you can find out whether a charset header was transmitted:

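    import urllib2  # Python 2; request is a URL string or urllib2.Request
    fp = urllib2.urlopen(request)
    # getparam() returns None when Content-Type has no charset parameter
    charset = fp.headers.getparam('charset')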
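
    On Python 3 the same header lookup can be sketched with the standard library alone. This is a minimal sketch, not part of the original answer: the helper name charset_from_content_type is made up for illustration, and it parses a Content-Type string you already have rather than fetching anything.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lower-cased charset parameter of a Content-Type value, or None.

    Hypothetical helper for illustration. With urllib.request, the response
    headers object is itself an email.message.Message subclass, so
    response.headers.get_content_charset() gives the same answer directly.
    """
    msg = Message()
    msg['Content-Type'] = header_value
    return msg.get_content_charset()


if __name__ == '__main__':
    print(charset_from_content_type('text/html; charset=GB2312'))
    print(charset_from_content_type('text/html'))
```

    A None result means the server did not declare a charset, so the next source (the <meta> element, or detection) has to be consulted.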
    

    You can use BeautifulSoup to locate a meta element in the HTML:

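    import BeautifulSoup  # BeautifulSoup 3 module layout
    soup = BeautifulSoup.BeautifulSoup(data)
    # Guard against None: meta tags without http-equiv would otherwise raise
    meta = soup.findAll('meta', {'http-equiv': lambda v: v is not None and v.lower() == 'content-type'})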
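
    If BeautifulSoup is not available, the <meta> lookup can be approximated with the standard library. The sketch below assumes Python 3; the names MetaCharsetParser, charset_from_meta and the 1024-byte scan limit are illustrative choices, not anything the answer prescribes. It handles both the http-equiv form and the shorter <meta charset="..."> form.

```python
from email.message import Message
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Remember the first charset declared in a <meta> element."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != 'meta' or self.charset is not None:
            return
        attrs = dict(attrs)  # HTMLParser already lower-cases attribute names
        if attrs.get('charset'):
            # <meta charset="utf-8">
            self.charset = attrs['charset'].strip().lower()
        elif (attrs.get('http-equiv') or '').lower() == 'content-type':
            # <meta http-equiv="Content-Type" content="text/html; charset=...">
            msg = Message()
            msg['Content-Type'] = attrs.get('content') or ''
            self.charset = msg.get_content_charset()


def charset_from_meta(html_bytes, scan_limit=1024):
    """Scan the first scan_limit bytes for a <meta> charset declaration."""
    parser = MetaCharsetParser()
    # latin-1 maps every byte to a character, so decoding cannot fail here;
    # the markup we are looking for is plain ASCII anyway.
    parser.feed(html_bytes[:scan_limit].decode('latin-1'))
    return parser.charset
```

    Decoding as latin-1 is only a trick to let the parser see the ASCII markup before the real encoding is known; the result should not be used as the page text.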
    

    If neither is available, browsers typically fall back to user configuration, combined with auto-detection. As rajax proposes, you could use the chardet module. If you have user configuration available telling you that the page should be Chinese (say), you may be able to do better.

  • 2020-11-30 20:49

    Use the Universal Encoding Detector:

    >>> import chardet, urllib2
    >>> chardet.detect(urllib2.urlopen("http://google.cn/").read())
    {'encoding': 'GB2312', 'confidence': 0.99}
    

    The other option would be to just use wget:

    import os
    # wget saves the raw bytes unchanged; it does not convert the charset
    h = os.popen('wget -q -O foo1.txt http://foo.html')
    h.close()
    s = open('foo1.txt', 'rb').read()
    
  • 2020-11-30 20:49

    BeautifulSoup does this with its UnicodeDammit class (documented as "Unicode, Dammit").

  • 2020-11-30 20:50

    Instead of fetching a page and then working out which charset the browser would use, why not use a browser to fetch the page and check which charset it reports?

    from win32com.client import DispatchWithEvents
    import pythoncom  # needed for PumpWaitingMessages() below
    import threading
    
    
    stopEvent=threading.Event()
    
    class EventHandler(object):
        def OnDownloadBegin(self):
            pass
    
    def waitUntilReady(ie):
        """
        copypasted from
        http://mail.python.org/pipermail/python-win32/2004-June/002040.html
        """
        if ie.ReadyState != 4:  # 4 == READYSTATE_COMPLETE
            while True:
                print("waiting")
                pythoncom.PumpWaitingMessages()
                stopEvent.wait(.2)
                if stopEvent.isSet() or ie.ReadyState == 4:
                    stopEvent.clear()
                    break
    
    ie = DispatchWithEvents("InternetExplorer.Application", EventHandler)
    ie.Visible = 0
    ie.Navigate('http://kskky.info')
    waitUntilReady(ie)
    d = ie.Document
    print(d.CharSet)
    
  • 2020-11-30 20:55

    Scrapy downloads a page and detects the correct encoding for it, unlike requests.get(url).text or urlopen. To do so it tries to follow browser-like rules, which is the best one can do, because website owners have an incentive to make their websites work in a browser. Scrapy has to take HTTP headers, <meta> tags, byte order marks (BOMs) and differences in encoding names into account.

    Content-based guessing (chardet, UnicodeDammit) is not a correct solution on its own, as it may fail; it should only be used as a last resort when headers, <meta> tags or BOMs are unavailable or provide no information.
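
    Of the signals listed above, the byte order mark is the cheapest to check. The following is a minimal standard-library sketch of such a check, assuming Python 3; sniff_bom is a hypothetical helper and is not claimed to match how Scrapy or w3lib implement it.

```python
import codecs

# UTF-32 marks must be tested before UTF-16 ones, because the UTF-32-LE mark
# (FF FE 00 00) begins with the UTF-16-LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]


def sniff_bom(data):
    """Return (encoding, body_without_bom), or (None, data) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data
```

    When a BOM is present the body can be decoded with the returned encoding straight away; otherwise the headers and <meta> tags still need to be consulted.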

    You don't have to use Scrapy to get its encoding detection functions; they are released (along with some other stuff) in a separate library called w3lib: https://github.com/scrapy/w3lib.

    To get the page encoding and the Unicode body, use the w3lib.encoding.html_to_unicode function, with a content-based guessing fallback:

    import chardet
    from w3lib.encoding import html_to_unicode
    
    def _guess_encoding(data):
        return chardet.detect(data).get('encoding')
    
    detected_encoding, html_content_unicode = html_to_unicode(
        content_type_header,
        html_content_bytes,
        default_encoding='utf8', 
        auto_detect_fun=_guess_encoding,
    )
    
  • 2020-11-30 21:09

    It seems like you need a hybrid of the answers presented:

    1. Fetch the page using urllib.
    2. Find <meta> tags using BeautifulSoup or another method.
    3. If no <meta> tags exist, check the headers returned by urllib.
    4. If that still doesn't give you an answer, use the Universal Encoding Detector.

    I honestly don't believe you're going to find anything better than that.

    In fact, if you read further into the FAQ you linked to in the comments on the other answer, that's what the author of the detector library advocates.

    If you believe the FAQ, this is what browsers do (as requested in your original question), as the detector is a port of the Firefox sniffing code.
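
    The four-step order listed in this answer can be sketched as a small selection function. This is only an illustration under stated assumptions: Python 3, with the three candidate names already obtained elsewhere (from a <meta> tag, from the Content-Type header, and from a detector such as chardet). The names choose_encoding and decode_page and the utf-8 default are hypothetical.

```python
import codecs


def _usable(name):
    """Return Python's canonical codec name, or None for empty/unknown labels."""
    if not name:
        return None
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def choose_encoding(meta_charset, header_charset, guess=None, default='utf-8'):
    """Pick the first usable candidate, in the order given in the answer:
    <meta> tag, then HTTP header, then content-based guess, then a default."""
    for candidate in (meta_charset, header_charset, guess):
        name = _usable(candidate)
        if name:
            return name
    return default


def decode_page(body, meta_charset=None, header_charset=None, guess=None):
    """Decode raw page bytes; undecodable bytes are replaced, not fatal."""
    encoding = choose_encoding(meta_charset, header_charset, guess)
    return encoding, body.decode(encoding, errors='replace')
```

    Passing the candidates in as arguments keeps the precedence in one place, so swapping the order (for example, trusting the HTTP header before the <meta> tag) is a one-line change.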
