python-requests: fetching the head of the response content without consuming it all

南旧 2020-12-30 16:27

Using python-requests and python-magic, I would like to test the mime-type of a web resource without fetching all its content (especially if this resource happens to be eg.

2 Answers
  •  囚心锁ツ
    2020-12-30 17:21

    Note: at the time this question was asked, the correct way to fetch only the headers and stream the body was to pass prefetch=False. That option has since been renamed to stream and its boolean value inverted, so you now want stream=True.
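
    As a rough illustration of the stream=True form, here is a minimal sketch that reads only the first chunk of the body and then releases the connection. The names read_head and fetch_head, and the 10-second timeout, are my own assumptions for the example; they are not part of requests.

```python
def read_head(response, size=256):
    """Return the first chunk (about `size` bytes) of a streamed response body."""
    try:
        # iter_content reads `size` bytes at a time; next() pulls only the first chunk.
        return next(response.iter_content(size), b"")
    finally:
        # A streamed response keeps its connection open until it is fully read or closed.
        response.close()


def fetch_head(url, size=256):
    """Hypothetical wrapper: open `url` in streaming mode and return its first bytes."""
    import requests  # third-party; imported here so read_head has no dependencies

    # stream=True downloads the headers only; the body waits until it is iterated.
    # The 10-second timeout is an arbitrary choice for this sketch.
    return read_head(requests.get(url, stream=True, timeout=10), size)
```

    The result can then be handed to magic.from_buffer(fetch_head(url), mime=True). Note that a chunk is not guaranteed to be exactly `size` bytes when the body is compressed and decoded on the fly.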

    The original answer follows.


    Once you use iter_content(), you have to continue using it; .text indirectly uses the same interface under the hood (via .content).

    In other words, by using iter_content() at all, you have to do the work .text does by hand:

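    import magic
    import requests
    from requests.compat import chardet

    # stream=True defers the body download (very old requests releases spelled this prefetch=False)
    r = requests.get("http://www.december.com/html/demo/hello.html", stream=True)
    # Python 3 iterators have no .next() method; use the next() builtin
    peek = next(r.iter_content(256))
    mime = magic.from_buffer(peek, mime=True)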
    
    if mime == "text/html":
        contents = peek + b''.join(r.iter_content(10 * 1024))
        encoding = r.encoding
        if encoding is None:
            # detect encoding
            encoding = chardet.detect(contents)['encoding']
        try:
            textcontent = str(contents, encoding, errors='replace')
        except (LookupError, TypeError):
            textcontent = str(contents, errors='replace')
        print(textcontent)
    

    presuming you use Python 3.
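
    The reason the peeked bytes have to be glued back on is that the stream can only be read once. A small sketch of that idea, independent of requests, might look like the following; peek_chunks is a hypothetical helper name, and it assumes `chunks` is any iterable of bytes such as r.iter_content(10 * 1024).

```python
import itertools


def peek_chunks(chunks, size=256):
    """Return (head, body): `head` is up to `size` leading bytes, and `body`
    is an iterator that replays the first chunk before the remaining ones."""
    chunks = iter(chunks)
    # Pull exactly one chunk off the stream; an empty stream yields b"".
    first = next(chunks, b"")
    # Chain the consumed chunk back in front so nothing is lost downstream.
    return first[:size], itertools.chain([first], chunks)
```

    With this, head, body = peek_chunks(r.iter_content(10 * 1024)) gives you something to pass to magic.from_buffer(), while b"".join(body) still returns the complete content.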

    The alternative is to make 2 requests:

    import magic
    import requests

    r = requests.get("http://www.december.com/html/demo/hello.html", stream=True)
    mime = magic.from_buffer(next(r.iter_content(256)), mime=True)
    
    if mime == "text/html":
        # second, non-streamed request, so .text can be used as usual
        print(requests.get("http://www.december.com/html/demo/hello.html").text)
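
    If you go the two-request route, it is probably worth closing the first (streamed) response, since it otherwise holds on to its connection. Below is a sketch of that flow with the HTTP call and the sniffer passed in, so it can be tried without a network. text_if_html, get and sniff are hypothetical names: get is assumed to behave like requests.get, and sniff like lambda buf: magic.from_buffer(buf, mime=True).

```python
def text_if_html(url, get, sniff, size=256):
    """Return the page text if the first bytes look like HTML, otherwise None."""
    # First request: streamed, so only the headers and one small chunk are read.
    probe = get(url, stream=True)
    try:
        head = next(probe.iter_content(size), b"")
    finally:
        # Release the connection instead of leaving the body half-read.
        probe.close()

    if sniff(head) != "text/html":
        return None

    # Second request: an ordinary one, so the library handles decoding for .text.
    return get(url).text
```

    In real use this would be called as text_if_html(url, requests.get, lambda buf: magic.from_buffer(buf, mime=True)). The cost is a second round trip, which is the trade-off of this alternative.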
    

    Python 2 version:

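    import magic
    import requests
    from requests.compat import chardet

    # stream=True replaces the old prefetch=False spelling
    r = requests.get("http://www.december.com/html/demo/hello.html", stream=True)
    peek = r.iter_content(256).next()
    mime = magic.from_buffer(peek, mime=True)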
    
    if mime == "text/html":
        contents = peek + ''.join(r.iter_content(10 * 1024))
        encoding = r.encoding
        if encoding is None:
            # detect encoding
            encoding = chardet.detect(contents)['encoding']
        try:
            textcontent = unicode(contents, encoding, errors='replace')
        except (LookupError, TypeError):
            textcontent = unicode(contents, errors='replace')
        print(textcontent)
    
