Scrape title by only downloading relevant part of webpage

深忆病人 2021-02-05 10:45

I would like to scrape just the title of a webpage using Python. I need to do this for thousands of sites so it has to be fast. I've seen previous questions like retrieving jus

6 answers
  •  我在风中等你
    2021-02-05 11:21

    You can defer downloading the response body by enabling the stream mode of requests (stream=True), so that only as much of the page as you actually read is fetched.

    Requests 2.14.2 documentation - Advanced Usage

    By default, when you make a request, the body of the response is downloaded immediately. You can override this behaviour and defer downloading the response body until you access the Response.content attribute with the stream parameter:

    ...

    If you set stream to True when making a request, Requests cannot release the connection back to the pool unless you consume all the data or call Response.close. This can lead to inefficiency with connections. If you find yourself partially reading request bodies (or not reading them at all) while using stream=True, you should consider using contextlib.closing.

    So, with this method, you can read the response chunk by chunk and stop as soon as the buffered text contains a complete title tag. Redirects are handled by the library, so no extra work is needed for them.

    Here is a rough, error-prone sketch, tested with Python 2.7.10 and 3.6.0:

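    try:
        from html import unescape  # Python 3.4+
    except ImportError:
        from HTMLParser import HTMLParser  # Python 2
        unescape = HTMLParser().unescape
    
    import re
    from contextlib import closing
    
    import requests
    
    CHUNKSIZE = 1024
    # Capture the text between <title ...> and </title>, across line breaks.
    retitle = re.compile(r"<title[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)
    buffer = ""
    with closing(requests.get("http://example.com/abc", stream=True)) as res:
        # With no declared encoding, iter_content would yield bytes; assume UTF-8.
        if res.encoding is None:
            res.encoding = "utf-8"
        for chunk in res.iter_content(chunk_size=CHUNKSIZE, decode_unicode=True):
            # Search the accumulated text so a title split across chunks is found.
            buffer += chunk
            match = retitle.search(buffer)
            if match:
                print(unescape(match.group(1)))
                break  # stop reading; closing() releases the connection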
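
Since the question mentions thousands of sites, it may help to separate the chunk search from the network call and to cap how much of a page is read when no title turns up. The following is only a minimal sketch under stated assumptions: the names find_title, fetch_title and max_chars are hypothetical (they are not part of requests), the 64 KiB cap and the 10-second timeout are arbitrary choices, and UTF-8 is assumed when the server declares no charset.

```python
import re
from contextlib import closing
from html import unescape

# Same idea as the pattern above: capture the text between the title tags.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def find_title(chunks, max_chars=65536):
    """Return the first <title> text found in an iterable of text chunks.

    Stops consuming chunks as soon as the title is complete, and gives up
    (returning None) once more than max_chars characters are buffered.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        match = TITLE_RE.search(buffer)
        if match:
            return unescape(match.group(1)).strip()
        if len(buffer) > max_chars:
            break
    return None


def fetch_title(url, timeout=10, chunk_size=1024):
    """Hypothetical wrapper: stream a URL and stop reading at the title."""
    import requests  # imported here so find_title is usable without requests

    with closing(requests.get(url, stream=True, timeout=timeout)) as res:
        if res.encoding is None:
            res.encoding = "utf-8"  # assumption: fall back to UTF-8
        chunks = res.iter_content(chunk_size=chunk_size, decode_unicode=True)
        return find_title(chunks)
```

Because find_title accepts any iterable of strings, it can be checked offline with a hand-made list of chunks, for example find_title(["<ti", "tle>Hi</title>"]), before pointing fetch_title at real sites.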
    
