Scrape title by only downloading relevant part of webpage

深忆病人 2021-02-05 10:45

I would like to scrape just the title of a webpage using Python. I need to do this for thousands of sites so it has to be fast. I've seen previous questions like retrieving jus

6 answers

我在风中等你 (original poster)

2021-02-05 11:21
You can defer downloading the entire response body by enabling the stream mode of requests.

Requests 2.14.2 documentation - Advanced Usage

By default, when you make a request, the body of the response is downloaded immediately. You can override this behaviour and defer downloading the response body until you access the Response.content attribute with the stream parameter:

...

If you set stream to True when making a request, Requests cannot release the connection back to the pool unless you consume all the data or call Response.close. This can lead to inefficiency with connections. If you find yourself partially reading request bodies (or not reading them at all) while using stream=True, you should consider using contextlib.closing (documented here)

So, with this method, you can read the response chunk by chunk until you encounter the title tag. Since the library handles redirects for you, you'll be ready to go.

Here's a rough, error-prone example tested with Python 2.7.10 and 3.6.0:
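```
import re
from contextlib import closing

import requests

try:
    from html import unescape  # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser  # Python 2
    unescape = HTMLParser().unescape

CHUNKSIZE = 1024
# Capture the text between the opening and closing title tags
retitle = re.compile(r"<title[^>]*>(.*?)</title[^>]*>", re.IGNORECASE | re.DOTALL)
buffer = ""
with closing(requests.get("http://example.com/abc", stream=True)) as res:
    for chunk in res.iter_content(chunk_size=CHUNKSIZE, decode_unicode=True):
        # iter_content still yields bytes when no encoding could be detected
        if isinstance(chunk, bytes):
            chunk = chunk.decode("utf-8", "replace")
        buffer += chunk
        match = retitle.search(buffer)
        if match:
            # Decode HTML entities such as &amp; before printing
            print(unescape(match.group(1)))
            break
```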
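Since the question mentions thousands of sites, it may help to pull the chunk-scanning loop into a reusable function and stop reading when a page has no title at all. The following is only a sketch under that assumption: `find_title` and its `max_chars` cap are hypothetical names, not part of requests, and the function works on any iterable of text or bytes chunks, so it can be tried without a network connection.

```python
import re
from html import unescape

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title[^>]*>", re.IGNORECASE | re.DOTALL)


def find_title(chunks, max_chars=65536):
    """Return the first <title> text found in an iterable of chunks, or None.

    Stops consuming chunks as soon as a title is found, or once about
    max_chars characters (an arbitrary, hypothetical cap) are buffered.
    """
    buffer = ""
    for chunk in chunks:
        # Chunks may arrive as bytes when no encoding was detected
        if isinstance(chunk, bytes):
            chunk = chunk.decode("utf-8", "replace")
        buffer += chunk
        match = TITLE_RE.search(buffer)
        if match:
            return unescape(match.group(1)).strip()
        if len(buffer) >= max_chars:
            break  # give up instead of downloading the whole page
    return None
```

With requests, the chunks would come from the streamed response, for example `find_title(res.iter_content(chunk_size=1024, decode_unicode=True))` inside the `with closing(...)` block, so the connection is closed as soon as the function returns.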