Requests - get content-type/size without fetching the whole page/content


I have a simple website crawler. It works fine, but sometimes it gets stuck on large content such as ISO images, .exe files and other large files. Guessing content-type using


4 Answers

南笙

2021-02-07 13:03
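Sorry, my mistake, I should have read the documentation more carefully. Here is the answer: http://docs.python-requests.org/en/latest/user/advanced/#advanced (Body Content Workflow). With stream=True, only the response headers are fetched up front; the body is not downloaded until you access it, so you can inspect Content-Length first and drop the connection if the response is too large: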
```
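import requests

TOO_LONG = 10 * 1024 * 1024  # size limit in bytes (example value, pick your own)

tarball_url = 'https://github.com/kennethreitz/requests/tarball/master'
r = requests.get(tarball_url, stream=True)  # headers only, body is deferred
# Content-Length can be missing (e.g. chunked responses), so default to 0
if int(r.headers.get('content-length', 0)) > TOO_LONG:
    r.close()  # release the connection without reading the body
    # log request too long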
```
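Content-Length is not always present, so the check above can let a large body through. A minimal sketch of one way to cover that case is to read the stream in chunks and give up once a cap is exceeded. The names `read_capped` and `fetch_if_small`, the 5 MiB default and the 8192-byte chunk size are assumptions made for illustration, not part of the original answer.

```python
def read_capped(chunks, max_bytes):
    """Join byte chunks, returning None as soon as max_bytes is exceeded."""
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
        if len(buf) > max_bytes:
            return None  # too large, stop reading
    return bytes(buf)


def fetch_if_small(url, max_bytes=5 * 1024 * 1024, timeout=10):
    """Return the body of url, or None if it is larger than max_bytes.

    Hypothetical helper: the limit and timeout defaults are arbitrary.
    """
    import requests  # imported lazily so read_capped works without requests

    with requests.get(url, stream=True, timeout=timeout) as r:
        declared = r.headers.get("Content-Length")
        if declared is not None and declared.isdigit() and int(declared) > max_bytes:
            return None  # the server announced a body that is too large
        # Header missing or small: read in chunks and stop at the cap.
        return read_capped(r.iter_content(chunk_size=8192), max_bytes)
```

Leaving the `with` block closes the response, so an abandoned download does not keep the connection open.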
孤独总比滥情好

2021-02-07 13:04
Use requests.head() for this. A HEAD response carries no message body, so it is the right method when you are only interested in the headers.
```
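import requests

some_link = 'https://example.com/'  # placeholder URL, replace with the page to check
h = requests.head(some_link)
header = h.headers
content_type = header.get('content-type')  # None if the server omits the header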
```
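The Content-Type value often carries parameters such as `; charset=utf-8`, so comparing it directly against `'text/html'` can fail. A small sketch of how a crawler might normalise the value before deciding whether to fetch the page follows; `media_type`, `is_crawlable` and the `CRAWLABLE_TYPES` whitelist are hypothetical names chosen for this example.

```python
# Assumed whitelist of types the crawler wants to download; adjust as needed.
CRAWLABLE_TYPES = {"text/html", "application/xhtml+xml"}


def media_type(content_type):
    """Return the bare, lower-cased media type from a Content-Type value."""
    if not content_type:
        return ""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    return content_type.split(";", 1)[0].strip().lower()


def is_crawlable(content_type):
    """True if the header value names a type in CRAWLABLE_TYPES."""
    return media_type(content_type) in CRAWLABLE_TYPES
```

With the snippet above, `is_crawlable(h.headers.get('content-type'))` would skip ISO images and executables, and it would also skip responses that send no Content-Type at all.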
不思量自难忘°

2021-02-07 13:06
requests.head() does NOT follow redirects by default. If the URL redirects, you get the headers of the redirect response itself, so Content-Length will typically be 0 or missing. Make sure to pass allow_redirects=True:
```
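import requests

r = requests.head(url, allow_redirects=True)  # follow redirects to the final resource
length = r.headers.get('Content-Length')  # a string, or None if the header is absent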
```
See "Redirection and History" in the Requests documentation.
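Header values arrive as strings and may be absent or malformed, so it can help to convert Content-Length defensively before comparing it with a limit. The sketch below is one way to do that; `content_length` and `head_size` are hypothetical helper names, and the 10-second timeout is an arbitrary choice.

```python
def content_length(headers):
    """Return Content-Length as an int, or None if missing or malformed."""
    # Compare names case-insensitively so plain dicts work as well as
    # the case-insensitive mapping that requests returns.
    for name, value in headers.items():
        if name.lower() == "content-length":
            value = value.strip()
            return int(value) if value.isdigit() else None
    return None


def head_size(url, timeout=10):
    """Return (final_url, size_or_None) for url, following redirects."""
    import requests  # imported lazily so content_length works without requests

    r = requests.head(url, allow_redirects=True, timeout=timeout)
    # r.url is the URL after redirects; r.history lists the redirect hops.
    return r.url, content_length(r.headers)
```

A result of `None` means the size is unknown, which a crawler should treat differently from a known size of 0.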
Happy的楠姐

2021-02-07 13:16
Yes.

You can use the Session.head method to create HEAD requests:
```
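# session is a requests.Session(); the timeout and headers come from the surrounding crawler code
response = session.head(url, timeout=self.pageOpenTimeout, headers=customHeaders)
contentType = response.headers.get('content-type')  # None if the header is missing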
```
A HEAD request is similar to a GET request, except that the server does not send the message body.

Here is a quote from Wikipedia:

HEAD Asks for the response identical to the one that would correspond to a GET request, but without the response body. This is useful for retrieving meta-information written in response headers, without having to transport the entire content.
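Some servers do not support HEAD and answer with 405 (Method Not Allowed) or 501 (Not Implemented). As a rough sketch, assuming that falling back to a streamed GET is acceptable for the crawler, the headers can still be obtained without downloading the body. The function name `probe_headers` and the default timeout are assumptions for illustration.

```python
def probe_headers(session, url, timeout=10):
    """Return the response headers for url without downloading the body.

    session is expected to be a requests.Session (or any object with
    compatible head() and get() methods).
    """
    r = session.head(url, allow_redirects=True, timeout=timeout)
    if r.status_code in (405, 501):  # HEAD not allowed / not implemented
        # stream=True defers the body, so only the headers are transferred.
        r = session.get(url, stream=True, timeout=timeout)
        r.close()  # drop the connection without reading the body
    return r.headers
```

Typical use would be `probe_headers(requests.Session(), url).get('content-type')`; reusing one session across many URLs also lets requests reuse connections.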