Scrape title by only downloading relevant part of webpage


深忆病人 2021-02-05 10:45

I would like to scrape just the title of a webpage using Python. I need to do this for thousands of sites so it has to be fast. I've seen previous questions like retrieving jus

6 answers

误落风尘 (OP)

2021-02-05 11:16
Using urllib, you can set the Range header to request a specific range of bytes, but there are some caveats:
- it is up to the server whether to honor the request
- you assume that the data you're looking for is within the requested range (however, you can make another request with a different Range header to get the next bytes, i.e. download the first 300 bytes and fetch another 300 only if you can't find the title in the first result; 2 requests of 300 bytes are still much cheaper than the whole document)
- (edit) to avoid the title tag being split between two ranged requests, make your ranges overlap; see the 'range_header_overlapped' function in my example code
  
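  import urllib.request
  
  req = urllib.request.Request('http://www.python.org/')
  
  # Ask only for the first bytes of the document (byte ranges are inclusive)
  req.add_header('Range', 'bytes=%s-%s' % (0, 300))
  
  f = urllib.request.urlopen(req)
  
  To verify that the server accepted our range:
  
  content_range = f.headers.get('Content-Range')
  
  # None means the server ignored the Range header and sent the whole document
  print(content_range)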
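
The overlapping-range idea from the list above could be sketched roughly as follows. This is not the 'range_header_overlapped' function mentioned in the answer (that code is not shown here); the names `overlapped_ranges`, `find_title` and `fetch_title`, and the default sizes, are hypothetical. The sketch searches each chunk on its own, so it assumes the whole `<title>...</title>` element is no longer than `overlap` bytes; under that assumption a title that straddles a chunk boundary still appears complete in the next chunk.

```python
import re
import urllib.error
import urllib.request

# Matches a complete <title> element in raw bytes, ignoring case and newlines
TITLE_RE = re.compile(rb'<title[^>]*>(.*?)</title>', re.IGNORECASE | re.DOTALL)


def overlapped_ranges(chunk=300, overlap=100, limit=3000):
    """Yield inclusive (start, end) byte ranges of `chunk` bytes each.

    Every range re-reads the last `overlap` bytes of the previous one.
    No range starts at or beyond `limit`.
    """
    if not 0 <= overlap < chunk:
        raise ValueError('overlap must be >= 0 and smaller than chunk')
    start = 0
    while start < limit:
        end = start + chunk - 1
        yield start, end
        start = end + 1 - overlap


def find_title(data):
    """Return the title text found in `data` (bytes), or None."""
    m = TITLE_RE.search(data)
    return m.group(1).decode('utf-8', 'replace').strip() if m else None


def fetch_title(url, chunk=300, overlap=100, limit=3000):
    """Request overlapping byte ranges until a complete title is seen."""
    for start, end in overlapped_ranges(chunk, overlap, limit):
        req = urllib.request.Request(url)
        req.add_header('Range', 'bytes=%d-%d' % (start, end))
        try:
            with urllib.request.urlopen(req, timeout=10) as f:
                partial = (f.status == 206)  # 206 = server honored the range
                data = f.read()
        except urllib.error.HTTPError as e:
            if e.code == 416:  # range starts past the end of the document
                return None
            raise
        title = find_title(data)
        # Stop on success, when the server sent the full document anyway,
        # or when a short read shows we reached the end of the document.
        if title is not None or not partial or len(data) < chunk:
            return title
    return None
```

Calling `fetch_title('http://www.python.org/')` would then issue requests for bytes 0-299, 200-499, 400-699 and so on until a title turns up. Whether this actually saves bandwidth depends on the server honoring Range; if it replies with status 200, the whole document has been downloaded regardless.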