Scrape title by only downloading relevant part of webpage

深忆病人 2021-02-05 10:45

I would like to scrape just the title of a webpage using Python. I need to do this for thousands of sites, so it has to be fast. I've seen previous questions like retrieving jus

6 Answers
  •  误落风尘
    2021-02-05 11:16

    Using urllib, you can set the Range header to request a specific range of bytes, but there are some caveats:

    • it is up to the server whether to honor the request (a server that ignores the Range header replies with status 200 and the full document instead of 206 and a partial one)
    • you are assuming that the data you are looking for lies within the requested range (however, you can make another request with a different Range header to get the next bytes, e.g. download the first 300 bytes and fetch another 300 only if the title is not in the first result; two requests of 300 bytes are still much cheaper than the whole document)
    • (edit) to avoid the title tag being split between two ranged requests, make the ranges overlap; see the 'range_header_overlapped' function in my example code

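      import urllib.request

      req = urllib.request.Request('http://www.python.org/')
      # Ask for the first bytes only; the range is inclusive, so 0-300 is 301 bytes.
      req.add_header('Range', 'bytes=%s-%s' % (0, 300))
      f = urllib.request.urlopen(req)

      # Just to verify that the server accepted our range:
      # Content-Range is None if the server ignored the Range header.
      content_range = f.headers.get('Content-Range')
      print(content_range)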
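
    The overlapping-range idea from the list above could be sketched roughly as follows. This is not the 'range_header_overlapped' function mentioned earlier (its body is not shown here); the names overlapped_ranges, find_title and fetch_title, and the chunk, overlap and limit defaults, are all hypothetical choices made for illustration.

```python
import re
import urllib.error
import urllib.request

# Matches the first <title>...</title> pair in raw bytes, case-insensitively.
TITLE_RE = re.compile(rb"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def overlapped_ranges(chunk=300, overlap=100, limit=3000):
    """Yield inclusive (start, end) byte offsets for successive Range requests.

    Each range re-reads the last `overlap` bytes of the previous one.
    All three parameters are illustrative assumptions, not tuned values.
    """
    if not 0 <= overlap < chunk:
        raise ValueError("overlap must be non-negative and smaller than chunk")
    start = 0
    while start < limit:
        yield start, start + chunk - 1
        start += chunk - overlap


def find_title(data):
    """Return the decoded title text found in `data` (bytes), or None."""
    match = TITLE_RE.search(data)
    if match is None:
        return None
    return match.group(1).strip().decode("utf-8", errors="replace")


def fetch_title(url, chunk=300, overlap=100, limit=3000):
    """Request overlapping byte ranges until a complete title tag is seen."""
    for start, end in overlapped_ranges(chunk, overlap, limit):
        req = urllib.request.Request(
            url, headers={"Range": "bytes=%d-%d" % (start, end)}
        )
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                body = resp.read()
                partial = resp.status == 206  # 200 means Range was ignored
        except urllib.error.HTTPError:
            # Any HTTP error ends the search, e.g. 416 when the range
            # starts past the end of the document.
            return None
        title = find_title(body)
        # Stop when the title is found, when the server sent the whole
        # document anyway, or when the document ended inside this chunk.
        if title is not None or not partial or len(body) < chunk:
            return title
    return None
```

    Calling fetch_title('http://www.python.org/') would issue the requests. Note that this sketch only finds a title whose whole tag fits inside one chunk and is no longer than the overlap, so both values may need to be larger for real pages.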
