Scrapy: handle 301/302 response codes and also follow the target URL

Question

I am using Scrapy 1.0.5 to implement a crawler. I have set REDIRECT_ENABLED = False and handle_httpstatus_list = [500, 301, 302] so that I can scrape pages that return 301 and 302 responses. However, because REDIRECT_ENABLED is False, the spider doesn't go to the target URL given in the Location response header. How can I achieve this?

Answer 1:

It is a long time since I did anything like this, but you need to generate a Request object with url, meta and callback parameters.

But I seem to recall you can do it along the lines of:

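try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2, as used by Scrapy 1.0.5

from scrapy import Request

def parse(self, response):
    # do whatever you need to do .... then
    if response.status in [301, 302] and 'Location' in response.headers:
        # header values are bytes, so decode before joining
        location = response.headers['Location'].decode('latin-1')
        # urljoin resolves a relative Location against the page URL and
        # returns an absolute Location unchanged, so no separate test is needed
        newurl = urljoin(response.url, location)
        yield Request(url=newurl, meta=response.meta, callback=self.parse_whatever)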
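
The step that resolves the Location header can be tried on its own, without Scrapy. The following is only a minimal sketch using the standard library; the helper name resolve_redirect and the example.com URLs are made up for illustration and are not part of Scrapy or of the original answer. It assumes Python 3 and that the header value may arrive as bytes.

```python
from urllib.parse import urljoin


def resolve_redirect(base_url, location):
    """Return the absolute redirect target for a Location header value.

    base_url is the URL of the page that answered with 301/302;
    location is the raw Location value, as str or bytes.
    (Hypothetical helper, for illustration only.)
    """
    if isinstance(location, bytes):
        # Header values are often bytes; latin-1 decodes any byte sequence.
        location = location.decode("latin-1")
    # urljoin keeps an absolute URL as-is and resolves a relative one
    # against base_url.
    return urljoin(base_url, location)


if __name__ == "__main__":
    print(resolve_redirect("http://example.com/a/b", b"/c"))
    print(resolve_redirect("http://example.com/a/b", b"http://other.example.org/x"))
```

Inside the spider callback, the equivalent call would be resolve_redirect(response.url, response.headers['Location']), with the result passed as the url of the new Request.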

Source: https://stackoverflow.com/questions/36124429/scrapy-handle-301-302-response-code-as-well-as-follow-the-target-url

Tags

web-scraping

scrapy

scrapy-spider
