Question
I am using Scrapy 1.0.5 to implement a crawler. I have set REDIRECT_ENABLED = False and handle_httpstatus_list = [500, 301, 302] so that I can scrape pages that return 301 and 302 responses. However, because REDIRECT_ENABLED is set to False, the spider does not go on to the target URL given in the Location response header. How can I achieve this?
Answer 1:
It is a long time since I did anything like this, but you need to generate a Request object with url, meta and callback parameters. I seem to recall you can do it along these lines:
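from urllib.parse import urljoin  # on Python 2: from urlparse import urljoin
from scrapy import Request

def parse(self, response):
    # do whatever you need to do .... then
    if response.status in [301, 302] and 'Location' in response.headers:
        # Header values may be bytes, so decode before joining
        location = response.headers['Location'].decode('latin-1')
        # urljoin resolves a relative Location against the current URL
        # and returns an absolute Location unchanged
        newurl = urljoin(response.url, location)
        yield Request(url=newurl, meta=response.meta, callback=self.parse_whatever)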
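
The URL-resolution step can be checked on its own, without running a spider. The following is only a minimal sketch using the standard library: redirect_target is a hypothetical helper name (not part of Scrapy), and the example.com URLs are placeholders. It assumes the Location value may arrive as either bytes or text.

```python
from urllib.parse import urljoin

# Status codes treated as redirects here (assumption: same pair as in the question)
REDIRECT_STATUSES = {301, 302}


def redirect_target(base_url, status, location):
    """Return the absolute URL to follow, or None if there is nothing to follow.

    Hypothetical helper for illustration; it is not a Scrapy API.
    """
    if status not in REDIRECT_STATUSES or not location:
        return None
    # Header values are often bytes; decode so urljoin gets text on both sides
    if isinstance(location, bytes):
        location = location.decode("latin-1")
    # Relative values are resolved against base_url; absolute ones pass through
    return urljoin(base_url, location)
```

Inside a callback this would be called with response.url, response.status and response.headers.get('Location'), and a new Request would be yielded only when the result is not None.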
Source: https://stackoverflow.com/questions/36124429/scrapy-handle-301-302-response-code-as-well-as-follow-the-target-url