I have a small problem printing the redirected URLs (the new URLs after a 301 redirect) when scraping a given website. My idea is to only print them, not scrape them. My cur
To parse any responses that are not 200 you'd need to do one of these things:
You can set HTTPERROR_ALLOWED_CODES = [301, 302, ...] in the settings.py file. Or, if you want to enable it for all codes, you can set HTTPERROR_ALLOW_ALL = True instead.
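Add the handle_httpstatus_list attribute to your spider. In your case, something like:
import scrapy

class MySpider(scrapy.Spider):
    # let 301 responses reach the spider callbacks
    handle_httpstatus_list = [301]
    # to allow every status code, use HTTPERROR_ALLOW_ALL or the
    # handle_httpstatus_all meta key; Scrapy documents
    # handle_httpstatus_all only as a Request.meta key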
You can set these meta keys on your requests: handle_httpstatus_list = [301, 302, ...], or handle_httpstatus_all = True for all codes:
scrapy.Request('http://url.com', meta={'handle_httpstatus_list': [301]})
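For the settings-based option, a settings.py fragment might look like the sketch below. One caveat, as far as I can tell: the HTTPERROR_* settings only control HttpErrorMiddleware, so RedirectMiddleware may still follow the 301 before the spider ever sees it. Turning redirects off with REDIRECT_ENABLED is my addition here, not something the options above mention, so treat it as an assumption to verify against your Scrapy version.

```python
# settings.py (sketch)

# Let these non-2xx responses through HttpErrorMiddleware to the spider.
HTTPERROR_ALLOWED_CODES = [301, 302]

# Assumption: stop RedirectMiddleware from following redirects
# project-wide, so the 301/302 response itself reaches the callback.
REDIRECT_ENABLED = False
```

The spider attribute and the meta key are narrower in scope, so they may be preferable if only some spiders or requests should see redirects.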
To learn more, see HttpErrorMiddleware.
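Once a 301 response actually reaches your callback, the new URL is in its Location header. Here is a minimal sketch of pulling it out so you can print it without scraping it. The helper name redirect_target and the set of redirect status codes are my own choices for illustration, not part of Scrapy.

```python
from urllib.parse import urljoin

# Status codes treated as redirects in this sketch (an assumption).
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def redirect_target(status, location, request_url):
    """Return the absolute URL a redirect response points to, or None.

    `location` is the raw Location header value; Scrapy hands header
    values back as bytes, so both bytes and str are accepted.
    """
    if status not in REDIRECT_STATUSES or not location:
        return None
    if isinstance(location, bytes):
        location = location.decode("latin-1")
    # Location may be relative, so resolve it against the requested URL.
    return urljoin(request_url, location)
```

Inside a spider callback this could be called as redirect_target(response.status, response.headers.get('Location'), response.url), and the result printed or logged instead of being yielded as a new request.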