Python Scrapy 301 redirects

一生所求 2021-02-06 13:24

I have a small problem printing the redirected URLs (the new URLs after a 301 redirect) when scraping a given website. My idea is only to print them, not to scrape them. My cur

1 answer
  • 2021-02-06 13:25

    To handle responses whose status is not in the 200 range in your callbacks, you need to do one of the following:

    Project-wide

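    Set HTTPERROR_ALLOWED_CODES = [301, 302, ...] in your settings.py file, or set HTTPERROR_ALLOW_ALL = True to allow every status code. Note that these settings only affect HttpErrorMiddleware; redirects are normally followed by RedirectMiddleware before the spider sees them, so for 3xx codes you will likely also need REDIRECT_ENABLED = False.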
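As a rough sketch, the project-wide variant might look like this in settings.py. The specific codes are only an example, and the REDIRECT_ENABLED line is an assumption that goes beyond the answer: RedirectMiddleware normally follows 3xx responses before a callback sees them, and as far as I can tell it does not consult HTTPERROR_ALLOWED_CODES.

```python
# settings.py (sketch; adjust the codes to your needs)

# Let these non-2xx statuses pass HttpErrorMiddleware and reach the callbacks.
HTTPERROR_ALLOWED_CODES = [301, 302]

# Alternative: allow every status code instead of listing them.
# HTTPERROR_ALLOW_ALL = True

# Assumption: stop Scrapy from following redirects automatically,
# so the 301/302 response itself is delivered to the spider.
REDIRECT_ENABLED = False
```

Disabling redirects project-wide affects every spider in the project, so the spider-wide or request-wide options may be the safer choice if only one crawl needs this.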

    Spider-wide

    Add a handle_httpstatus_list attribute to your spider class. In your case, something like:

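    import scrapy

    class MySpider(scrapy.Spider):
        name = 'my_spider'  # placeholder name
        # deliver 301 responses to the callbacks instead of following them
        handle_httpstatus_list = [301]
        # to allow every status code, use the handle_httpstatus_all meta key
        # on the request instead (see "Request-wide")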
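Once the 301 response reaches your callback, the new URL is in its Location header. The helper below is a minimal sketch of turning that header into an absolute URL to print. The name redirect_target and the set of codes are made up for illustration, it uses only the standard library, and the latin-1 decoding is an assumption about how the header bytes should be read.

```python
from urllib.parse import urljoin

# Status codes treated as redirects in this sketch.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_target(request_url, status, location):
    """Return the absolute redirect URL, or None if this is not a redirect."""
    if status not in REDIRECT_CODES or not location:
        return None
    if isinstance(location, bytes):
        # Scrapy returns header values as bytes, so decode before joining.
        location = location.decode("latin-1")
    # The Location header may be relative; resolve it against the request URL.
    return urljoin(request_url, location)


print(redirect_target("http://url.com/old", 301, b"/new"))  # http://url.com/new
```

Inside a Scrapy callback this could be called as redirect_target(response.url, response.status, response.headers.get('Location')), printing the result and yielding nothing so the new URL is not scraped.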
    

    Request-wide

    Set one of these keys in the request's meta: handle_httpstatus_list = [301, 302, ...] for specific codes, or handle_httpstatus_all = True for every code:

    scrapy.Request('http://url.com', meta={'handle_httpstatus_list': [301]})
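When more than one of the three levels is set, it helps to know which one wins. The function below is a simplified model of that precedence, based on my reading of HttpErrorMiddleware. It is not Scrapy's actual code, and the function and parameter names are hypothetical.

```python
def passes_http_error_filter(status, meta=None, spider_list=None,
                             allowed_codes=(), allow_all=False):
    """Simplified model: would a response with this status reach the callback?"""
    meta = meta or {}
    if 200 <= status < 300:                 # successful responses always pass
        return True
    if meta.get("handle_httpstatus_all"):   # request-wide "allow everything"
        return True
    if "handle_httpstatus_list" in meta:    # request-wide list takes priority
        return status in meta["handle_httpstatus_list"]
    if allow_all:                           # HTTPERROR_ALLOW_ALL setting
        return True
    if spider_list is not None:             # spider attribute replaces the setting
        return status in spider_list
    return status in allowed_codes          # HTTPERROR_ALLOWED_CODES setting


print(passes_http_error_filter(301))                                             # False
print(passes_http_error_filter(301, meta={"handle_httpstatus_list": [301]}))     # True
print(passes_http_error_filter(302, spider_list=[301], allowed_codes=[302]))     # False
```

The last call illustrates that, in this model, a spider-level list replaces the project-wide list rather than being merged with it.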
    

    To learn more, see the HttpErrorMiddleware documentation.
