Python Scrapy 301 redirects

I have a little problem in printing the redirected urls (new URLs after 301 redirection) when scraping a given website. My idea is to only print them and not scrape them. My cur

1 answer

清歌不尽

2021-02-06 13:25
To parse any responses that are not 200 you'd need to do one of these things:

Project-wide

You can set `HTTPERROR_ALLOWED_CODES = [301, 302, ...]` in your `settings.py` file. Or, if you want to allow all status codes, set `HTTPERROR_ALLOW_ALL = True` instead.
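As a rough sketch, the project-wide option could look like the fragment below. The `REDIRECT_ENABLED = False` line is my own addition and an assumption about what you want: as far as I can tell, the `HTTPERROR_*` settings only affect `HttpErrorMiddleware`, so the redirect middleware would otherwise still follow the 301 before your callback ever sees it.

```python
# settings.py (fragment, not a complete settings file)

# Let 301/302 responses through HttpErrorMiddleware to the spider callbacks.
HTTPERROR_ALLOWED_CODES = [301, 302]

# Alternative: let every status code through.
# HTTPERROR_ALLOW_ALL = True

# Assumption: you want to see the 3xx response itself rather than have
# Scrapy follow it, so the redirect middleware is switched off project-wide.
REDIRECT_ENABLED = False
```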

Spider-wide

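Add a `handle_httpstatus_list` attribute to your spider. In your case something like:
```
import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"  # placeholder; every spider needs a name
    # hand 301 responses to the callbacks instead of following them
    handle_httpstatus_list = [301]
    # or (note: the Scrapy docs describe this one as a Request.meta key)
    handle_httpstatus_all = True
```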
Request-wide

You can set these meta keys on individual requests: `handle_httpstatus_list = [301, 302, ...]` for specific codes, or `handle_httpstatus_all = True` for all codes:
```
scrapy.Request('http://url.com', meta={'handle_httpstatus_list': [301]})
```
To learn more, see the HttpErrorMiddleware documentation.