I am using the Python.org build of Python 2.7 (64-bit) on Windows Vista 64-bit. I have been testing the following Scrapy code to recursively scrape all the pages at the site www.whoscored
I do not know if this is still relevant, but I had to put the following lines in the settings.py file:
HTTPERROR_ALLOWED_CODES = [404]
# USER_AGENT = 'quotesbot (+http://www.yourdomain.com)'  # overridden by the next line
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"
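If a single fixed USER_AGENT is not enough, one option is to choose a user agent per request in a downloader middleware. This is only a minimal sketch: the class name, the pool contents, and the module path in the comment are my own assumptions, not something from the site or from your project.

```python
import random

# Hypothetical pool of browser user agents; the first entry is the one
# from the settings above, the second is just an illustrative extra.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0",
]


class RandomUserAgentMiddleware(object):
    """Downloader middleware that sets a random User-Agent on each request."""

    def process_request(self, request, spider):
        # Overwrite the header with a randomly chosen entry from the pool.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None lets Scrapy continue handling the request as usual.
        return None


# To enable it, settings.py would need something like the following
# ("myproject.middlewares" is an assumed module path, 400 an arbitrary priority):
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.RandomUserAgentMiddleware": 400,
# }
```

Whether this helps depends on how the site detects bots, so treat it as something to try rather than a guaranteed fix.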
Hope it helps.
HTTP Status Code 403 definitely means Forbidden / Access Denied.
HTTP Status Code 302 is a redirect. There is no need to worry about those responses.
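If you want to double-check what a status code in the crawl log means, the standard library can tell you. A small sketch, assuming Python 3 (the `http.HTTPStatus` enum is not available in Python 2.7):

```python
from http import HTTPStatus  # Python 3 standard library

for code in (302, 403):
    status = HTTPStatus(code)
    # Codes in the 3xx range are redirects, codes in the 4xx range are client errors.
    kind = "redirect" if 300 <= code < 400 else "client error"
    print(code, status.phrase, "(" + kind + ")")

# prints:
# 302 Found (redirect)
# 403 Forbidden (client error)
```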
Nothing seems to be wrong in your code.
Yes, it's definitely an anti-scraping measure implemented by the site.
Refer to these guidelines from the Scrapy docs: Avoid Getting Banned
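Some of those guidelines map directly onto Scrapy settings. A possible starting point for settings.py is sketched below; the setting names are real Scrapy settings, but the values are only illustrative guesses and would need tuning for the site.

```python
# settings.py: illustrative values only, tune them for the target site.
DOWNLOAD_DELAY = 2                  # wait about 2 seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True     # vary the wait between 0.5x and 1.5x DOWNLOAD_DELAY
COOKIES_ENABLED = False             # some sites use cookies to spot bot behaviour
AUTOTHROTTLE_ENABLED = True         # let Scrapy adapt the delay to the server's response times
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # do not hit the site with parallel requests
```

None of this guarantees the 403 responses go away; it only makes the crawler look less aggressive.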
Also, you should consider pausing and resuming crawls.
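Pausing and resuming is enabled by giving the crawl a job directory through the JOBDIR setting. A minimal sketch, where the directory name is just a hypothetical example:

```python
# settings.py
# Scrapy stores the crawl state (pending requests, seen URLs) in this directory.
JOBDIR = "crawls/whoscored-1"  # hypothetical directory name
```

With this set, you can stop the spider gracefully (press Ctrl-C once and wait for it to finish) and resume later by running the same crawl command again. The same setting can also be passed on the command line with `-s JOBDIR=...` instead of putting it in settings.py.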