HTTP 403 Responses when using Python Scrapy

后悔当初 2020-12-31 14:24

I am using Python.org version 2.7 64 bit on Windows Vista 64 bit. I have been testing the following Scrapy code to recursively scrape all the pages at the site www.whoscored

2 Answers
  • 2020-12-31 14:44

    I do not know if this is still relevant, but I had to put the following lines in the settings.py file:

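    # Let 404 responses reach the spider callbacks instead of being filtered out
    HTTPERROR_ALLOWED_CODES = [404]
    # Project-template default; it was overridden by the browser-like string below
    # USER_AGENT = 'quotesbot (+http://www.yourdomain.com)'
    USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"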
    

    hope it helps.

  • 2020-12-31 14:49

    HTTP Status Code 403 definitely means Forbidden / Access Denied.
    HTTP Status Code 302 is for redirection of requests. No need to worry about them.
    Nothing seems to be wrong in your code.
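
    To make the difference between the two codes concrete, here is a minimal sketch using only the standard library. It assumes Python 3 (the question uses 2.7, where the `http.HTTPStatus` enum is not available), and `describe_status` is a hypothetical helper name, not part of Scrapy.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label such as '403 Forbidden (client error)'."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    if 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif code >= 500:
        kind = "server error"
    else:
        kind = "ok/informational"
    return f"{status.value} {status.phrase} ({kind})"


# The two codes discussed in this answer
for code in (302, 403):
    print(describe_status(code))
```

    The 302 is just the server pointing the client somewhere else, while the 403 is the server refusing the request outright.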

    Yes, it's definitely an anti-scraping measure implemented by the site.

    Refer to these guidelines from the Scrapy docs: Avoid Getting Banned

    Also, you should consider pausing and resuming crawls.
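
    As a rough sketch of what those two suggestions could look like in settings.py: the setting names below are real Scrapy settings, but every value is an illustrative assumption and not tuned for any particular site, and the `JOBDIR` path is a hypothetical directory name.

```python
# settings.py (sketch; all values are assumptions for illustration)

# Slow down: wait about 2 seconds between requests to the same site,
# with Scrapy randomizing the exact delay around that value.
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True

# Let Scrapy adapt the delay to how fast the server responds.
AUTOTHROTTLE_ENABLED = True

# Some sites use cookies to spot bot behaviour; disabling them may help.
COOKIES_ENABLED = False

# Persist scheduler state so a crawl can be stopped and resumed later.
# "crawls/whoscored-1" is a hypothetical directory name.
JOBDIR = "crawls/whoscored-1"
```

    With `JOBDIR` set, stopping the crawl with a single Ctrl-C and starting it again with the same command should pick up where it left off. None of this guarantees the 403s go away; it only makes the crawler less aggressive.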
