How to bypass cloudflare bot/ddos protection in Scrapy?

Anonymous (unverified), submitted 2019-12-03 01:48:02

Question:

I used to occasionally scrape an e-commerce site to get product price information. I had not used the scraper, which is built with Scrapy, in a while, and when I tried to use it yesterday I ran into a problem with bot protection.

The site is using CloudFlare's DDoS protection, which basically uses JavaScript evaluation to filter out browsers (and therefore scrapers) with JS disabled. Once the function is evaluated, a response containing the calculated number is generated. In return, the service sends back two authentication cookies which, when attached to each request, allow the site to be crawled normally. Here's the description of how it works.

I have also found a cloudflare-scrape Python module that uses an external JS evaluation engine to calculate the number and send the request back to the server. I'm not sure how to integrate it into Scrapy, though. Or maybe there's a smarter way without using JS execution? In the end, it's a form...

I'd appreciate any help.

Answer 1:

So I executed the JavaScript from Python with the help of cloudflare-scrape.

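You need to add the following code to your scraper:

    # At the top of the spider module:
    import cfscrape
    from scrapy import Request

    # Inside the spider class:
    def start_requests(self):
        cf_requests = []
        for url in self.start_urls:
            # The user agent argument is optional.
            token, agent = cfscrape.get_tokens(url, 'Your preferred user agent')
            # Send every cookie returned (the question mentions two),
            # with the same user agent that obtained them.
            cf_requests.append(Request(url=url,
                                       cookies=token,
                                       headers={'User-Agent': agent}))
        return cf_requests

alongside your parsing functions. And that's it!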

Of course, you need to install cloudflare-scrape first and import it into your spider. You also need a JS execution engine installed. I already had Node.js, and it worked without complaints.
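If you want to check the cookie and header wiring without hitting a protected site, one option is to pull that logic into a small helper that receives the token function as an argument. This is only a sketch: `build_request_kwargs` is a hypothetical name, not part of Scrapy or cloudflare-scrape, and it assumes the token function behaves like `cfscrape.get_tokens` as used above (it takes a URL plus an optional user agent and returns a cookie dict and a user agent string).

```python
def build_request_kwargs(urls, get_tokens, user_agent=None):
    """Return one dict of Request keyword arguments per URL.

    `get_tokens` is assumed to behave like cfscrape.get_tokens:
    it returns a (cookies_dict, user_agent_string) pair.
    """
    kwargs_list = []
    for url in urls:
        if user_agent is None:
            tokens, agent = get_tokens(url)
        else:
            tokens, agent = get_tokens(url, user_agent)
        kwargs_list.append({
            "url": url,
            # Copy so later changes to `tokens` do not leak into the request.
            "cookies": dict(tokens),
            # Reuse the user agent that solved the challenge.
            "headers": {"User-Agent": agent},
        })
    return kwargs_list


# Hypothetical use inside a spider, assuming cfscrape and Scrapy are installed:
#
#     def start_requests(self):
#         for kwargs in build_request_kwargs(self.start_urls, cfscrape.get_tokens):
#             yield Request(**kwargs)
```

Passing the function in means you can substitute a stub in tests and keep the spider itself short.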



Answer 2:

If you can afford to sacrifice a little speed during scraping, you can combine Scrapy with Selenium to emulate real user interaction with the browser. I wrote a short tutorial about it here: http://www.6020peaks.com/2014/12/how-to-scrape-hidden-web-data-with-python.

It does not target your specific problem with CloudFlare, but it might help, since I had similar issues when loading data that required some JS execution.
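One way the two tools might be combined, sketched here and not taken from the linked tutorial: let a Selenium-driven browser load the page once, then hand its cookies to Scrapy. Selenium's `driver.get_cookies()` returns a list of dicts, while Scrapy's `Request` accepts a plain name-to-value dict, so a small conversion is needed. The helper name `selenium_cookies_to_dict` is hypothetical, and whether the site accepts the reused cookies is an assumption that depends on how it is protected.

```python
def selenium_cookies_to_dict(selenium_cookies):
    """Convert Selenium's list of cookie dicts into a {name: value} dict."""
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}


# Hypothetical use, assuming Selenium and a browser driver are installed:
#
#     from selenium import webdriver
#
#     driver = webdriver.Firefox()
#     driver.get(url)  # the browser runs the page's JavaScript
#     cookies = selenium_cookies_to_dict(driver.get_cookies())
#     agent = driver.execute_script("return navigator.userAgent")
#     driver.quit()
#     # then: Request(url=url, cookies=cookies, headers={'User-Agent': agent})
```

Sending the browser's own user agent along with the cookies mirrors what the cloudflare-scrape answer does.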



Answer 3:

Obviously the best way to do this would be to whitelist your IP in CloudFlare; if that isn't an option, let me recommend the cloudflare-scrape library. You can use it to get the cookie token, then send that cookie token with your Scrapy requests to the server.


