Cloudflare and Chromedriver - does Cloudflare distinguish between chromedriver and genuine Chrome?

Submitted by 别来无恙 on 2021-02-10 04:28:08

Question


I would like to use chromedriver to scrape some stories from fanfiction.net. I try the following:

from selenium import webdriver
import time

path = r'D:\chromedriver\chromedriver.exe'  # raw string keeps the backslashes literal

browser = webdriver.Chrome(path)
url1 = 'https://www.fanfiction.net/s/8832472'
url2 = 'https://www.fanfiction.net/s/5218118'

browser.get(url1)
time.sleep(5)
browser.get(url2)

The first link opens (sometimes I have to wait 5 seconds). When I try to load the second URL, Cloudflare intervenes and asks me to solve captchas, which cannot be solved; at least Cloudflare does not recognize them as solved. This also happens if I enter the links manually in the chromedriver-controlled window (so in the GUI). However, if I do the same in normal Chrome, everything works fine (I do not even get the waiting period on the first link), even in private mode with all cookies deleted. I could reproduce this on several machines.

Now my question: my intuition was that chromedriver is just the normal Chrome browser that can be controlled programmatically. What is the difference from normal Chrome, how does Cloudflare distinguish the two, and how can I disguise my chromedriver as normal Chrome? (I do not intend to load many pages in a very short time, so it should not look like a bot.) I hope my question is clear.


Answer 1:


The captcha challenge you describe implies that Cloudflare has detected your requests to the website as coming from an automated bot and is subsequently denying you access to the application.
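
To make the idea of detection concrete: one publicly documented automation signal is the `navigator.webdriver` property, which the W3C WebDriver specification asks browsers to set to true while they are under remote control. Whether Cloudflare relies on this particular flag is an assumption here; real bot detection probably combines many signals. A minimal sketch for inspecting it, where `reports_automation` is a hypothetical helper name and `driver` is assumed to be any object exposing Selenium's `execute_script` method:

```python
def reports_automation(driver):
    """Return True if page JavaScript sees navigator.webdriver as true.

    `driver` is assumed to be a Selenium WebDriver (or any object with a
    compatible execute_script method). This inspects a single signal only;
    it says nothing about what a given anti-bot service actually checks.
    """
    flag = driver.execute_script("return navigator.webdriver")
    # A browser that is not automated is expected to report false or
    # undefined (undefined arrives in Python as None).
    return flag is True
```

With a real session, passing the object returned by `webdriver.Chrome(...)` would be expected to give `True`, while the same script typed into the console of a manually started Chrome should not.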


Solution

In these cases, a potential solution is to use undetected-chromedriver to initialize the Chrome browsing context.

undetected-chromedriver is an optimized Selenium Chromedriver patch which does not trigger anti-bot services like Distill Network / Imperva / DataDome / Botprotect.io. It automatically downloads the driver binary and patches it.

  • Code Block:

    import time

    import undetected_chromedriver as uc

    # Use the options class that ships with undetected-chromedriver
    options = uc.ChromeOptions()
    options.add_argument("start-maximized")
    # uc.Chrome downloads and patches a chromedriver binary, then starts Chrome
    driver = uc.Chrome(options=options)
    url1 = 'https://www.fanfiction.net/s/8832472'
    url2 = 'https://www.fanfiction.net/s/5218118'
    driver.get(url1)
    time.sleep(5)  # pause between page loads, as in the question
    driver.get(url2)
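
Since the stated goal is to load only a few pages at a gentle pace, the fixed sleep can be wrapped in a small helper. This is a sketch under assumptions: `visit_politely` is a hypothetical name, the 5 to 8 second range is an arbitrary choice, and pacing alone is not claimed to prevent a Cloudflare challenge.

```python
import random
import time


def visit_politely(driver, urls, min_delay=5.0, max_delay=8.0):
    """Load each URL in turn, pausing a random interval between loads.

    `driver` is assumed to be a Selenium-style object with a get() method
    and a page_source attribute. Returns a dict mapping URL -> HTML.
    """
    if min_delay > max_delay:
        raise ValueError("min_delay must not exceed max_delay")
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Sleep only between loads, not before the first one.
            time.sleep(random.uniform(min_delay, max_delay))
        driver.get(url)
        pages[url] = driver.page_source
    return pages
```

For example, calling `visit_politely(driver, [url1, url2])` with a `driver` created by `uc.Chrome(...)` would return the HTML of both stories keyed by URL.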
    

References

You can find a couple of relevant detailed discussions in:

  • Selenium app redirect to Cloudflare page when hosted on Heroku
  • How to bypass being rate limited ..HTML Error 1015 using Python


Source: https://stackoverflow.com/questions/65636102/cloudflare-and-chromedriver-cloudflare-distinguishes-between-chromedriver-and
