Question
I would like to use chromedriver to scrape some stories from fanfiction.net. I tried the following:
from selenium import webdriver
import time
path = r'D:\chromedriver\chromedriver.exe'  # raw string so the backslashes are not read as escapes
browser = webdriver.Chrome(path)
url1 = 'https://www.fanfiction.net/s/8832472'
url2 = 'https://www.fanfiction.net/s/5218118'
browser.get(url1)
time.sleep(5)
browser.get(url2)
The first link opens (sometimes I have to wait 5 seconds). When I try to load the second URL, Cloudflare intervenes and asks me to solve captchas, which are not solvable; at least, Cloudflare does not recognize my solutions. This also happens if I enter the links manually in the chromedriver-controlled browser window (so in the GUI). However, if I do the same thing in normal Chrome, everything works fine (I do not even get the waiting period on the first link), even in private mode with all cookies deleted. I could reproduce this on several machines.
Now my question: my intuition was that chromedriver is just the normal Chrome browser made controllable. What is the difference from normal Chrome, how does Cloudflare distinguish the two, and how can I mask my chromedriver as normal Chrome? (I do not intend to load many pages in a very short time, so it should not look like a bot.) I hope my question is clear.
Answer 1:
This error message...
...implies that Cloudflare has detected your requests to the website as coming from an automated bot and is subsequently denying you access to the application.
Solution
In these cases, a potential solution would be to use undetected-chromedriver to initialize the Chrome Browsing Context.
undetected-chromedriver is an optimized Selenium Chromedriver patch which does not trigger anti-bot services like Distill Network / Imperva / DataDome / Botprotect.io. It automatically downloads the driver binary and patches it.
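One signal that can differ between a driver-controlled browser and a regular one is the `navigator.webdriver` property, which the W3C WebDriver specification defines as true while the browser is under automation. Whether Cloudflare relies on this particular flag is an assumption here, not something confirmed in this thread; the sketch below only shows how you might check what your own session reports. The helper name `reports_automation` is hypothetical, and `driver` is assumed to be any object exposing Selenium's `execute_script` method.

```python
def reports_automation(driver) -> bool:
    """Return True if page JavaScript sees navigator.webdriver as truthy.

    `driver` is expected to be a Selenium WebDriver (or anything exposing
    execute_script). The helper name is hypothetical, not part of Selenium.
    """
    # navigator.webdriver may be true, false, or undefined (returned as None).
    flag = driver.execute_script("return navigator.webdriver")
    return bool(flag)
```

Calling `reports_automation(driver)` after a page has loaded in a stock chromedriver session would typically return `True`. Comparing that with a patched session is a quick sanity check, not a guarantee of passing any bot detection.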
Code Block:
import undetected_chromedriver as uc
from selenium import webdriver
import time
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
driver = uc.Chrome(options=options)
url1 = 'https://www.fanfiction.net/s/8832472'
url2 = 'https://www.fanfiction.net/s/5218118'
driver.get(url1)
time.sleep(5)
driver.get(url2)
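The fixed `time.sleep(5)` between the two page loads above can be generalized to any number of story URLs. The following is a minimal sketch, assuming you pass in `driver.get` as the loader; `visit_with_pauses`, `load`, `pause_seconds`, and `sleep` are hypothetical names introduced for illustration, not part of Selenium or undetected-chromedriver.

```python
import time
from typing import Callable, Iterable, List


def visit_with_pauses(
    load: Callable[[str], object],
    urls: Iterable[str],
    pause_seconds: float = 5.0,
    sleep: Callable[[float], object] = time.sleep,
) -> List[str]:
    """Load each URL in order, pausing between consecutive loads.

    Returns the list of URLs that were loaded, in order.
    """
    visited: List[str] = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every load except the first.
            sleep(pause_seconds)
        load(url)
        visited.append(url)
    return visited
```

With the driver from the code block above, this would be called as `visit_with_pauses(driver.get, [url1, url2])`. The `sleep` parameter exists only so the pause can be swapped out when testing.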
References
You can find a couple of relevant detailed discussions in:
- Selenium app redirect to Cloudflare page when hosted on Heroku
- How to bypass being rate limited ..HTML Error 1015 using Python
Source: https://stackoverflow.com/questions/65636102/cloudflare-and-chromedriver-cloudflare-distinguishes-between-chromedriver-and