Website using DataDome gets captcha blocked while scraping using Selenium and Python

Submitted by 倾然丶 夕夏残阳落幕 on 2020-12-21 04:01:54

Question


I'm trying to scrape car data from different websites. I've been using Selenium with Chrome, but some websites (for example https://www.leboncoin.fr/) block Selenium with a captcha validation after just 1 or 2 requests. I tried changing the $cdc_ variable in chromedriver, but this didn't resolve the problem. These are the options I've been using for the browser:

from selenium import webdriver

# Present a regular desktop Chrome user agent
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
options = webdriver.ChromeOptions()
options.add_argument(f'user-agent={user_agent}')
options.add_argument('start-maximized')
options.add_argument('disable-infobars')
options.add_argument('--profile-directory=Default')
options.add_argument("--incognito")
options.add_argument("--disable-plugins-discovery")
options.add_experimental_option("excludeSwitches", ["ignore-certificate-errors", "safebrowsing-disable-download-protection", "safebrowsing-disable-auto-update", "disable-client-side-phishing-detection"])
options.add_argument('--disable-extensions')
# `options=` replaces the deprecated `chrome_options=` keyword
browser = webdriver.Chrome(options=options)

# Start from a clean session with a fixed window geometry
browser.delete_all_cookies()
browser.set_window_size(800, 800)
browser.set_window_position(0, 0)

The website I'm trying to scrape uses DataDome for bot protection. Any clue?


Answer 1:


A bit more detail about your use case for scraping car data from different websites, or from https://www.leboncoin.fr/, would have helped us construct a more canonical answer. However, I was able to access the page source using Selenium as follows:

  • Code Block:

    from selenium import webdriver
    
    options = webdriver.ChromeOptions() 
    options.add_argument("start-maximized")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
    driver.get('https://www.leboncoin.fr/')
    print(driver.page_source)
    
  • Console Output:

    <html class="gServer"><head><link rel="preconnect" href="//fonts.googleapis.com" crossorigin=""><link rel="preload" href="https://fonts.googleapis.com/css2?family=Open+Sans:wght@400;600;700&amp;display=swap" crossorigin="" as="style"><link rel="stylesheet" href="https://fonts.googleapis.com/css2?family=Open+Sans:wght@400;600;700&amp;display=swap" crossorigin=""><style data-emotion-css=""></style><meta charset="utf-8"><link rel="manifest" href="/manifest.json"><link type="application/opensearchdescription+xml" rel="search" href="/opensearch.xml"><meta name="theme-color" content="#ff6e14"><meta property="og:locale" content="fr_FR"><meta property="og:site_name" content="leboncoin"><meta name="twitter:site" content="leboncoin"><meta http-equiv="P3P" content="CP=&quot;This is not a P3P policy&quot;"><meta name="viewport" content="initial-scale=1.0, width=device-width, maximum-scale=1.0, user-scalable=0"><script type="text/javascript" async="" src="https://www.googleadservices.com/pagead/conversion_async.js"></script><script type="text/javascript" async="" src="https://tp.realytics.io/sync/se/cnktbDNiMG5jb3xyeV83NTFGRUQwMy1CMDdGLTRBQTgtOTAxRi1DNUREMDVGRjkxQTJ8?ct=1&amp;rt=1&amp;u=https%3A%2F%2Fwww.leboncoin.fr%2F&amp;r=&amp;ts=1591306049397"></script><script type="text/javascript" async="" src="https://www.googleadservices.com/pagead/conversion_async.js"></script><script type="text/javascript" async="" src="https://www.googleadservices.com/pagead/conversion_async.js"></script><script type="text/javascript" async="" src="https://www.googletagmanager.com/gtag/js?id=AW-766292687&amp;l=dataLayer&amp;cx=c"></script><script type="text/javascript" async="" src="https://www.googletagmanager.com/gtag/js?id=AW-667462656&amp;l=dataLayer&amp;cx=c"></script><script type="text/javascript" async="" src="https://cdn-eu.realytics.net/realytics-1.2.min.js"></script><script type="text/javascript" async="" src="https://i.realytics.io/tc.js?cb=1591306047755"></script><script 
type="text/javascript" async="" src="https://www.googletagmanager.com/gtag/js?id=DC-4167650&amp;l=dataLayer&amp;cx=c"></script><script type="text/javascript" async="" src="https://www.googletagmanager.com/gtag/js?id=AW-744431185&amp;l=dataLayer&amp;cx=c"></script><script type="text/javascript" async="" charset="utf-8" src="//www.googleadservices.com/pagead/conversion_async.js" id="utag_82"></script><script type="text/javascript" async="" charset="utf-8" src="//sdk.mpianalytics.com/pulse.min.js" id="utag_47"></script><script async="true" type="text/javascript" src="https://sslwidget.criteo.com/event?a=50103&amp;v=5.5.0&amp;p0=e%3Dexd%26site_type%3Dd&amp;p1=e%3Dvh&amp;p2=e%3Ddis&amp;adce=1&amp;tld=leboncoin.fr&amp;dtycbr=6569" data-owner="criteo-tag"></script><script type="text/javascript" src="//try.abtasty.com/09643a1c5bc909059579da8aac99e8f1.js"></script><script>window.dataLayer = window.dataLayer || [];
    .
    .
    .
    <iframe height="1" width="1" style="display:none" src="//4167650.fls.doubleclick.net/activityi;src=4167650;type=slbc01;cat=all-site;u1=homepage;ord=9979622847645.51?" id="utag_179_iframe"></iframe></body></html>
    

However, it's quite evident from the DOM tree that the website is protected against bad bots by DataDome:


DataDome

The key features, as described by DataDome, are as follows:

  • DataDome is the only bot protection solution delivered as-a-service.
  • DataDome requires no architecture changes or DNS rerouting.
  • DataDome's bot detection engine compares every request to the website with a massive in-memory pattern database, and uses a blend of AI and machine learning to decide in less than 2 milliseconds whether access to your pages should be granted or not.
  • DataDome detects and identifies 100% of OWASP automated threats.
  • DataDome's Custom Rules function can even allow you to block human traffic from countries you are not selling to, or to allow partner bots to access your site only in specific circumstances.
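
Because a blocked request still returns HTML, printing the page source alone does not show whether you received the real page or a challenge page. The following is a rough sketch of how the two cases might be told apart. The function name classify_page and both marker strings are assumptions for illustration, based on what such pages commonly contain; they are not taken from DataDome's documentation and should be adjusted to what you actually observe.

```python
# Hypothetical marker strings (assumptions, not a documented DataDome contract).
CHALLENGE_MARKERS = ("captcha-delivery.com",)  # host often seen on captcha pages
PROTECTION_MARKERS = ("datadome",)             # e.g. a script tag or cookie name


def classify_page(page_source: str) -> str:
    """Return 'challenge', 'protected', or 'unknown' for the given HTML."""
    html = page_source.lower()
    # Check for a challenge first: a challenge page may also mention DataDome.
    if any(marker in html for marker in CHALLENGE_MARKERS):
        return "challenge"
    if any(marker in html for marker in PROTECTION_MARKERS):
        return "protected"
    return "unknown"
```

With Selenium this could be called as classify_page(driver.page_source) right after driver.get(...), so the scraper can stop or back off as soon as a challenge is served instead of parsing a captcha page as if it were content.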

Outro

Documentation on DataDome can be found at:

  • Bot detection
  • Server-side bot detection is not enough



Answer 2:


This could be happening for a myriad of reasons. Try going through the answer here, which describes some ways you can prevent this problem.

A simple solution that sometimes worked for me is to use waits or sleep calls in Selenium; see the section on Waits in the docs. Sleep calls can be done like so:

import time
time.sleep(2)  # pause for 2 seconds before the next action
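
A fixed two-second pause produces perfectly regular timing. One possible variation, shown here only as a sketch, is to add a random extra delay between requests. The helper name polite_sleep and its parameters are hypothetical, and nothing in this thread confirms that randomized delays are enough to avoid a DataDome captcha.

```python
import random
import time


def polite_sleep(base_seconds=2.0, jitter_seconds=1.0,
                 sleep=time.sleep, rng=random.uniform):
    """Sleep for base_seconds plus a random extra of up to jitter_seconds.

    `sleep` and `rng` are injectable so the helper can be exercised
    without actually waiting. Returns the delay that was used.
    """
    if base_seconds < 0 or jitter_seconds < 0:
        raise ValueError("delays must be non-negative")
    delay = base_seconds + rng(0.0, jitter_seconds)
    sleep(delay)
    return delay
```

Calling polite_sleep() between successive driver.get(...) calls would wait somewhere between 2 and 3 seconds each time.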


Source: https://stackoverflow.com/questions/62199222/website-using-datadome-gets-captcha-blocked-while-scraping-using-selenium-and-py
