Can a website detect when you are using selenium with chromedriver?

情歌与酒 2020-11-21 05:41

I've been testing out Selenium with Chromedriver and I noticed that some pages can detect that you're using Selenium even though there's no automation at all. Even when I

19 answers
  • 2020-11-21 06:11

    As we've already figured out in the question and the posted answers, there is an anti-web-scraping and bot detection service called "Distil Networks" in play here. And, according to an interview with the company's CEO:

    Even though they can create new bots, we figured out a way to identify Selenium, the tool they’re using, so we’re blocking Selenium no matter how many times they iterate on that bot. We’re doing that now with Python and a lot of different technologies. Once we see a pattern emerge from one type of bot, then we work to reverse engineer the technology they use and identify it as malicious.

    It will take time and additional effort to understand exactly how they are detecting Selenium, but here is what we can say for sure at the moment:

    • it's not related to the actions you take with Selenium: once you navigate to the site, you are detected and banned immediately. I tried adding artificial random delays between actions and pausing after the page loaded; nothing helped
    • it's not about the browser fingerprint either: I tried multiple browsers, with and without clean profiles, and incognito modes; nothing helped
    • since, according to the hint in the interview, this was "reverse engineering", I suspect it is done by some JS code executing in the browser that reveals the browser is automated via Selenium WebDriver

    Decided to post it as an answer, since clearly:

    Can a website detect when you are using selenium with chromedriver?

    Yes.


    Also, what I haven't experimented with is older Selenium and older browser versions. In theory, something could have been implemented or added to Selenium at a certain point that the Distil Networks bot detector currently relies on. If that is the case, we might detect (yeah, let's detect the detector) at what point/version the relevant change was made, look into the changelog and changesets and, maybe, get more information on where to look and what it is they use to detect a webdriver-powered browser. It's just a theory that needs to be tested.

  • 2020-11-21 06:12

    In addition to the great answer by @Erti-Chris Eelmaa: there is the annoying window.navigator.webdriver property, and it is read-only. Even if you try to set it to false, it will still read true. That's why a browser driven by automation software can still be detected. See MDN.
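
    To see why assigning to it does nothing, here is a minimal sketch. It assumes the property is a getter-only accessor on the prototype, which is how a read-only attribute is normally exposed to JavaScript. FakeNavigator and tryToClearWebdriverFlag are made-up names so the snippet runs in Node without a browser:

```javascript
// Hypothetical stand-in for the browser's Navigator interface so the sketch
// runs outside a browser. A read-only attribute is modelled here as an
// accessor property with a getter and no setter.
class FakeNavigator {
  get webdriver() {
    return true; // the value reported by an automated browser
  }
}

// Try to overwrite the flag by plain assignment and report what happened.
function tryToClearWebdriverFlag(nav) {
  'use strict';
  let threw = false;
  try {
    // Strict mode: throws a TypeError. Sloppy mode: silently ignored.
    nav.webdriver = false;
  } catch (err) {
    threw = err instanceof TypeError;
  }
  return { threw, value: nav.webdriver };
}

const result = tryToClearWebdriverFlag(new FakeNavigator());
console.log(result);
```

    Either way the getter keeps answering, so the page still sees true.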

    The property is controlled by Chrome's --enable-automation flag. chromedriver launches Chrome with that flag, and Chrome then sets window.navigator.webdriver to true. You can find it here. To avoid this, add the flag to the "exclude switches" list. For instance, in Go:

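    package main

    import (
        "fmt"
        "log"

        "github.com/tebeka/selenium"
        "github.com/tebeka/selenium/chrome"
    )

    func main() {
        caps := selenium.Capabilities{
            "browserName": "chrome",
        }

        // Drop the enable-automation switch that chromedriver adds by default.
        chromeCaps := chrome.Capabilities{
            Path:            "/path/to/chrome-binary",
            ExcludeSwitches: []string{"enable-automation"},
        }
        caps.AddChrome(chromeCaps)

        // Assumes a Selenium server is already listening on port 4444.
        wd, err := selenium.NewRemote(caps, fmt.Sprintf("http://localhost:%d/wd/hub", 4444))
        if err != nil {
            log.Fatalf("could not start session: %v", err)
        }
        defer wd.Quit()
    }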
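
    For comparison, here is a rough sketch of the same idea as the raw capabilities a client would send to chromedriver, written in JavaScript. buildChromeCapabilities is a hypothetical helper and the binary path is the same placeholder as above; goog:chromeOptions, binary and excludeSwitches are the ChromeDriver capability names. Whether dropping the switch is enough to reset navigator.webdriver may depend on the Chrome version, so treat this as a starting point:

```javascript
// Build capabilities that ask chromedriver to leave out the
// enable-automation switch. Hypothetical helper, for illustration only.
function buildChromeCapabilities(binaryPath) {
  return {
    browserName: 'chrome',
    'goog:chromeOptions': {
      binary: binaryPath, // placeholder path, replace with a real one
      excludeSwitches: ['enable-automation'],
    },
  };
}

// Body of a WebDriver "new session" request using those capabilities.
const payload = {
  capabilities: {
    alwaysMatch: buildChromeCapabilities('/path/to/chrome-binary'),
  },
};
console.log(JSON.stringify(payload, null, 2));
```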
    
  • 2020-11-21 06:14

    Basically, Selenium detection works by testing for predefined JavaScript variables that appear when running under Selenium. Bot detection scripts usually look for anything containing the word "selenium" or "webdriver" in any of the variables on the window object, and also for document variables called $cdc_ and $wdc_. Of course, all of this depends on which browser you are using; different browsers expose different things.

    I used Chrome, so all I had to do was make sure that $cdc_ no longer existed as a document variable, and voilà: download the chromedriver source code, modify it so that $cdc_ has a different name, and recompile.

    This is the function I modified in chromedriver:

    call_function.js:

    function getPageCache(opt_doc) {
      var doc = opt_doc || document;
      //var key = '$cdc_asdjflasutopfhvcZLmcfl_';
      var key = 'randomblabla_';
      if (!(key in doc))
        doc[key] = new Cache();
      return doc[key];
    }
    

    (Note the comment: all I did was rename $cdc_ to randomblabla_.)

    Here is some pseudo-code that demonstrates a few of the techniques that bot detection services might use:

    const runBotDetection = function () {
        var documentDetectionKeys = [
            "__webdriver_evaluate",
            "__selenium_evaluate",
            "__webdriver_script_function",
            "__webdriver_script_func",
            "__webdriver_script_fn",
            "__fxdriver_evaluate",
            "__driver_unwrapped",
            "__webdriver_unwrapped",
            "__driver_evaluate",
            "__selenium_unwrapped",
            "__fxdriver_unwrapped",
        ];
    
        var windowDetectionKeys = [
            "_phantom",
            "__nightmare",
            "_selenium",
            "callPhantom",
            "callSelenium",
            "_Selenium_IDE_Recorder",
        ];
    
        // Automation tools leave these globals on the window object.
        for (const key of windowDetectionKeys) {
            if (window[key]) {
                return true;
            }
        }
        // WebDriver implementations attach these helpers to the document.
        for (const key of documentDetectionKeys) {
            if (window['document'][key]) {
                return true;
            }
        }
    
        for (const documentKey in window['document']) {
            if (documentKey.match(/\$[a-z]dc_/) && window['document'][documentKey]['cache_']) {
                return true;
            }
        }
    
        if (window['external'] && window['external'].toString() && (window['external'].toString()['indexOf']('Sequentum') != -1)) return true;
    
        if (window['document']['documentElement']['getAttribute']('selenium')) return true;
        if (window['document']['documentElement']['getAttribute']('webdriver')) return true;
        if (window['document']['documentElement']['getAttribute']('driver')) return true;
    
        return false;
    };
    

    According to user @szx, it is also possible to simply open chromedriver.exe in a hex editor and do the replacement manually, without compiling anything.
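
    The manual replacement could also be scripted. Below is a minimal sketch in Node, assuming the key is stored as plain bytes in the binary. patchKey is a hypothetical helper, and xabc_ is an arbitrary replacement chosen only because it has the same length as $cdc_, so nothing else in the file shifts:

```javascript
// Replace every occurrence of oldKey in a binary buffer with newKey.
// Both keys must have the same byte length so no offsets in the file move.
function patchKey(buffer, oldKey, newKey) {
  const from = Buffer.from(oldKey, 'latin1');
  const to = Buffer.from(newKey, 'latin1');
  if (from.length !== to.length) {
    throw new Error('replacement must have the same length as the original key');
  }
  const patched = Buffer.from(buffer); // work on a copy, keep the input intact
  let count = 0;
  let pos = patched.indexOf(from);
  while (pos !== -1) {
    to.copy(patched, pos); // overwrite the key bytes in place
    count += 1;
    pos = patched.indexOf(from, pos + from.length);
  }
  return { patched, count };
}

// Demo on an in-memory buffer standing in for the chromedriver binary.
const fakeBinary = Buffer.from("\x00var key = '$cdc_asdjflasutopfhvcZLmcfl_';\x00", 'latin1');
const { patched, count } = patchKey(fakeBinary, '$cdc_', 'xabc_');
console.log(count, patched.toString('latin1'));
```

    For a real file you would read it with fs.readFileSync, patch it, and write the result to a copy rather than over the original.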

  • 2020-11-21 06:14

    It sounds like they are behind a web application firewall (WAF). Take a look at ModSecurity and OWASP to see how those work. In reality, what you are asking is how to evade bot detection. That is not what Selenium WebDriver is for: it is for testing your own web application, not for hitting other web applications. It is possible, but you would basically have to look at what a WAF checks for in its rule set and specifically avoid that with Selenium, if you can. Even then it might not work, because you don't know which WAF they are using. You took the right first step, which is faking the user agent. If that didn't work, though, then a WAF is in place and you probably need to get trickier.

    Edit: Point taken from another answer. Make sure your user agent is actually being set correctly first. Maybe have the browser hit a local web server, or sniff the outgoing traffic.
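
    A local web server for that check can be very small. Here is a sketch using Node's built-in http module; extractUserAgent and createEchoServer are made-up names for this example:

```javascript
const http = require('http');

// Pull the User-Agent out of a Node-style headers object
// (Node lowercases incoming header names).
function extractUserAgent(headers) {
  return headers['user-agent'] || '(no User-Agent header sent)';
}

// Create (but do not start) a server that echoes the User-Agent it receives.
function createEchoServer() {
  return http.createServer((req, res) => {
    const ua = extractUserAgent(req.headers);
    console.log('User-Agent seen by server:', ua);
    res.writeHead(200, { 'Content-Type': 'text/plain; charset=utf-8' });
    res.end(ua + '\n');
  });
}
```

    Start it with createEchoServer().listen(8000) (the port is arbitrary), then point the Selenium-driven browser at http://localhost:8000/ and compare what the server logs with the user agent you meant to set.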

  • 2020-11-21 06:15

    partial interface Navigator { readonly attribute boolean webdriver; };

    The webdriver IDL attribute of the Navigator interface must return the value of the webdriver-active flag, which is initially false.

    This property allows websites to determine that the user agent is under control by WebDriver, and can be used to help mitigate denial-of-service attacks.

    Taken directly from the 2017 W3C Editor's Draft of WebDriver. This heavily implies that, at the very least, future iterations of Selenium's drivers will be identifiable in order to prevent misuse. Ultimately, it's hard to tell without the source code exactly what causes chromedriver specifically to be detectable.

  • 2020-11-21 06:16

    The bot detection I've seen seems more sophisticated or at least different than what I've read through in the answers below.

    EXPERIMENT 1:

    1. I open a browser and web page with Selenium from a Python console.
    2. The mouse is already at a specific location where I know a link will appear once the page loads. I never move the mouse.
    3. I press the left mouse button once (this is necessary to take focus from the console where Python is running to the browser).
    4. I press the left mouse button again (remember, cursor is above a given link).
    5. The link opens normally, as it should.

    EXPERIMENT 2:

    1. As before, I open a browser and the web page with Selenium from a Python console.
    2. This time around, instead of clicking with the mouse, I use Selenium (in the Python console) to click the same element with a random offset.
    3. The link doesn't open, but I am taken to a sign up page.

    IMPLICATIONS:

    • opening a web browser via Selenium doesn't preclude me from appearing human
    • moving the mouse like a human is not necessary to be classified as human
    • clicking something via Selenium with an offset still raises the alarm

    It seems mysterious, but I guess they can simply determine whether an action originates from Selenium or not, while not caring whether the browser itself was opened via Selenium. Or can they determine whether the window has focus? It would be interesting to hear if anyone has any insights.
