Is there a version of selenium webdriver that is not detectable?

清歌不尽 2020-11-28 11:37

I am running ChromeDriver through Selenium on an Ubuntu server, behind a residential proxy network, yet Selenium is still being detected. Is there a way to make chrome driver an

1 answer
  • 2020-11-28 12:17

    Whether a Selenium-driven WebDriver gets detected doesn't depend on any specific Selenium, Chrome or ChromeDriver version. Websites themselves can inspect the network traffic and identify the browser client, i.e. the web browser, as WebDriver controlled.

    However some generic approaches to avoid getting detected while web-scraping are as follows:

    • One of the first attributes a website can use to identify your script/program is your monitor size, so it is recommended not to use the conventional viewport.
    • If you need to send multiple requests to a website, keep changing the user agent on each request. You can find a detailed discussion in "Way to change Google Chrome user agent in Selenium?"
    • To simulate human-like behavior, you may need to slow down script execution even beyond WebDriverWait and expected_conditions by adding time.sleep(secs). You can find a detailed discussion in "How to sleep webdriver in python for milliseconds"
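    The user-agent rotation and randomized pauses described above can be sketched roughly as follows. This is a minimal sketch, not part of the original answer: the helper names (pickUserAgent, humanDelayMs, sleep) and the user-agent strings are illustrative assumptions, and wiring them into a particular WebDriver binding is left out.

```javascript
// Example user-agent values (assumptions for illustration; substitute your own list).
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
];

// Hypothetical helper: pick one entry uniformly at random.
// rng is injectable so the choice can be made deterministic in tests.
function pickUserAgent(agents = USER_AGENTS, rng = Math.random) {
  return agents[Math.floor(rng() * agents.length)];
}

// Hypothetical helper: return a delay in milliseconds in the range [minMs, maxMs).
function humanDelayMs(minMs, maxMs, rng = Math.random) {
  return minMs + Math.floor(rng() * (maxMs - minMs));
}

// Promise-based sleep, the JavaScript counterpart of time.sleep(secs).
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
```

    Inside an async function, a pause between two actions would then look like await sleep(humanDelayMs(500, 1500)), with pickUserAgent() supplying the value for the browser's user-agent option on each new session.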

    @Antoine Vastel, in his blog post Detecting Chrome Headless, describes several approaches that distinguish the Chrome browser from a headless Chrome browser:

    • User agent: The user agent attribute is commonly used to detect the OS as well as the browser of the user. With headless Chrome version 59 it has the following value:

      Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/59.0.3071.115 Safari/537.36
      
      • A check for the presence of Chrome headless can be done through:

        if (/HeadlessChrome/.test(window.navigator.userAgent)) {
            console.log("Chrome headless detected");
        }
        
    • Plugins: navigator.plugins returns an array of the plugins present in the browser. Typically, on Chrome we find default plugins, such as Chrome PDF viewer or Google Native Client. In contrast, in headless mode the returned array contains no plugins.

      • A check for the presence of Plugins can be done through:

        if(navigator.plugins.length == 0) {
            console.log("It may be Chrome headless");
        }
        
    • Languages: In Chrome, two JavaScript attributes make it possible to obtain the languages used by the user: navigator.language and navigator.languages. The first one is the language of the browser UI, while the second one is an array of strings representing the user's preferred languages. However, in headless mode, navigator.languages returns an empty string.

      • A check for the presence of Languages can be done through:

        if(navigator.languages == "") {
             console.log("Chrome headless detected");
        }
        
    • WebGL: WebGL is an API to perform 3D rendering in an HTML canvas. With this API, it is possible to query the vendor and the renderer of the graphics driver. With a vanilla Chrome on Linux, we can obtain the following values for renderer and vendor: Google SwiftShader and Google Inc. In headless mode, we can obtain Mesa OffScreen, which is the technology used for rendering without any sort of window system, and Brian Paul, the name of the developer who started the open source Mesa graphics library.

      • A check on the WebGL vendor and renderer can be done through:

        var canvas = document.createElement('canvas');
        var gl = canvas.getContext('webgl');

        // getContext and getExtension both return null when unsupported
        var debugInfo = gl && gl.getExtension('WEBGL_debug_renderer_info');
        if (debugInfo) {
            var vendor = gl.getParameter(debugInfo.UNMASKED_VENDOR_WEBGL);
            var renderer = gl.getParameter(debugInfo.UNMASKED_RENDERER_WEBGL);

            if (vendor == "Brian Paul" && renderer == "Mesa OffScreen") {
                console.log("Chrome headless detected");
            }
        }
        
      • Not all headless Chrome builds will have the same values for vendor and renderer. Some keep values that could also be found on the non-headless version. However, Mesa OffScreen and Brian Paul indicate the presence of the headless version.

    • Browser features: The Modernizr library makes it possible to test whether a wide range of HTML and CSS features are present in a browser. The only difference we found between Chrome and headless Chrome was that the latter did not have the hairline feature, which detects support for hidpi/retina hairlines.

      • A check for the presence of hairline feature can be done through:

        if(!Modernizr["hairline"]) {
            console.log("It may be Chrome headless");
        }
        
    • Missing image: The last check on our list, which also seems to be the most robust, comes from the dimensions of the image used by Chrome when an image cannot be loaded. In a vanilla Chrome, the image has a width and a height that depend on the zoom of the browser but are different from zero. In a headless Chrome, the image has a width and a height equal to zero.

      • A check on the dimensions of a missing image can be done through:

        var body = document.getElementsByTagName("body")[0];
        var image = document.createElement("img");
        image.setAttribute("id", "fakeimage");
        // Attach the handler before setting src so the error event cannot be missed
        image.onerror = function() {
            if (image.width == 0 && image.height == 0) {
                console.log("Chrome headless detected");
            }
        };
        image.src = "http://iloveponeydotcom32188.jg";
        body.appendChild(image);
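
    The navigator-based checks above (user agent, plugins, languages) can be gathered into a single function. This is a minimal sketch under stated assumptions: headlessSignals is a hypothetical helper name, and it accepts any object shaped like window.navigator so that it can also be exercised outside a browser. It treats an empty or missing languages value as a signal, which covers the empty string described above.

```javascript
// Hypothetical helper: returns the names of the navigator-based checks
// that suggest headless Chrome. `nav` is any navigator-like object.
function headlessSignals(nav) {
  const signals = [];

  // User agent check: headless Chrome advertises "HeadlessChrome".
  if (/HeadlessChrome/.test(nav.userAgent || "")) {
    signals.push("userAgent");
  }

  // Plugins check: no plugins are reported in headless mode.
  if (!nav.plugins || nav.plugins.length === 0) {
    signals.push("plugins");
  }

  // Languages check: matches an empty string, an empty array, or a missing value.
  if (!nav.languages || nav.languages.length === 0) {
    signals.push("languages");
  }

  return signals;
}
```

    In a page, headlessSignals(window.navigator) returns an array naming whichever checks fired. As the original checks note, some of these only indicate that the browser may be headless, so the result is best read as a set of hints rather than proof.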
        

    References

    You can find a couple of similar discussions in:

    • How to bypass Google captcha with Selenium and python?
    • How to make Selenium script undetectable using GeckoDriver and Firefox through Python?

    tl; dr

    • Selenium webdriver: Modifying navigator.webdriver flag to prevent selenium detection
    • How does recaptcha 3 know I'm using selenium/chromedriver?
    • Selenium and non-headless browser keeps asking for Captcha