Chrome crashes after several hours while multiprocessing using Selenium through Python

感情败类 · 2020-12-04 03:59

This is the error traceback after several hours of scraping:

The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.

2 Answers
  • 2020-12-04 04:15

    Right now I'm using the threading module to instantiate one WebDriver per thread:

    import threading
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    
    threadLocal = threading.local()
    
    def get_driver():
        # Reuse this thread's driver if one exists; otherwise create it
        browser = getattr(threadLocal, 'browser', None)
        if browser is None:
            chrome_options = Options()
            chrome_options.add_argument('--no-sandbox')
            chrome_options.add_argument("--headless")
            chrome_options.add_argument('--disable-dev-shm-usage')
            chrome_options.add_argument("--lang=en")
            chrome_options.add_argument("--start-maximized")
            chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
            chrome_options.add_experimental_option('useAutomationExtension', False)
            chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36")
            chrome_options.binary_location = "/usr/bin/google-chrome"
            browser = webdriver.Chrome(executable_path=r'/usr/local/bin/chromedriver', options=chrome_options)
            setattr(threadLocal, 'browser', browser)
        return browser
    

    It really helps me scrape faster than running one driver at a time.
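
    A minimal usage sketch (my addition, not part of the original setup): the scrape_title worker and the URL list below are hypothetical, and get_driver() is the function above.

    from concurrent.futures import ThreadPoolExecutor

    def scrape_title(url):
        browser = get_driver()  # each thread lazily creates and reuses its own driver
        browser.get(url)
        return browser.title

    urls = ["https://selenium.dev/downloads/", "https://selenium.dev/documentation/en/"]
    with ThreadPoolExecutor(max_workers=2) as executor:
        print(list(executor.map(scrape_title, urls)))

    One caveat: drivers parked in threadLocal are never explicitly quit, so Chrome processes can pile up over a long run; quitting each thread's driver once the pool is done is worth considering.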

  • 2020-12-04 04:21

    I took your code, modified it a bit to suit my test environment, and here are the execution results:

    • Code Block:

      • multiprocess.py:

        import time
        from multiprocessing import Pool
        from multiprocessingPool.scrape import run_scrape
        
        if __name__ == '__main__':
            start_time = time.time()
            links = ["https://selenium.dev/downloads/", "https://selenium.dev/documentation/en/"] 
            pool = Pool(2)
            results = pool.map(run_scrape, links)
            pool.close()
            print("Total Time Processed: "+"--- %s seconds ---" % (time.time() - start_time)) 
        
      • scrape.py:

        from selenium import webdriver
        from selenium.common.exceptions import NoSuchElementException, TimeoutException
        from selenium.webdriver.chrome.options import Options
        
        def run_scrape(link):
            chrome_options = Options()
            chrome_options.add_argument('--no-sandbox')
            chrome_options.add_argument("--headless")
            chrome_options.add_argument('--disable-dev-shm-usage')
            chrome_options.add_argument("--lang=en")
            chrome_options.add_argument("--start-maximized")
            chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
            chrome_options.add_experimental_option('useAutomationExtension', False)
            chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36")
            chrome_options.binary_location = r'C:\Program Files (x86)\Google\Chrome\Application\chrome.exe'
            browser = webdriver.Chrome(executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe', options=chrome_options)
            try:
                browser.get(link)
                print(browser.title)
            except (NoSuchElementException, TimeoutException):
                print("Error")
            finally:
                # Always quit so a failed navigation cannot leak a Chrome process
                browser.quit()
        
    • Console Output:

      Downloads
      The Selenium Browser Automation Project :: Documentation for Selenium
      Total Time Processed: --- 10.248600006103516 seconds ---
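
    One refinement worth considering for multi-hour runs (my sketch, not part of the test above): launching and quitting Chrome for every link is expensive, and any leaked process accumulates over time. multiprocessing.Pool accepts an initializer, so each worker process can create a single driver once and reuse it across links. This assumes chromedriver is on the PATH.

      import atexit
      from multiprocessing import Pool
      from selenium import webdriver
      from selenium.webdriver.chrome.options import Options

      driver = None  # one WebDriver per worker process

      def init_driver():
          # Runs once in every worker process when the pool starts
          global driver
          chrome_options = Options()
          chrome_options.add_argument("--headless")
          chrome_options.add_argument("--no-sandbox")
          chrome_options.add_argument("--disable-dev-shm-usage")
          driver = webdriver.Chrome(options=chrome_options)
          atexit.register(driver.quit)  # best-effort cleanup on normal worker exit

      def scrape_title(link):
          driver.get(link)
          return driver.title

      if __name__ == '__main__':
          links = ["https://selenium.dev/downloads/", "https://selenium.dev/documentation/en/"]
          pool = Pool(2, initializer=init_driver)
          print(pool.map(scrape_title, links))
          pool.close()
          pool.join()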
      

    Conclusion

    It is pretty evident that your program is logically sound and works as intended.


    Your use case

    As you mentioned, this error surfaces after several hours of scraping, so I suspect it stems from the fact that WebDriver is not thread-safe. Having said that, if you can serialize access to the underlying driver instance, you can share one reference across more than one thread (see the sketch below). This is not advisable, though: you can always instantiate one WebDriver instance for each thread instead.
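
    A minimal sketch of what serialized access could look like (my illustration; shared_driver stands for a single driver instance created elsewhere):

      import threading

      driver_lock = threading.Lock()

      def safe_title(shared_driver, url):
          # Only one thread may command the shared driver at any moment
          with driver_lock:
              shared_driver.get(url)
              return shared_driver.title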

    Ideally, the thread-safety issue isn't in your code but in the actual browser bindings, which all assume there will only be one command at a time (as with a real user). On the other hand, you can always instantiate one WebDriver instance per thread, each of which can launch multiple browsing tabs/windows. Up to this point your program looks perfect.

    Now, different threads can be run against the same WebDriver, but the results of the tests will not be what you expect. The reason is that running different tests in different tabs/windows via multi-threading requires a bit of thread-safety coding; otherwise, actions such as click() or send_keys() will go to whichever opened tab/window currently has focus, regardless of the thread you expect to be running. In effect, all the tests run simultaneously on the focused tab/window instead of the intended ones.
