How to reuse a selenium driver instance during parallel processing?

醉梦人生 2021-01-27 23:55

To scrape a pool of URLs, I am running selenium in parallel with joblib. In this context, I am facing two challenges:

  • Challenge 1 is to speed up this process. In t
1 answer
  • 2021-01-28 00:30

    1) First create a set of drivers, one per worker, and give each worker its own instance. I don't know how to pass drivers to a Parallel object, but you can use threading.current_thread().name as a key to identify each driver. For that to work, use backend="threading". Each thread then has its own driver.

    2) You don't need a loop at all. The Parallel object itself iterates over all your URLs (I hope I have understood correctly why you wanted a loop).

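    import threading
    from joblib import Parallel, delayed
    from selenium import webdriver

    # One driver per worker thread, keyed by the thread's name.
    drivers = {}

    def scrape(URL):
        name = threading.current_thread().name
        # Create a driver the first time this thread runs a task, then reuse it.
        if name not in drivers:
            drivers[name] = webdriver.Firefox()
        driver = drivers[name]
        driver.get(URL)
        results = do_something(driver)  # do_something: your own parsing function
        if results:
            safe_results("results.csv")  # safe_results: your own saving function, call kept as given

    try:
        # URL_list is your list of URLs; Parallel iterates over it, so no outer loop is needed.
        Parallel(n_jobs=-1, backend="threading")(delayed(scrape)(URL) for URL in URL_list)
    finally:
        # Close every browser, even if one of the tasks raised.
        for driver in drivers.values():
            driver.quit()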
    

    But I don't really think you gain anything by setting n_jobs higher than the number of CPUs you have, so n_jobs=-1 (which uses all CPUs) is the best choice. Of course I may be wrong, so try it.
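One way to "try it" is to time the same batch with a few worker counts. This is a rough sketch, not a benchmark of selenium: fake_fetch is a hypothetical stand-in that only sleeps, the way a thread waits on the network, and ThreadPoolExecutor is used instead of joblib so the snippet runs with the standard library alone. time_run and compare are likewise made-up helper names.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url, delay=0.05):
    """Hypothetical stand-in for driver.get(url): waits, then returns the URL."""
    time.sleep(delay)
    return url


def time_run(urls, workers):
    """Return the wall-clock seconds needed to process urls with the given worker count."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_fetch, urls))
    return time.perf_counter() - start


def compare(urls, worker_counts):
    """Map each worker count to its measured run time in seconds."""
    return {n: time_run(urls, n) for n in worker_counts}
```

For a real measurement, replace fake_fetch with your own scrape function and compare, for example, the CPU count against twice that number. The stand-in only shows how to measure; a real browser also uses CPU and memory, so the best worker count has to come from your own timings.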
