Developing scraping script on docker image - how to overcome lack of visual browser?

Submitted by 倖福魔咒の on 2019-12-11 15:29:42

Question


I want to scrape information from the web. A previous attempt taught me that Docker would have been useful: I develop the script on Mac OS X and then run it on a VM, often Ubuntu, where it frequently fails because the dependencies don't exist on Ubuntu and have proven difficult to build.

Docker solves the dependency issue, but it leads to a different problem: I need to develop the script in non-headless mode on the Docker image to see what it's doing (or at least I think I do), and I don't think it's possible to run the browser in non-headless mode on Docker.

How do others overcome this issue or otherwise get around it?

I'm using Python 3 and Selenium on this image, which @Harald Norgren helped me build here

This is the sort of script I'm running. It doesn't really do anything yet; I'm just including it to provide more background in case it's helpful.

import csv
import json
import logging  # e.g. logging.warning(data_store + file)
import os
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Folders for downloaded data, relative to this script
project_dir = os.path.dirname(os.path.realpath(__file__))
data_store = project_dir + "/trends-data/"
archive_folder = "archive"
data_archive = data_store + archive_folder + "/"

# Run Chrome headless and save downloads into data_store
chromeOptions = webdriver.ChromeOptions()
chromeOptions.add_argument("--headless")
prefs = {"download.default_directory": data_store}
chromeOptions.add_experimental_option("prefs", prefs)

# Selenium 4 style: the driver path is passed via Service, the options via `options`
driver = webdriver.Chrome(
    service=Service(project_dir + '/chromedriver'),
    options=chromeOptions
)

# Open the page, then click two elements located by class name
driver.get('https://trends.google.co.uk/trends/explore?q=query')
time.sleep(5)
driver.find_element(By.CLASS_NAME, "ic_googleplus_reshare").click()
time.sleep(5)
driver.find_element(By.CLASS_NAME, "csv-image").click()
time.sleep(5)
driver.quit()

Answer 1:


Develop the script locally first, in a Python 3 venv with headed (non-headless) Chrome. Once the visual part of the scraping work is done, run the script in Docker to avoid any dependency issues.

Also, for headless Chrome to run in Docker, add this argument to your chromeOptions:

chromeOptions.add_argument("--no-sandbox")
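One way to keep a single script for both stages (headed locally, headless in Docker) might be to pick the Chrome flags from an environment variable. The following is only a rough sketch: the variable name SCRAPER_IN_DOCKER and the helpers chrome_arguments and build_driver are made up for illustration, and build_driver assumes Selenium 4 with a chromedriver that Selenium can locate on its own.

```python
import os


def chrome_arguments(in_container):
    """Return the Chrome command-line flags for the given environment."""
    args = []
    if in_container:
        # No display is available in the container, so run without a window.
        args.append("--headless")
        # The sandbox flag suggested in the answer for running under Docker.
        args.append("--no-sandbox")
    return args


def build_driver(in_container):
    """Create a Chrome driver; headed locally, headless in the container."""
    # Imported here so chrome_arguments() can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments(in_container):
        options.add_argument(arg)
    # Assumption: chromedriver is on PATH or is resolved by Selenium itself.
    return webdriver.Chrome(options=options)


# Hypothetical variable; set it in the Dockerfile, e.g. ENV SCRAPER_IN_DOCKER=1
IN_CONTAINER = os.environ.get("SCRAPER_IN_DOCKER") == "1"
```

Locally, build_driver(IN_CONTAINER) would then open a visible browser window, while the same call inside the container would start Chrome headless with the sandbox flag set.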


Source: https://stackoverflow.com/questions/47977620/developing-scraping-script-on-docker-image-how-to-overcome-lack-of-visual-brow
