Developing scraping script on docker image - how to overcome lack of visual browser?

Question

I want to scrape information from the web. A previous attempt taught me that Docker would have been useful: I develop the script on Mac OS X and then run it on a VM, usually Ubuntu, where it often fails because the dependencies are missing and have proven difficult to build.

Docker overcomes the dependency issue, but it leads to a different problem: I need to develop the script in non-headless mode on the Docker image to see what it is doing (or at least I think I do), and I don't think it's possible to run the browser in non-headless mode on Docker.

How do others overcome this issue or otherwise get around it?

I'm using Python 3 and Selenium on a Docker image that @Harald Norgren helped me build.

This is the sort of script I'm running. It doesn't really do anything yet; I'm including it only to provide more background, in case it's helpful.

import csv
import time
from selenium import webdriver
import os
import logging #logging.warning(data_store+file)
import json

project_dir = os.path.dirname(os.path.realpath(__file__))
data_store = project_dir+"/trends-data/"
archive_folder = "archive"
data_archive = data_store + archive_folder + "/"

chromeOptions = webdriver.ChromeOptions()
chromeOptions.add_argument("--headless")
prefs = {"download.default_directory" : data_store}
chromeOptions.add_experimental_option("prefs",prefs)
driver = webdriver.Chrome(
    project_dir+'/chromedriver',
    chrome_options=chromeOptions
)

driver.get('https://trends.google.co.uk/trends/explore?q=query')
time.sleep(5)
driver.find_element_by_class_name("ic_googleplus_reshare").click()
time.sleep(5)
driver.find_element_by_class_name("csv-image").click()
time.sleep(5)
driver.quit()

Answer 1:

Develop the script locally first, in a python3 venv with headed Chrome. Once the visual part of the scraping work is done, run the script in Docker to avoid any dependency issues.

Also, for headless Chrome to run in Docker, add this argument to your chromeOptions:

chromeOptions.add_argument("--no-sandbox")
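
One way to keep a single script for both stages (headed on your Mac, headless in Docker) is to choose the Chrome flags from environment variables. The following is only a sketch under stated assumptions: the function name chrome_arguments and the variables SCRAPER_HEADLESS and SCRAPER_IN_DOCKER are made up for illustration and are not part of Selenium or Docker.

```python
import os


def chrome_arguments(env=None):
    """Return the Chrome flags to use for the current environment.

    SCRAPER_HEADLESS and SCRAPER_IN_DOCKER are hypothetical variable
    names chosen for this sketch; rename them to suit your project.
    """
    env = os.environ if env is None else env
    args = []
    # Headless is the default; set SCRAPER_HEADLESS=0 locally to watch the browser.
    if env.get("SCRAPER_HEADLESS", "1") != "0":
        args.append("--headless")
    # Only disable the sandbox when the container says it is needed.
    if env.get("SCRAPER_IN_DOCKER", "0") == "1":
        args.append("--no-sandbox")
    return args


if __name__ == "__main__":
    # Show which flags would be passed to Chrome in this environment.
    print(chrome_arguments())
```

In the question's script, you would then replace the hard-coded add_argument("--headless") call with a loop such as "for arg in chrome_arguments(): chromeOptions.add_argument(arg)". Run with SCRAPER_HEADLESS=0 while developing locally, and set SCRAPER_IN_DOCKER=1 in the container (for example with an ENV line in the Dockerfile).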

Source: https://stackoverflow.com/questions/47977620/developing-scraping-script-on-docker-image-how-to-overcome-lack-of-visual-brow

Tags

python-3.x

Docker

selenium-webdriver

headless-browser