Headless chrome with python suspends when trying to download a file

Question

I'm using Python, Jupyter, Selenium webdriver and headless chrome (with Canary) on Mac.

I wrote a script that scrapes a very old website. To download a file from that website I need to click through several buttons, which eventually lead to a button that downloads a CSV file when clicked.

The problem is that when headless Chrome tries to download the target file, it hangs and does nothing (i.e. it doesn't download the required file), even though the script finishes running (and yes, I did close the browser at the end of the script).

I tried:

Downloading other files (from different websites) and headless chrome seems to download them without any problems (I enabled the headless chrome option to download files successfully)
Taking snapshots of the website to make sure it's navigating correctly to the download page (and yes, it's navigating correctly)
Modifying the user agent (it appears to be using the user agent I expect it to)
Running the exact same code without the headless option - it downloads the file successfully with regular Chrome
Changing the plugins and languages with a JS script on the driver, using driver.execute_script(js_that_changes_plugins_and_langs), but I'm not sure how to check whether it's actually executed (and it's still not working)

Problems I'm facing:

I can't find a way to get just the final download URL, because it seems to use unique IDs generated along the way (they are issued when you go to the homepage and when you navigate between pages of the site), so it changes with every session
The navigation URLs seem to originate from iframes inside the homepage (and also in the following pages), and I'm not sure how to inspect the JavaScript that generates them

I don't have any problem providing the website URL but:

You have to go through about 6 clicks on different pages just to get to the last page with the download button. These clicks are not intuitive, and it would take a lot of explaining to describe how to navigate to where I want
This site is not in English which will make it even harder to explain how to navigate

I need it to be headless, as opposed to regular Chrome, since the machine where we want to run the code is very weak and cannot run the Chrome GUI.

So my question is: does anyone know what the problem may be? Or at least, how can I debug it?

This is more or less the code that I'm using:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def enable_download_in_headless_chrome(driver, download_dir):
    """
    there is currently a "feature" in chrome where
    headless does not allow file download: https://bugs.chromium.org/p/chromium/issues/detail?id=696481
    This method is a hacky work-around until the official chromedriver support for this.
    Requires chrome version 62.0.3196.0 or above.
    """

    # add missing support for chrome "send_command"  to selenium webdriver
    driver.command_executor._commands["send_command"] = ("POST", '/session/' + driver.session_id + '/chromium/send_command')

    params = {'cmd': 'Page.setDownloadBehavior', 'params': {'behavior': 'allow', 'downloadPath': download_dir}}
    command_result = driver.execute("send_command", params)
    print("response from browser:")
    for key in command_result:
        print("result:" + key + ":" + str(command_result[key]))

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('headless')
chrome_options.add_argument('no-sandbox')
chrome_options.add_argument('disable-gpu')
chrome_options.add_argument('remote-debugging-port=9222')
chrome_options.add_argument('disable-popup-blocking')
chrome_options.add_argument('enable-logging')
download_dir = # some path here
driver = webdriver.Chrome(chrome_options=chrome_options)
enable_download_in_headless_chrome(driver, download_dir)
ok_button = driver.find_element_by_id('the-button-name')
ok_button.click()

Thanks for the help

Answer 1:

I think there are too many moving parts here. If you really need Selenium and all the rest, that is OK. However, I would start with something as simple as possible.

On Python 2.7 I was using mechanize, which let me mimic the whole communication with the server. Today that is not the best option, since Python 3.x is the way to go. I'll describe how I worked on this kind of problem, just to give you a better picture, and then I'll try to describe possible tools.

The typical case was: log in, go over the page, flip some switches, and trigger a download, or fetch the content and process it with Beautiful Soup. To start, you need to see what information is exchanged. Open the developer tools in your web browser and choose the Network tab. Perhaps you know that, but this step is mandatory, and I'm supposed to write a general answer. Then do your normal work: log in and go through the other steps. Everything the server acts on must be transmitted, so you can see it as network requests. Mechanize was good because I could prepare a dict and send it as a POST request to the page. Speaking of POST: a typical mistake is posting to the page address. If you visited index.html and post to that page, the server may actually expect the data at add_user_data.html, after which you are redirected. Things like a session ID can be carried in a header entry or a cookie; just look at the network communication for the pattern.

As I wrote, Python 2.7 is going to be discontinued. At the time of writing, mechanize is not available for Python 3.x, so other tools should be used. You can look for mechanize alternatives and see what works for you. The typical answer is Scrapy. That is a somewhat different tool, aimed more at scraping web pages, so if you plan something bigger it may be the best option. If you need a single script, I would start with httpie: a command-line tool and Python package with good OS X support; you can send forms, and session management is also available. I use it every day, although my server is stateless.

I would be more than happy to provide exact examples, but without server information that is not possible. Can you please attach a dump of a sample session? Anonymize it, and I'll provide a sample, or maybe another tool would be a better fit.
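To illustrate the point about posting to the form's target rather than to the page you visited, here is a minimal sketch using only the Python 3 standard library (urllib, not mechanize). The URLs, the field names and the helper names build_form_post and make_session_opener are hypothetical; the real values have to be copied from the Network tab.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode, urljoin
from urllib.request import HTTPCookieProcessor, Request, build_opener


def build_form_post(page_url, form_action, fields):
    """Build a POST request aimed at the form's action URL, not the page URL."""
    target = urljoin(page_url, form_action)    # resolve a relative action
    body = urlencode(fields).encode("utf-8")   # form-encode the fields
    return Request(target, data=body, method="POST")


def make_session_opener():
    """Return an opener that stores cookies and sends them on later requests."""
    jar = CookieJar()
    return build_opener(HTTPCookieProcessor(jar)), jar


# Hypothetical values: replace them with what the Network tab shows.
request = build_form_post(
    "http://example.com/index.html",
    "add_user_data.html",
    {"user": "alice", "lang": "en"},
)
opener, cookies = make_session_opener()
# opener.open(request) would send it; omitted so the sketch runs offline.
```

Reusing the same opener for every request keeps any session cookie the server sets. If the session ID travels in a header instead, it would have to be added to each Request by hand.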

Answer 2:

As you do not provide the URL you download from, this is guesswork. The target most likely has a reCAPTCHA-like wall installed to prevent scraping. So make sure you don't hit this wall, and if you do, implement code that notifies you to perform a manual step to gain access.

For JS, this solution was given by zavodnyuk here:

Try to set a custom User-Agent with a compatible one (e.g. from your real browser).

capabilities: { 'browserName': 'chrome', chromeOptions: { args: [ "user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36", "--headless", "--disable-gpu" ] } }

This worked for selenium/protractor on JS.

I hope this points you in the right direction, as not much has been written about this for Python.
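A rough Python counterpart of the quoted configuration might look like the sketch below. The helper name headless_chrome_args is hypothetical. The user-agent string is the one from the quote and should be replaced by the one your real browser sends. The Selenium part is left as comments so the snippet runs without a browser.

```python
def headless_chrome_args(user_agent):
    """Return the Chrome command-line arguments used in the quoted config."""
    return ["user-agent=" + user_agent, "--headless", "--disable-gpu"]


args = headless_chrome_args(
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"
)

# With Selenium installed, the arguments would be applied like this:
# from selenium import webdriver
# chrome_options = webdriver.ChromeOptions()
# for arg in args:
#     chrome_options.add_argument(arg)
```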

EDIT based on comment1:

For basic debugging I rely on print statements at the start of the candidate functions. Where I say print statement, it can just as well be a line written to a file. I don't rely on fancy third-party packages, because most of the time I want to learn from the code; this approach is time-consuming but well worth the effort. For example, this is how I bluntly debug:

def header_inspect(self, ID, action, data):
    print('header_inspect, ID : %s\n, action : %s\nprocess-data : %s' % (ID, action, data))
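The same idea can be applied without editing each function by hand. This is a minimal sketch using the standard logging module. The trace decorator, the logger name and the sample arguments are hypothetical, and the log goes to an in-memory buffer here only so that the example is self-contained.

```python
import functools
import io
import logging


def trace(logger):
    """Decorator factory: log each call's name and arguments before running it."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            logger.debug("%s args=%r kwargs=%r", func.__name__, args, kwargs)
            return func(*args, **kwargs)
        return wrapper
    return decorate


# In-memory buffer for the demo; logging.FileHandler("debug.log") would
# write the same lines to a file instead.
buffer = io.StringIO()
log = logging.getLogger("scraper_debug")
log.setLevel(logging.DEBUG)
log.addHandler(logging.StreamHandler(buffer))


@trace(log)
def header_inspect(ID, action, data):
    return action


header_inspect(7, "download", {"file": "report.csv"})
```

After the call, buffer.getvalue() holds one line naming the function and its arguments, which is the same information as the print statement above.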

Answer 3:

With no specific information, the only advice we can give you is about how to find out what is going on.

What about proceeding step by step, manually, in headed mode for debugging purposes? My bet is that your problem comes from automating the task rather than from running headless.

Execute your script with all your imports and function definitions (e.g. enable_download_in_headless_chrome), without calling any of them. That is, run everything up to download_dir = # some path here, and then, in the Python shell, type

>>> driver = webdriver.Chrome(chrome_options=chrome_options)

Now interact manually with your browser: open Chrome DevTools and go to the Console. Make sure that errors will be displayed. Then continue and type the rest of your commands

>>> enable_download_in_headless_chrome(driver, download_dir)
>>> ...
>>> ok_button.click()

What does it say?
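While stepping through like this, it can also help to check from Python whether Chrome wrote anything to download_dir before the driver is closed. This is a minimal sketch, assuming that the directory starts out empty and that Chrome keeps in-progress downloads under a .crdownload suffix. wait_for_download and its parameters are hypothetical names, not part of Selenium.

```python
import os
import tempfile
import time


def wait_for_download(download_dir, timeout=30.0, poll=0.5):
    """Poll download_dir until a finished file appears or the timeout expires.

    Returns the sorted names of finished files, or an empty list on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        names = os.listdir(download_dir)
        partial = [n for n in names if n.endswith(".crdownload")]
        finished = sorted(n for n in names if not n.endswith(".crdownload"))
        if finished and not partial:
            return finished
        if time.monotonic() >= deadline:
            return []
        time.sleep(poll)


# Offline demonstration with a temporary directory standing in for download_dir.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "report.csv"), "w").close()
    demo_result = wait_for_download(demo_dir, timeout=1.0, poll=0.1)
```

In the real script the call would go between ok_button.click() and closing the driver. An empty result after the timeout suggests the download never started, rather than being cut off when the script ended.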

Source: https://stackoverflow.com/questions/48630484/headless-chrome-with-python-suspends-when-trying-to-download-a-file

Tags

python

python-3.x

selenium-webdriver

google-chrome-headless