Headless Chrome with Python suspends when trying to download a file

Submitted by 无人久伴 on 2019-12-05 03:53:59

I think there are too many moving parts here. If you really need Selenium and all the rest, that is OK. However, I would start with something as simple as possible.

On Python 2.7 I used mechanize, which let me mimic the whole communication with the server. Today that is not the best option, since Python 3.x is the way to go. I'll first describe how I worked on this kind of problem, just to give you a better picture, and then I'll try to describe possible tools.

So the typical case was: log in, go over the page, turn some switches, and trigger a download, or fetch content and process it with Beautiful Soup. To start, you need to see what information is exchanged. Open the developer tools in your web browser and choose the Network tab. Perhaps you know that already, but this step is mandatory, and I'm supposed to write a general answer. Then do your normal work: log in and go through the other steps. Everything the server takes care of must be transmitted, so you can see it as network requests. Mechanize was good because I could prepare a dict and send it as a POST request to the page. Speaking of POST, a typical mistake is posting to the address of the page you are looking at. If you visited index.html and post to that page, while the server expects the data at add_user_data.html and redirects you afterwards, the request will not do what you want. Things like a session ID can be carried by a header entry or a cookie; just look at the network communication for the pattern.
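The point about posting to the right address can be sketched with the standard library alone. This is a minimal sketch, not the mechanize code described above: the helper names build_form_post and make_session_opener, the example.com host and the form field are made up for illustration; only index.html and add_user_data.html come from the text.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode, urljoin
from urllib.request import HTTPCookieProcessor, Request, build_opener


def build_form_post(page_url, form_action, fields, extra_headers=None):
    """Build a POST request aimed at the form's action URL, not at the page URL."""
    # Resolve the action (as seen in the Network tab) relative to the visited page.
    target = urljoin(page_url, form_action)
    body = urlencode(fields).encode("utf-8")
    request = Request(target, data=body, method="POST")
    request.add_header("Content-Type", "application/x-www-form-urlencoded")
    # Session IDs sent as headers (rather than cookies) can be passed in here.
    for name, value in (extra_headers or {}).items():
        request.add_header(name, value)
    return request


def make_session_opener():
    """Return an opener that stores cookies and sends them back on later requests."""
    return build_opener(HTTPCookieProcessor(CookieJar()))


# Hypothetical page and field, for illustration only; no network call is made here.
request = build_form_post(
    "https://example.com/index.html", "add_user_data.html", {"name": "alice"}
)
print(request.full_url)
```

To actually send the request you would call make_session_opener().open(request); reusing the same opener for the login and the later steps keeps any session cookie in place.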

As I wrote, Python 2.7 is going to be discontinued. Mechanize was not available for Python 3.x when I used it, so other tools should be used. You can look for mechanize alternatives and see what works for you. The typical answer is Scrapy. That is a somewhat different tool, used more to scrape web pages, so if you plan something bigger, maybe that is the best option. If you need a single script, I would start with httpie: a command-line tool and Python package with good OS X support; you can send forms, and session management is also available. I use it every day, although my server is stateless.

I would be more than happy to provide exact examples, but without server information that is not possible. Can you please attach a dump of your sample session? Anonymize it, and I'll provide a sample, or maybe another tool would be better?

As you do not provide the URL you download from, this is guesswork. The target most likely has a reCAPTCHA-like wall installed to prevent scraping. So make sure you don't hit this reCAPTCHA wall, and if you do, implement code that notifies you to perform a manual task to gain access.
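A rough sketch of such a notification, assuming you already have the page HTML as a string (for example from driver.page_source). The marker strings are a heuristic guess at what a reCAPTCHA page contains, and the function names are made up for illustration; adjust both to what you actually see on the target site.

```python
# Heuristic markers, assumed for illustration: the usual reCAPTCHA widget
# class and script path. Other captcha products need other markers.
CAPTCHA_MARKERS = ("g-recaptcha", "recaptcha/api.js")


def looks_like_captcha_wall(html):
    """Return True if the page source contains one of the known captcha markers."""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)


def check_page(html):
    """Stop the automated run and ask for manual action if a captcha wall is detected."""
    if looks_like_captcha_wall(html):
        raise RuntimeError(
            "Possible captcha wall: solve it manually in a headed browser, then retry."
        )
    return html
```

Raising an exception is only one way to notify; writing to a log or sending a message would work just as well.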

For js this solution was given by zavodnyuk here:

try to set a custom User-Agent with a compatible one (e.g. from your real browser).

capabilities: {
    'browserName': 'chrome',
    chromeOptions: {
        args: [
            "user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36",
            "--headless",
            "--disable-gpu"
        ]
    }
}

worked for selenium/protractor on js

I hope this points you in the right direction, as there is not much written about it for Python on the internet.
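A possible Python counterpart of the JS settings above, as a sketch only. It assumes a Selenium release whose webdriver.Chrome accepts the options= keyword (older releases used chrome_options=); the function names are made up for illustration, and the user-agent string is simply the one from the quoted example, so replace it with the one from your real browser.

```python
# User-agent string copied from the quoted JS example; an assumption, not a recommendation.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"
)


def headless_chrome_arguments(user_agent=USER_AGENT):
    """Return the same three Chrome arguments as the JS capabilities block."""
    return ["user-agent=" + user_agent, "--headless", "--disable-gpu"]


def make_headless_driver(user_agent=USER_AGENT):
    """Start headless Chrome with a custom User-Agent (needs Selenium and chromedriver)."""
    # Imported here so the argument helper above can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in headless_chrome_arguments(user_agent):
        options.add_argument(argument)
    return webdriver.Chrome(options=options)
```

Whether a changed User-Agent fixes the stalled download depends on the server, so treat this as one thing to try rather than a confirmed fix.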

EDIT based on comment1:

In basic debugging mode I rely on print statements at the start of candidate defs. Where I say print statement, it can just as well be a line written to a file. I don't rely on fancy third-party packages, because most of the time I want to learn from the code; this approach is time-consuming but well worth the effort. For example, this is how I bluntly debug:

def header_inspect(self, ID, action, data):
    # Dump the arguments on entry so every call shows up in the output (Python 3 print function).
    print('header_inspect, ID : %s\naction : %s\nprocess-data : %s' % (ID, action, data))
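The same idea can be written once as a decorator that writes to a file, for cases where pasting a print into every def gets tedious. This is a sketch using only the standard library; the decorator name traced, the logger name, the log path and the stand-in function body are all assumptions made for illustration.

```python
import functools
import logging
import os
import tempfile

# Hypothetical log location; any writable path works.
LOG_PATH = os.path.join(tempfile.gettempdir(), "debug_trace.log")

log = logging.getLogger("trace")
log.setLevel(logging.DEBUG)
log.addHandler(logging.FileHandler(LOG_PATH))


def traced(func):
    """Log the function name and its arguments each time func is called."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.debug("%s called, args: %r, kwargs: %r", func.__name__, args, kwargs)
        return func(*args, **kwargs)
    return wrapper


@traced
def header_inspect(ID, action, data):
    # Stand-in body; the real function would process the data.
    return action
```

Each call to a decorated function then appends one line to the log file, which you can follow while the script runs.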

With no specific information, it looks like the only advice we can give you will be about how to understand what is going on.

What about proceeding step by step manually in headed mode for debugging purposes? The bet here is that your problem lies in automating your task rather than in being headless.

Execute your script with all your imports and function definitions (e.g. enable_download_in_headless_chrome), without calling any of them. Actually, do so up to download_dir = # some path here, and then, in the Python shell, type

>>> driver = webdriver.Chrome(chrome_options=chrome_options)

Now interact manually with your browser: open the Chrome DevTools and go to the Console. Make sure that errors will be displayed. Let's continue and type the rest of your commands

>>> enable_download_in_headless_chrome(driver, download_dir)
>>> ...
>>> ok_button.click()

What does it say?
