How to download a file with Python, Selenium and PhantomJS

无人及你 2021-01-06 12:31

Here is my situation: I have to log in to a website and download a CSV file from it, headless, from a Linux server. The page uses JavaScript and does not work without it.

After

4 answers
  • 2021-01-06 13:03

    You can try something like:

    from requests.auth import HTTPBasicAuth
    import requests
    
    url = "http://some_site/files?file=file.csv"  # URL used to download file
    # GET request for the file content, authenticating with your credentials for the site
    r = requests.get(url, auth=HTTPBasicAuth("your_username", "your_password"))
    r.raise_for_status()  # stop here on an HTTP error instead of saving an error page
    # Save the response body to disk; r.content is bytes, so open the file in binary mode
    with open("path/to/folder/to/save/file/filename.csv", 'wb') as my_file:
        my_file.write(r.content)
    
  • 2021-01-06 13:19

    PhantomJS does not support file downloads by itself. If the download button exposes a direct link to the file, you can test the download by fetching that link with Python code. If the button does not expose a file link, you cannot test it this way.

    To test using the file link and Python (to assert that the file exists), you can follow the topic below. I'm a C# developer and tester, so I don't know the best way to write this in Python without errors, but I'm sure you can:

    Basic http file downloading and saving to disk in python?
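
The idea above (fetch the file link directly and save it to disk) could look roughly like the sketch below. It uses only the standard library. The helper names `cookie_header` and `download_to_disk` are made up for illustration, and passing the browser's cookies along is an assumption about how the site keeps you logged in, not something the original answer states.

```python
import shutil
import urllib.request
from pathlib import Path


def cookie_header(cookies):
    """Build a Cookie header value from Selenium-style cookie dicts."""
    return "; ".join("{}={}".format(c["name"], c["value"]) for c in cookies)


def download_to_disk(url, dest, cookies=()):
    """Stream the body of `url` into the file `dest` and return its size in bytes."""
    request = urllib.request.Request(url)
    if cookies:
        # Assumption: the site identifies the logged-in session by its cookies.
        request.add_header("Cookie", cookie_header(cookies))
    dest = Path(dest)
    with urllib.request.urlopen(request) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)  # copies in chunks, so large files are fine
    return dest.stat().st_size
```

With a Selenium session that is already logged in, a call might look like `download_to_disk(link.get_attribute("href"), "file.csv", driver.get_cookies())`, where `link` is the located download element. Asserting that the returned size is greater than zero is one simple way to check that the file exists.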

  • 2021-01-06 13:20

    I recently used Selenium with ChromeDriver to download a file from the web. This works because Chrome automatically downloads the file and stores it in the Downloads folder for you. This was easier than using PhantomJS.

    I recommend looking into using ChromeDriver with Selenium and going that route: https://github.com/SeleniumHQ/selenium/wiki/ChromeDriver

    EDIT - As pointed out below, I neglected to point to how to set up ChromeDriver to run in headless mode. Here's more info: http://www.chrisle.me/2013/08/running-headless-selenium-with-chrome/

    Or: https://gist.github.com/chuckbutler/8030755
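
As a rough sketch of what such a setup might look like today, assuming Selenium 4 and a recent Chrome (neither is stated in the answer above): the function names are hypothetical, and whether downloads are permitted in headless mode depends on the Chrome version, so treat this as a starting point only.

```python
import os


def download_prefs(dl_dir):
    """Chrome preferences that send downloads to `dl_dir` without a save dialog."""
    return {
        "download.default_directory": os.path.abspath(dl_dir),
        "download.prompt_for_download": False,
    }


def make_headless_chrome(dl_dir):
    """Start headless Chrome through Selenium 4 (assumed installed) with the prefs above."""
    # Imported here so download_prefs() can be used without Selenium present.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    # Assumption: a recent Chrome; older builds use plain "--headless".
    options.add_argument("--headless=new")
    options.add_argument("--window-size=800,600")
    options.add_experimental_option("prefs", download_prefs(dl_dir))
    return webdriver.Chrome(options=options)
```

Chrome expects an absolute path for `download.default_directory`, which is why the helper resolves the folder before passing it on.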

  • 2021-01-06 13:21

    I found a solution and wanted to share it. One requirement changed: I am no longer using PhantomJS but ChromeDriver, which runs headlessly with a virtual framebuffer. The result is the same and it gets the job done.


    What you need is:

    pip install selenium pyvirtualdisplay

    apt-get install xvfb

    Download ChromeDriver


    I use Python 3.5 and a test file from ovh.net that is linked with an <a> tag instead of a button. The script waits for the <a> element to be present on the page, then clicks it. If you don't wait for the element and are on an async site, the element you try to click might not be there yet. The download location is a folder relative to the directory the script is run from. The script checks that folder once per second to see whether the file has been downloaded. If I am not wrong, files should be .part during the download, and as soon as the file becomes the .dat specified in filename, the script finishes. If you close the virtual framebuffer and the driver before that, the download will not complete. The complete script looks like this:

    #!/usr/bin/python
    # coding: utf-8
    
    import os
    import sys
    import time
    from pyvirtualdisplay import Display
    from selenium import webdriver
    from selenium.webdriver.support.wait import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By
    import glob
    
    
    def main(argv):
        url = 'http://ovh.net/files'
        dl_dir = 'downloads'
        filename = '1Mio.dat'
    
        display = Display(visible=0, size=(800, 600))
        display.start()
    
        chrome_options = webdriver.ChromeOptions()
        dl_location = os.path.join(os.getcwd(), dl_dir)
    
        prefs = {"download.default_directory": dl_location}
        chrome_options.add_experimental_option("prefs", prefs)
        chromedriver = "./chromedriver"
        # Selenium 3 signature; Selenium 4 takes service=Service(chromedriver) and options= instead
        driver = webdriver.Chrome(executable_path=chromedriver, chrome_options=chrome_options)
    
        driver.set_window_size(800, 600)
        driver.get(url)
        # Wait up to 30 s for the link to appear, then click the element the wait returns
        link_xpath = '//a[@href="' + filename + '"]'
        hyperlink = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, link_xpath)))
        hyperlink.click()
    
        while not(glob.glob(os.path.join(dl_location, filename))):
            time.sleep(1)
    
        driver.quit()  # quit() also shuts down the chromedriver process; close() only closes the window
        display.stop()
    
    if __name__ == '__main__':
        main(sys.argv)
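
The polling loop in the script never gives up, so a failed download would leave it waiting forever. A minimal variant with a timeout might look like this. The function name `wait_for_download` is made up for illustration, and the partial-file suffixes are assumptions about browser behaviour (Chrome typically uses `.crdownload`, Firefox `.part`).

```python
import os
import time


def wait_for_download(path, timeout=60.0, poll=1.0):
    """Return True once `path` exists with no partial file beside it, False on timeout."""
    # Assumption: an in-progress download sits next to the target under one of these names.
    partials = [path + ".crdownload", path + ".part"]
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path) and not any(os.path.exists(p) for p in partials):
            return True
        time.sleep(poll)
    return False
```

In the script above, this could replace the `while` loop, for example `wait_for_download(os.path.join(dl_location, filename), timeout=120)`, with the driver and display shut down only after it returns.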
    

    I hope this helps someone in the future.
