Downloading and unzipping a .zip file without writing to disk

囚心锁ツ 2020-12-02 04:42

I have managed to get my first Python script to work. It downloads a list of .zip files from a URL, then extracts them and writes the contents to disk.

9 answers
  • 2020-12-02 05:14

    I'd like to offer an updated Python 3 version of Vishal's excellent answer, which was using Python 2, along with some explanation of the adaptations / changes, which may have been already mentioned.

    from io import BytesIO
    from zipfile import ZipFile
    import urllib.request
        
    response = urllib.request.urlopen("http://www.unece.org/fileadmin/DAM/cefact/locode/loc162txt.zip")

    with ZipFile(BytesIO(response.read())) as my_zip_file:
        for contained_file in my_zip_file.namelist():
            # with open(("unzipped_and_read_" + contained_file + ".file"), "wb") as output:
            with my_zip_file.open(contained_file) as member:
                for line in member:
                    print(line)
                    # output.write(line)
    

    Necessary changes:

    • There's no StringIO module in Python 3 (it has moved to io.StringIO). I use io.BytesIO instead, because we are handling a byte stream (see the docs, and also this thread).
    • urlopen:
      • "The legacy urllib.urlopen function from Python 2.6 and earlier has been discontinued; urllib.request.urlopen() corresponds to the old urllib2.urlopen.", Docs and this thread.

    Note:

    • In Python 3, the printed output lines will look like so: b'some text'. This is expected, as they aren't strings - remember, we're reading a bytestream. Have a look at Dan04's excellent answer.
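
    If you want str lines rather than b'...' output, one option is to wrap the archive member in io.TextIOWrapper. The following is a minimal sketch, not part of the original answer: the helper name read_text_lines, the UTF-8 encoding and the tiny in-memory archive (a stand-in for the downloaded bytes) are all assumptions made for illustration.

```python
from io import BytesIO, TextIOWrapper
from zipfile import ZipFile


def read_text_lines(zip_bytes, encoding="utf-8"):
    """Return {member name: list of decoded lines} for an in-memory zip.

    Hypothetical helper; the encoding is an assumption about the data.
    """
    result = {}
    with ZipFile(BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            with archive.open(name) as member:
                # TextIOWrapper decodes the byte stream into str lines
                text = TextIOWrapper(member, encoding=encoding)
                result[name] = [line.rstrip("\n") for line in text]
    return result


# Build a small archive in memory as a stand-in for the downloaded bytes
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("hello.txt", "first line\nsecond line\n")

lines = read_text_lines(buffer.getvalue())
print(lines)
```

    With real data, the right encoding depends on the files inside the archive, so UTF-8 may need to be swapped for something else.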

    A few minor changes I made:

    • I use with ... as instead of zipfile = ... according to the Docs.
    • The script now uses .namelist() to cycle through all the files in the zip and print their contents.
    • I moved the creation of the ZipFile object into the with statement, although I'm not sure if that's better.
    • I added (and commented out) an option to write the bytestream to file (per file in the zip), in response to NumenorForLife's comment; it adds "unzipped_and_read_" to the beginning of the filename and a ".file" extension (I prefer not to use ".txt" for files with bytestrings). The indenting of the code will, of course, need to be adjusted if you want to use it.
      • Need to be careful here -- because we have a byte string, we use binary mode, so "wb"; I have a feeling that writing binary opens a can of worms anyway...
    • I am using an example file, the UN/LOCODE text archive.
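
    The commented-out write option can also be pulled out into a small helper. This is only a sketch under assumptions: write_members is a hypothetical name, the demo archive is built in memory instead of being downloaded, and the target is a temporary directory. It keeps the "unzipped_and_read_" prefix, the ".file" extension and the "wb" mode described above, and copies the bytes unchanged.

```python
import os
import shutil
import tempfile
from io import BytesIO
from zipfile import ZipFile


def write_members(zip_bytes, target_dir, prefix="unzipped_and_read_"):
    """Write every file in the archive to target_dir; return the paths."""
    written = []
    with ZipFile(BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries have no content to write
            # basename drops any directory part stored in the archive
            out_name = prefix + os.path.basename(info.filename) + ".file"
            out_path = os.path.join(target_dir, out_name)
            # "wb" because the archive member yields bytes, not str
            with archive.open(info) as member, open(out_path, "wb") as output:
                shutil.copyfileobj(member, output)
            written.append(out_path)
    return written


# Stand-in archive, built in memory for the demo
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("data/sample.txt", b"raw \x00 bytes")

with tempfile.TemporaryDirectory() as tmp:
    paths = write_members(buffer.getvalue(), tmp)
    names = [os.path.basename(p) for p in paths]
    with open(paths[0], "rb") as check:
        round_trip = check.read()

print(names, round_trip)
```

    Binary mode is not a problem as long as nothing tries to decode the data: the bytes on disk are exactly the bytes stored in the archive.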

    What I didn't do:

    • NumenorForLife asked about saving the zip to disk. I'm not sure what he meant by it -- downloading the zip file? That's a different task; see Oleh Prypin's excellent answer.

    Here's a way:

    import urllib.request
    import shutil
    
    with urllib.request.urlopen("http://www.unece.org/fileadmin/DAM/cefact/locode/2015-2_UNLOCODE_SecretariatNotes.pdf") as response, open("downloaded_file.pdf", 'wb') as out_file:
        shutil.copyfileobj(response, out_file)
    
  • 2020-12-02 05:15

    Write to a temporary file that resides in RAM.

    It turns out the tempfile module (http://docs.python.org/library/tempfile.html) has just the thing:

    tempfile.SpooledTemporaryFile([max_size=0[, mode='w+b'[, bufsize=-1[, suffix=''[, prefix='tmp'[, dir=None]]]]]])

    This function operates exactly as TemporaryFile() does, except that data is spooled in memory until the file size exceeds max_size, or until the file’s fileno() method is called, at which point the contents are written to disk and operation proceeds as with TemporaryFile().

    The resulting file has one additional method, rollover(), which causes the file to roll over to an on-disk file regardless of its size.

    The returned object is a file-like object whose _file attribute is either a StringIO object or a true file object, depending on whether rollover() has been called. This file-like object can be used in a with statement, just like a normal file.

    New in version 2.6.

    Or, if you're lazy and have a tmpfs-mounted /tmp on Linux, you can just create a file there, but then you have to name it and delete it yourself.
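
    A rough sketch of the SpooledTemporaryFile approach in Python 3 follows. The 1 MiB max_size, the 64-byte chunk size and the in-memory source (standing in for a network response) are assumptions chosen for the demo. The check on _file relies on the behaviour quoted above; in Python 3 the in-memory object is an io.BytesIO for binary mode.

```python
import tempfile
from io import BytesIO
from zipfile import ZipFile

# Stand-in for a network response: a zip archive held in a BytesIO
source = BytesIO()
with ZipFile(source, "w") as archive:
    archive.writestr("a.txt", "alpha")
    archive.writestr("b.txt", "beta")
source.seek(0)

# Data stays in memory unless it grows beyond max_size (assumed: 1 MiB)
with tempfile.SpooledTemporaryFile(max_size=1024 * 1024) as spool:
    # Copy in chunks, as one would when streaming a download
    for chunk in iter(lambda: source.read(64), b""):
        spool.write(chunk)
    spool.seek(0)

    # Before rollover, the underlying object is an in-memory buffer
    stayed_in_memory = isinstance(spool._file, BytesIO)

    with ZipFile(spool) as archive:
        names = archive.namelist()

print(names, stayed_in_memory)
```

    The sketch only lists the member names. As far as I know, reading member contents through a ZipFile opened directly on the spool needs Python 3.11 or newer, where SpooledTemporaryFile gained the full io interface (including seekable()); on older versions, a plain BytesIO is the simpler choice.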

  • 2020-12-02 05:15

    Vishal's example, however great, is confusing when it comes to the file name, and I do not see the merit of redefining 'zipfile'.

    Here is my example that downloads a zip that contains some files, one of which is a csv file that I subsequently read into a pandas DataFrame:

    from StringIO import StringIO
    from zipfile import ZipFile
    from urllib import urlopen
    import pandas
    
    url = urlopen("https://www.federalreserve.gov/apps/mdrm/pdf/MDRM.zip")
    zf = ZipFile(StringIO(url.read()))
    for item in zf.namelist():
        print("File in zip: "+  item)
    # find the first matching csv file in the zip:
    match = [s for s in zf.namelist() if ".csv" in s][0]
    # the first line of the file contains a string - that line shall be ignored, hence skiprows
    df = pandas.read_csv(zf.open(match), low_memory=False, skiprows=[0])
    

    (Note, I use Python 2.7.13)

    This is the exact solution that worked for me. I just tweaked it a little for Python 3 by replacing StringIO with io.BytesIO and fetching the archive with requests.

    Python 3 Version

    from io import BytesIO
    from zipfile import ZipFile
    import pandas
    import requests
    
    url = "https://www.nseindia.com/content/indices/mcwb_jun19.zip"
    content = requests.get(url)
    zf = ZipFile(BytesIO(content.content))
    
    for item in zf.namelist():
        print("File in zip: "+  item)
    
    # find the first matching csv file in the zip:
    match = [s for s in zf.namelist() if ".csv" in s][0]
    # the first line of the file contains a string - that line shall be ignored, hence skiprows
    df = pandas.read_csv(zf.open(match), low_memory=False, skiprows=[0])
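
    For comparison, here is a stdlib-only sketch of the same idea (find the first CSV in the archive, skip its first line) that uses the csv module instead of pandas. The helper name first_csv_rows, the UTF-8 encoding and the demo archive are assumptions. It matches names with endswith(".csv") rather than ".csv" in s, and raises a clear error when no CSV is present instead of failing with an IndexError.

```python
import csv
from io import BytesIO, TextIOWrapper
from zipfile import ZipFile


def first_csv_rows(zip_bytes, skip_rows=1, encoding="utf-8"):
    """Return (member name, parsed rows) for the first .csv in the archive."""
    with ZipFile(BytesIO(zip_bytes)) as archive:
        # endswith avoids matching names such as "notes.csv.bak"
        match = next(
            (n for n in archive.namelist() if n.lower().endswith(".csv")),
            None,
        )
        if match is None:
            raise ValueError("no .csv member found in archive")
        with archive.open(match) as member:
            # newline="" lets the csv module handle line endings itself
            text = TextIOWrapper(member, encoding=encoding, newline="")
            rows = list(csv.reader(text))
    return match, rows[skip_rows:]


# Stand-in archive: one text file and one CSV with a title line on top
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("readme.txt", "not a table")
    archive.writestr("table.csv", "title line\nid,name\n1,foo\n")

match, rows = first_csv_rows(buffer.getvalue())
print(match, rows)
```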
    
  • 2020-12-02 05:20

    Adding on to the other answers using requests:

    import requests
    from io import BytesIO
    from zipfile import ZipFile

    # download from the web
    url = 'http://mlg.ucd.ie/files/datasets/bbc.zip'
    content = requests.get(url)

    # unzip the content
    f = ZipFile(BytesIO(content.content))
    print(f.namelist())

    # outputs ['bbc.classes', 'bbc.docs', 'bbc.mtx', 'bbc.terms']
    

    Use help(f) for details on the other methods, e.g. extractall(), which extracts the archive's contents to disk so that they can later be opened with open().
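
    If the goal is to avoid the disk entirely, the archive can also be inspected and read without extractall(). The sketch below uses a small in-memory archive as a stand-in for content.content; the member names are borrowed from the output above, but their contents are made up for the demo.

```python
from io import BytesIO
from zipfile import ZipFile

# Stand-in for the downloaded bytes (contents invented for illustration)
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("bbc.classes", "x" * 10)
    archive.writestr("bbc.terms", "term\n" * 3)

with ZipFile(BytesIO(buffer.getvalue())) as f:
    # infolist() exposes per-member metadata such as the uncompressed size
    sizes = {info.filename: info.file_size for info in f.infolist()}
    # testzip() returns the first corrupt member name, or None if all are OK
    first_bad = f.testzip()
    # read() returns one member's full contents as bytes
    terms = f.read("bbc.terms").decode("utf-8").splitlines()

print(sizes, first_bad, terms)
```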

  • 2020-12-02 05:23

    I'd like to add my Python 3 answer for completeness:

    from io import BytesIO
    from zipfile import ZipFile
    import requests
    
    def get_zip(file_url):
        url = requests.get(file_url)
        zipfile = ZipFile(BytesIO(url.content))
        zip_names = zipfile.namelist()
        if len(zip_names) == 1:
            file_name = zip_names.pop()
            extracted_file = zipfile.open(file_name)
            return extracted_file
        return [zipfile.open(file_name) for file_name in zip_names]
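
    One thing to be aware of: the function above returns a single file object for one-member archives and a list otherwise, so callers have to handle both cases. A possible variant that always returns the same type is sketched here. It is an assumption-laden alternative, not the answer's method: get_zip_members is a hypothetical name, it swaps requests for urllib.request.urlopen to stay within the standard library, and the 30-second timeout is arbitrary.

```python
from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile


def get_zip_members(file_url, timeout=30):
    """Download a zip archive and return {member name: bytes}, in memory.

    Hypothetical helper; always returns a dict, whatever the member count.
    """
    with urlopen(file_url, timeout=timeout) as response:
        payload = response.read()
    with ZipFile(BytesIO(payload)) as archive:
        return {
            info.filename: archive.read(info)
            for info in archive.infolist()
            if not info.is_dir()  # skip directory entries
        }


# Usage (needs a reachable URL, so it is left commented out):
# members = get_zip_members("http://mlg.ucd.ie/files/datasets/bbc.zip")
# print(sorted(members))
```

    Returning bytes instead of open file handles also means the ZipFile can be closed before the function returns.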
    
  • 2020-12-02 05:26

    It wasn't obvious in Vishal's answer what the file name was supposed to be in cases where there is no file on disk. I've modified his answer so that it works as-is for most needs.

    from StringIO import StringIO
    from zipfile import ZipFile
    from urllib import urlopen
    
    def unzip_string(zipped_string):
        unzipped_string = ''
        zipfile = ZipFile(StringIO(zipped_string))
        for name in zipfile.namelist():
            unzipped_string += zipfile.open(name).read()
        return unzipped_string
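
    The code above is Python 2 (StringIO module, str concatenation). A minimal Python 3 sketch of the same function might look like this; unzip_bytes is a hypothetical name, and it works on bytes throughout because zipfile returns bytes in Python 3.

```python
from io import BytesIO
from zipfile import ZipFile


def unzip_bytes(zipped_bytes):
    """Concatenate the contents of every member of an in-memory zip."""
    with ZipFile(BytesIO(zipped_bytes)) as archive:
        # join() avoids repeated bytes concatenation in a loop
        return b"".join(archive.read(name) for name in archive.namelist())


# Stand-in archive built in memory for the demo
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("a.txt", "foo")
    archive.writestr("b.txt", "bar")

combined = unzip_bytes(buffer.getvalue())
print(combined)
```

    Call combined.decode(...) with a suitable encoding if text is needed.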
    