Python unzipping stream of bytes?

时光取名叫无心 · 2020-11-27 15:34

Here is the situation:

  • I get gzipped xml documents from Amazon S3

    import boto
    from boto.s3.connection import S3Connection
    from boto.s3.key import Key
    
    conn = S3Connection('<access id>', '<secret key>')
    k = conn.get_bucket('<bucket>').get_key('<key name>')
    
4 Answers
  • 2020-11-27 16:21

    I had to do the same thing and this is how I did it:

    import gzip
    import StringIO  # Python 2; on Python 3 use io.BytesIO instead
    
    f = StringIO.StringIO()
    k.get_file(f)  # download the key's contents into the in-memory buffer
    f.seek(0)      # rewind before reading -- this is crucial
    gzf = gzip.GzipFile(fileobj=f)
    file_content = gzf.read()
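
    For Python 3, a minimal equivalent sketch using io.BytesIO in place of StringIO (assuming the same boto key object k as above):

    import gzip
    import io

    f = io.BytesIO()  # binary in-memory buffer (Python 3)
    k.get_file(f)     # download the key's contents into the buffer
    f.seek(0)         # rewind before reading -- still crucial
    with gzip.GzipFile(fileobj=f) as gzf:
        file_content = gzf.read()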
    
  • 2020-11-27 16:22

    For Python 3.x and boto3:
    
    I used BytesIO to read the compressed file into a buffer object, then used zipfile to open that buffer as an archive, which let me read the data line by line.

    import io
    import zipfile
    import boto3
    import sys
    
    s3 = boto3.resource('s3', 'us-east-1')
    
    
    def stream_zip_file():
        obj = s3.Object(
            bucket_name='MonkeyBusiness',
            key='/Daily/Business/Banana/{current-date}/banana.zip'
        )
        # read the whole compressed object into an in-memory buffer
        buffer = io.BytesIO(obj.get()["Body"].read())
        z = zipfile.ZipFile(buffer)
        # open the first archive member as a file-like stream
        member = z.open(z.infolist()[0])
        print(sys.getsizeof(member))  # size of the stream object, not the data
        line_counter = 0
        for _ in member:
            line_counter += 1
        print(line_counter)
        z.close()
    
    
    if __name__ == '__main__':
        stream_zip_file()
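
    If the archive member contains text, a small follow-up sketch (assuming UTF-8 content and the same z object before it is closed) wraps the binary stream so that iteration yields decoded lines:

    import io

    with z.open(z.infolist()[0]) as member:
        # TextIOWrapper decodes the raw bytes on the fly
        for line in io.TextIOWrapper(member, encoding='utf-8'):
            print(line.rstrip('\n'))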
    
  • 2020-11-27 16:23

    You can try a pipe and read the decompressed contents without writing them to a file:
    
        import subprocess
        
        # zcat decompresses the file and writes the result to stdout
        c = subprocess.Popen(['zcat', '-c', '<gzip file name>'],
                             stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        for row in c.stdout:
            print(row)
    

    In addition, "/dev/fd/" + str(c.stdout.fileno()) gives you the name of a FIFO (named pipe) that can be passed to another program, as sketched below.
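
    A hedged sketch of that trick on Linux (the file name and the downstream command wc are placeholders):

        import subprocess

        zcat = subprocess.Popen(['zcat', '-c', '<gzip file name>'],
                                stdout=subprocess.PIPE)
        fifo = '/dev/fd/' + str(zcat.stdout.fileno())
        # pass_fds keeps the pipe's descriptor open in the child process
        result = subprocess.run(['wc', '-l', fifo],
                                pass_fds=(zcat.stdout.fileno(),),
                                capture_output=True)
        print(result.stdout)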

  • 2020-11-27 16:36

    Yes, you can use the zlib module to decompress byte streams:

    import zlib
    
    def stream_gzip_decompress(stream):
        # 32 + MAX_WBITS tells zlib to detect and skip the gzip header
        dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
        for chunk in stream:
            rv = dec.decompress(chunk)
            if rv:
                yield rv
        # flush whatever is still buffered in the decompressor
        rv = dec.flush()
        if rv:
            yield rv
    

    Adding 32 to MAX_WBITS tells zlib to expect a gzip header and to skip past it while decompressing.

    The S3 key object is an iterator, so you can do:

    for data in stream_gzip_decompress(k):
        ...  # do something with the decompressed data
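
    If you are on boto3 instead of boto, a minimal sketch (bucket and key names are placeholders) feeds the same generator from the streaming body in fixed-size chunks:

    import boto3

    s3 = boto3.client('s3')
    body = s3.get_object(Bucket='my-bucket', Key='document.xml.gz')['Body']
    # iter_chunks yields raw bytes without loading the whole object into memory
    for data in stream_gzip_decompress(body.iter_chunks(chunk_size=64 * 1024)):
        ...  # process the decompressed bytes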
    