Python: Creating a streaming gzip'd file-like?

面向向阳花 2020-12-24 07:06

I'm trying to figure out the best way to compress a stream with Python's zlib.

I've got a file-like input stream (input, below) and an o

5 Answers
  • 2020-12-24 07:46

    Here is a cleaner, non-self-referencing version based on Ricardo Cárdenes' very helpful answer.

    from gzip import GzipFile
    from collections import deque
    
    
    CHUNK = 16 * 1024
    
    
    class Buffer (object):
        def __init__ (self):
            self.__buf = deque()
            self.__size = 0
        def __len__ (self):
            return self.__size
        def write (self, data):
            self.__buf.append(data)
            self.__size += len(data)
        def read (self, size=-1):
            if size < 0: size = self.__size
            ret_list = []
            while size > 0 and len(self.__buf):
                s = self.__buf.popleft()
                size -= len(s)
                ret_list.append(s)
            if size < 0:
                ret_list[-1], remainder = ret_list[-1][:size], ret_list[-1][size:]
                self.__buf.appendleft(remainder)
            # GzipFile writes bytes, so join with a bytes separator
            # (b'' is also valid on Python 2.6+).
            ret = b''.join(ret_list)
            self.__size -= len(ret)
            return ret
        def flush (self):
            pass
        def close (self):
            pass
    
    
    class GzipCompressReadStream (object):
        def __init__ (self, fileobj):
            self.__input = fileobj
            self.__buf = Buffer()
            self.__gzip = GzipFile(None, mode='wb', fileobj=self.__buf)
        def read (self, size=-1):
            while size < 0 or len(self.__buf) < size:
                s = self.__input.read(CHUNK)
                if not s:
                    self.__gzip.close()
                    break
                self.__gzip.write(s)
            return self.__buf.read(size)
    

    Advantages:

    • Avoids repeated string concatenation, which would cause the entire string to be copied repeatedly.
    • Reads a fixed CHUNK size from the input stream, instead of reading whole lines at a time (which can be arbitrarily long).
    • Avoids circular references.
    • Avoids exposing a misleading public "write" method on the stream class; that method is really only used internally by GzipFile.
    • Takes advantage of name mangling for internal member variables.
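    As a point of comparison, the same chunked-read idea can be sketched on Python 3 with zlib directly rather than GzipFile. This is a minimal sketch, not the answer's code: the class name ZlibGzipReadStream is hypothetical, and it relies on passing wbits = zlib.MAX_WBITS | 16 so that zlib emits gzip framing.

```python
import gzip
import io
import zlib

CHUNK = 16 * 1024  # same chunk size as the answer above


class ZlibGzipReadStream:
    """Hypothetical read-only wrapper: read() returns gzip-compressed bytes."""

    def __init__(self, fileobj, level=6):
        self._input = fileobj
        # wbits = MAX_WBITS | 16 asks zlib to emit a gzip header and trailer.
        self._compressor = zlib.compressobj(level, zlib.DEFLATED, zlib.MAX_WBITS | 16)
        self._buf = bytearray()
        self._eof = False

    def read(self, size=-1):
        # Pull input until enough compressed bytes are buffered or input ends.
        while not self._eof and (size < 0 or len(self._buf) < size):
            chunk = self._input.read(CHUNK)
            if chunk:
                self._buf += self._compressor.compress(chunk)
            else:
                # End of input: flush() emits the remaining data and the trailer.
                self._buf += self._compressor.flush()
                self._eof = True
        if size < 0:
            size = len(self._buf)
        out = bytes(self._buf[:size])
        del self._buf[:size]
        return out


# Usage: read the compressed stream in 4096-byte pieces until it is exhausted.
data = b"I'm a lumberjack and I'm OK. " * 10000
stream = ZlibGzipReadStream(io.BytesIO(data))
compressed = b"".join(iter(lambda: stream.read(4096), b""))
```

    The result can be checked by passing it to gzip.decompress and comparing with the input.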
  • 2020-12-24 07:51

    It's quite kludgy (self-referencing, etc.; I only spent a few minutes writing it, so nothing really elegant), but it does what you want if you're still interested in using gzip instead of zlib directly.

    Basically, GzipWrap is a (very limited) file-like object that produces a gzipped file out of a given iterable (e.g., a file-like object, a list of strings, any generator...)

    Of course, it produces binary so there was no sense in implementing "readline".

    You should be able to expand it to cover other cases or to be used as an iterable object itself.

    from gzip import GzipFile
    
    class GzipWrap(object):
        # input is a filelike object that feeds the input
        def __init__(self, input, filename = None):
            self.input = input
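            self.buffer = b''  # bytes: GzipFile writes binary data
            self.zipper = GzipFile(filename, mode = 'wb', fileobj = self)
    
        def read(self, size=-1):
            if (size < 0) or len(self.buffer) < size:
                for s in self.input:
                    self.zipper.write(s)
                    if size > 0 and len(self.buffer) >= size:
                        self.zipper.flush()
                        break
                else:
                    # Input exhausted: closing the GzipFile writes the gzip trailer.
                    self.zipper.close()
            # Hand back what is buffered whether or not the loop above ran;
            # previously ret was left unassigned when size > 0 and the buffer was short.
            if size < 0:
                ret, self.buffer = self.buffer, b''
            else:
                ret, self.buffer = self.buffer[:size], self.buffer[size:]
            return ret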
    
        def flush(self):
            pass
    
        def write(self, data):
            self.buffer += data
    
        def close(self):
            self.input.close()
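
    The suggestion to use it as an iterable could be sketched as a generator. This is only an illustration under assumptions: the function name gzip_chunks is hypothetical, it targets Python 3, and it swaps GzipFile for zlib.compressobj with gzip framing (wbits = zlib.MAX_WBITS | 16), so no self-referencing file object is needed.

```python
import gzip
import zlib


def gzip_chunks(iterable, level=6):
    """Yield gzip-compressed bytes for an iterable of bytes objects."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, zlib.MAX_WBITS | 16)
    for piece in iterable:
        out = compressor.compress(piece)
        if out:  # compress() may return b"" while zlib buffers input
            yield out
    # flush() emits whatever is still buffered plus the gzip trailer.
    yield compressor.flush()


lines = [b"first line\n", b"second line\n", b"third line\n"]
compressed = b"".join(gzip_chunks(lines))
```

    Any iterable of bytes works as input, including a file opened in binary mode.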
    
  • 2020-12-24 07:52

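    This works (at least in Python 3):

    import gzip

    # Assumes s3 is an s3fs filesystem object and that path and filename
    # are defined elsewhere.
    with s3.open(path, 'wb') as f:
        gz = gzip.GzipFile(filename, 'wb', 9, f)
        gz.write(b'hello')
        gz.flush()
        gz.close()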
    

    This writes gzip-compressed data to the file object returned by s3fs. The key is the f argument, which GzipFile receives as its fileobj. The filename argument is stored in the name field of the gzip header.
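
    The same pattern can be tried without S3. The sketch below assumes an io.BytesIO as a stand-in for the s3fs file object, and "hello.txt" is a made-up name used only for the header field.

```python
import gzip
import io

sink = io.BytesIO()  # stand-in for any writable binary file object

# Leaving the with-block closes the GzipFile, which writes the gzip trailer.
# It does not close `sink`, because the caller supplied it via fileobj.
with gzip.GzipFile(filename="hello.txt", mode="wb", compresslevel=9, fileobj=sink) as gz:
    gz.write(b"hello")

compressed = sink.getvalue()
```

    Using GzipFile as a context manager makes the explicit flush() and close() calls unnecessary.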

  • 2020-12-24 07:53

    Use the cStringIO (or StringIO) module in conjunction with zlib (these modules are Python 2 only; on Python 3 the in-memory binary buffer is io.BytesIO):

    >>> import zlib
    >>> from cStringIO import StringIO
    >>> s = StringIO()
    >>> s.write(zlib.compress("I'm a lumberjack"))
    >>> s.seek(0)
    >>> zlib.decompress(s.read())
    "I'm a lumberjack"
    
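    A Python 3 version of the same round trip might look like the sketch below, with io.BytesIO taking the place of cStringIO. Note that zlib.compress produces zlib-format data, not a gzip file.

```python
import io
import zlib

s = io.BytesIO()
s.write(zlib.compress(b"I'm a lumberjack"))  # bytes in, bytes out on Python 3
s.seek(0)
restored = zlib.decompress(s.read())
```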
  • 2020-12-24 07:59

    The gzip module supports compressing to a file-like object: pass a fileobj parameter to GzipFile, along with a filename. The filename you pass in doesn't need to exist, but the gzip header has a filename field which needs to be filled out.

    Update

    This does not work when the file object is a non-seekable stream such as stdin (Python 2.7). Example:

    # tmp/try-gzip.py 
    import sys
    import gzip
    
    fd=gzip.GzipFile(fileobj=sys.stdin)
    sys.stdout.write(fd.read())
    

    output:

    ===> cat .bash_history  | python tmp/try-gzip.py  > tmp/history.gzip
    Traceback (most recent call last):
      File "tmp/try-gzip.py", line 7, in <module>
        sys.stdout.write(fd.read())
      File "/usr/lib/python2.7/gzip.py", line 254, in read
        self._read(readsize)
      File "/usr/lib/python2.7/gzip.py", line 288, in _read
        pos = self.fileobj.tell()   # Save current position
    IOError: [Errno 29] Illegal seek
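
    The traceback shows the Python 2.7 reader calling tell() on the pipe. One way to read gzip data from a stream that only offers read() is zlib.decompressobj with gzip framing. The following is a sketch under assumptions: it targets Python 3, the NonSeekable class and gunzip_stream function are hypothetical names, and it handles a single gzip member only.

```python
import gzip
import zlib


class NonSeekable:
    """Hypothetical stand-in for a pipe: it offers read() only."""

    def __init__(self, data):
        self._data = data
        self._pos = 0

    def read(self, size):
        chunk = self._data[self._pos:self._pos + size]
        self._pos += len(chunk)
        return chunk


def gunzip_stream(fileobj, chunk_size=16 * 1024):
    """Yield decompressed bytes from a gzip stream without seek() or tell()."""
    # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield decompressor.decompress(chunk)
    yield decompressor.flush()


original = b"history line\n" * 1000
pipe = NonSeekable(gzip.compress(original))
restored = b"".join(gunzip_stream(pipe, chunk_size=64))
```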
    