Concatenate multiple zlib compressed data streams into a single stream efficiently

孤城傲影 2020-12-20 18:59

If I have several binary strings with compressed zlib data, is there a way to efficiently combine them into a single compressed string without decompressing everyth

3 answers
  • 2020-12-20 19:32

    In addition to gzjoin, which requires decompressing the first deflate stream, you can take a look at gzlog.h and gzlog.c, which efficiently append short strings to a gzip file without having to decompress the deflate stream each time. (The code can easily be modified to operate on zlib-wrapped deflate data instead of gzip-wrapped deflate data.) Use this approach if you control the creation of the first deflate stream. If you do not create the first deflate stream, you have to use the gzjoin approach, which requires decompression.

    Neither approach requires recompression.

  • 2020-12-20 19:33

    Since you don't mind venturing into C, you can start by looking at the code for gzjoin.

    Note that gzjoin has to decompress the streams to find the parts that must change when they are merged, but it does not have to recompress anything. That's not too bad, because decompression is typically faster than compression.
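
    To get a feel for the speed difference on your own machine, a rough timing sketch like the one below can help. The payload is a made-up sample and the numbers will vary with data, compression level, and hardware, so treat it as an illustration rather than a benchmark.

```python
import time
import zlib

# Hypothetical sample payload: repetitive text, so it compresses well.
data = b'The quick brown fox jumped over the lazy dog. ' * 50_000

start = time.perf_counter()
compressed = zlib.compress(data)
compress_seconds = time.perf_counter() - start

start = time.perf_counter()
restored = zlib.decompress(compressed)
decompress_seconds = time.perf_counter() - start

# Report both timings; no particular ratio is guaranteed.
print(f'compress:   {compress_seconds:.4f} s')
print(f'decompress: {decompress_seconds:.4f} s')
```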

  • 2020-12-20 19:46

    I'm just turning @zorlak's comment into an answer and adding some code so I can find it later.

    If you can control the initial compression of your streams, you can store the length of the uncompressed data, its Adler-32 checksum, and the compressed data somewhere. Later you can then concatenate the individual streams in an arbitrary order.

    Note that I am not sure if the individual streams can have different compression levels, compression strategies, or window sizes since the concatenate function strips the zlib header of all but the first stream...
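
    Before the concatenation code itself, a small side sketch for the concern just raised. It decodes the two-byte zlib header (CMF and FLG, as laid out in RFC 1950) so you can at least see whether your streams were written with the same window size. The names ZlibHeader and parse_zlib_header are hypothetical helpers introduced here; the sketch only reports the fields and does not settle whether a mismatch is harmful.

```python
import zlib
from typing import NamedTuple


class ZlibHeader(NamedTuple):
    method: int        # CM: 8 means deflate
    window_size: int   # LZ77 window size in bytes
    has_dict: bool     # FDICT: a preset dictionary id follows the header
    level_hint: int    # FLEVEL: 0 (fastest) to 3 (maximum compression)


def parse_zlib_header(stream: bytes) -> ZlibHeader:
    """Decode the two-byte zlib header (CMF, FLG) described in RFC 1950."""
    if len(stream) < 2:
        raise ValueError('a zlib stream starts with a two-byte header')
    cmf, flg = stream[0], stream[1]
    # CMF and FLG, read as a 16-bit big-endian number, must be a multiple of 31.
    if ((cmf << 8) | flg) % 31 != 0:
        raise ValueError('header check bits (FCHECK) are invalid')
    return ZlibHeader(
        method=cmf & 0x0F,
        window_size=1 << ((cmf >> 4) + 8),
        has_dict=bool(flg & 0x20),
        level_hint=flg >> 6,
    )


default = parse_zlib_header(zlib.compress(b'Hello World! '))
best = parse_zlib_header(zlib.compress(b'Hello World! ', 9))
print(default)
print(best)
```

    According to RFC 1950, FLEVEL is informational and not needed for decompression, so differing compression levels alone should not matter to a decoder. If has_dict is set, the header is followed by a four-byte dictionary id, and stripping only two bytes would not be enough.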

    from typing import Tuple
    import zlib
    
    
    def prepare(data: bytes) -> Tuple[int, bytes, int]:
        deflate = zlib.compressobj()
        result = deflate.compress(data)
        result += deflate.flush(zlib.Z_SYNC_FLUSH)
        return len(data), result, zlib.adler32(data)
    
    
    def concatenate(*chunks: Tuple[int, bytes, int]) -> bytes:
        if not chunks:
            return b''
        _, result, final_checksum = chunks[0]
        for length, chunk, checksum in chunks[1:]:
            result += chunk[2:]  # strip the zlib header
            final_checksum = adler32_combine(final_checksum, checksum, length)
        result += b'\x03\x00'  # insert a final empty block
        result += final_checksum.to_bytes(4, byteorder='big')
        return result
    
    
    def adler32_combine(adler1: int, adler2: int, length2: int) -> int:
        # Python implementation of adler32_combine
        # The original C implementation is Copyright (C) 1995-2011, 2016 Mark Adler
        # see https://github.com/madler/zlib/blob/master/adler32.c#L143
        BASE = 65521
        WORD = 0xffff
        DWORD = 0xffffffff
        if adler1 < 0 or adler1 > DWORD:
            raise ValueError('adler1 must be between 0 and 2^32 - 1')
        if adler2 < 0 or adler2 > DWORD:
            raise ValueError('adler2 must be between 0 and 2^32 - 1')
        if length2 < 0:
            raise ValueError('length2 must not be negative')
    
        remainder = length2 % BASE
        sum1 = adler1 & WORD
        sum2 = (remainder * sum1) % BASE
        sum1 += (adler2 & WORD) + BASE - 1
        sum2 += ((adler1 >> 16) & WORD) + ((adler2 >> 16) & WORD) + BASE - remainder
        if sum1 >= BASE:
            sum1 -= BASE
        if sum1 >= BASE:
            sum1 -= BASE
        if sum2 >= (BASE << 1):
            sum2 -= (BASE << 1)
        if sum2 >= BASE:
            sum2 -= BASE
    
        return (sum1 | (sum2 << 16))
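
    For readers wondering why adler32_combine needs only the two checksums and the length of the second input, the arithmetic can be sketched as follows. The symbols are notation introduced here, not taken from the zlib source: X is the first input with bytes x_1 ... x_n, Y is the second input with m bytes, A is the low 16-bit half of the checksum, and B is the high half.

```latex
\begin{aligned}
A(X)  &= \Bigl(1 + \sum_{i=1}^{n} x_i\Bigr) \bmod 65521 \\
B(X)  &= \Bigl(\sum_{k=1}^{n} \bigl(1 + \sum_{i=1}^{k} x_i\bigr)\Bigr) \bmod 65521 \\
\operatorname{adler32}(X) &= 65536 \cdot B(X) + A(X) \\[4pt]
A(XY) &= \bigl(A(X) + A(Y) - 1\bigr) \bmod 65521 \\
B(XY) &= \bigl(B(X) + B(Y) + m \cdot (A(X) - 1)\bigr) \bmod 65521
\end{aligned}
```

    In the code, sum1 plays the role of A, sum2 plays the role of B, and remainder is m mod 65521. The extra + BASE terms mirror the C original, where they keep unsigned intermediates from going negative, and the conditional subtractions perform the final reduction modulo 65521.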
    

    A quick example:

    hello = prepare(b'Hello World! ')
    test = prepare(b'This is a test. ')
    fox = prepare(b'The quick brown fox jumped over the lazy dog. ')
    dawn = prepare(b'We ride at dawn! ')
    
    # each call prints the four inputs joined in argument order
    print(zlib.decompress(concatenate(hello, test, fox, dawn)))
    print(zlib.decompress(concatenate(dawn, fox, test, hello)))
    print(zlib.decompress(concatenate(fox, hello, dawn, test)))
    print(zlib.decompress(concatenate(test, dawn, hello, fox)))
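
    The concatenation above works because prepare ends each chunk with Z_SYNC_FLUSH instead of finishing the stream. The self-contained sketch below illustrates that property for a single chunk. The helper name sync_flushed is hypothetical and simply repeats the compression step of prepare.

```python
import zlib


def sync_flushed(data: bytes) -> bytes:
    """Compress data and end with a sync flush instead of a final block."""
    deflate = zlib.compressobj()
    return deflate.compress(data) + deflate.flush(zlib.Z_SYNC_FLUSH)


data = b'Hello World! '
chunk = sync_flushed(data)

# A sync flush pads to a byte boundary with an empty stored block: 00 00 FF FF.
ends_with_marker = chunk.endswith(b'\x00\x00\xff\xff')

# All input is already decodable, but the stream is not finished yet:
# there is no final block and no Adler-32 trailer.
inflate = zlib.decompressobj()
recovered = inflate.decompress(chunk)
stream_finished = inflate.eof

# Appending an empty final block and the checksum completes the stream.
finished = chunk + b'\x03\x00' + zlib.adler32(data).to_bytes(4, byteorder='big')

print(ends_with_marker, recovered, stream_finished)
print(zlib.decompress(finished))
```

    Because every chunk ends on a byte boundary and never marks a block as final, further deflate data can follow it directly, and only the very end of the combined stream needs the closing block and checksum.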
    