Concatenate multiple zlib compressed data streams into a single stream efficiently

前端未结


If I have several binary strings with compressed zlib data, is there a way to efficiently combine them into a single compressed string without decompressing everyth


3 Answers

忘掉有多难

2020-12-20 19:32

In addition to gzjoin, which requires decompressing the first deflate stream, take a look at gzlog.h and gzlog.c, which efficiently append short strings to a gzip file without having to decompress the deflate stream each time. (The code can easily be modified to operate on zlib-wrapped deflate data instead of gzip-wrapped deflate data.) Use this approach if you control the creation of the first deflate stream. If you do not, you have to use the gzjoin approach, which requires decompression.

Neither approach requires recompression.

暗喜

2020-12-20 19:33

Since you don't mind venturing into C, you can start by looking at the code for gzjoin.

Note that the gzjoin code has to decompress the data to find the parts that have to change when merged, but it doesn't have to recompress. That's not too bad, because decompression is typically faster than compression.
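To get a feel for what a decompress-only pass gives you, here is a minimal Python sketch. It is not part of gzjoin, and `stream_info` and `chunk_size` are names made up for illustration. It streams a zlib-wrapped input through the decompressor and keeps only the uncompressed length and the running Adler-32, without recompressing anything. It does not do the bit-level patching of the last deflate block that a real join would also need.

```python
import zlib
from typing import Tuple


def stream_info(stream: bytes, chunk_size: int = 64 * 1024) -> Tuple[int, int]:
    """Return (uncompressed length, Adler-32) of a complete zlib stream."""
    inflater = zlib.decompressobj()
    length = 0
    checksum = 1  # Adler-32 of the empty string
    pending = stream
    while pending:
        # Cap the output per call so memory use stays bounded.
        out = inflater.decompress(pending, chunk_size)
        length += len(out)
        checksum = zlib.adler32(out, checksum)
        pending = inflater.unconsumed_tail
    # Collect any output still buffered inside the decompressor.
    out = inflater.flush()
    length += len(out)
    checksum = zlib.adler32(out, checksum)
    if not inflater.eof:
        raise ValueError('incomplete zlib stream')
    return length, checksum
```

For example, `stream_info(zlib.compress(data))` should equal `(len(data), zlib.adler32(data))` for any `data`.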


误落风尘

2020-12-20 19:46

I'm just turning @zorlak's comment into an answer and adding some code so I can find it later.

If you can control the initial compression of your streams, you can store the length of the uncompressed data, its Adler-32 checksum, and the compressed data somewhere. Later you can then concatenate the individual streams in an arbitrary order.

Note that I am not sure whether the individual streams can have different compression levels, compression strategies, or window sizes, since the concatenate function strips the zlib header from all but the first stream...
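To check what a stripped header would have said, the two header bytes can be decoded by hand. The sketch below follows the header layout in RFC 1950 (CMF byte, then FLG byte); `ZlibHeader` and `parse_zlib_header` are hypothetical names used only here. It may help to compare the window size of each stream against the first one before concatenating, although whether a mismatch actually breaks decoding is not something this sketch settles.

```python
import zlib
from typing import NamedTuple


class ZlibHeader(NamedTuple):
    method: int        # 8 means deflate
    window_size: int   # LZ77 window size in bytes
    level_hint: int    # 0..3, informational hint about the level used
    has_dict: bool     # True if a preset dictionary is required


def parse_zlib_header(stream: bytes) -> ZlibHeader:
    """Decode the 2-byte zlib header (CMF, FLG) described in RFC 1950."""
    if len(stream) < 2:
        raise ValueError('need at least 2 bytes')
    cmf, flg = stream[0], stream[1]
    # The two bytes, read as a big-endian 16-bit number, must be a multiple of 31.
    if ((cmf << 8) | flg) % 31 != 0:
        raise ValueError('header check failed')
    method = cmf & 0x0F
    cinfo = cmf >> 4  # log2(window size) - 8
    return ZlibHeader(method, 1 << (cinfo + 8), flg >> 6, bool(flg & 0x20))


print(parse_zlib_header(zlib.compress(b'Hello World! ')))
```

With the default settings of `zlib.compress`, this should report a deflate stream with a 32768-byte window.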

from typing import Tuple
import zlib


def prepare(data: bytes) -> Tuple[int, bytes, int]:
    deflate = zlib.compressobj()
    result = deflate.compress(data)
    result += deflate.flush(zlib.Z_SYNC_FLUSH)
    return len(data), result, zlib.adler32(data)


def concatenate(*chunks: Tuple[int, bytes, int]) -> bytes:
    if not chunks:
        return b''
    _, result, final_checksum = chunks[0]
    for length, chunk, checksum in chunks[1:]:
        result += chunk[2:]  # strip the zlib header
        final_checksum = adler32_combine(final_checksum, checksum, length)
    result += b'\x03\x00'  # insert a final empty block
    result += final_checksum.to_bytes(4, byteorder='big')
    return result


def adler32_combine(adler1: int, adler2: int, length2: int) -> int:
    # Python implementation of adler32_combine
    # The original C implementation is Copyright (C) 1995-2011, 2016 Mark Adler
    # see https://github.com/madler/zlib/blob/master/adler32.c#L143
    BASE = 65521
    WORD = 0xffff
    DWORD = 0xffffffff
    if adler1 < 0 or adler1 > DWORD:
        raise ValueError('adler1 must be between 0 and 2^32 - 1')
    if adler2 < 0 or adler2 > DWORD:
        raise ValueError('adler2 must be between 0 and 2^32 - 1')
    if length2 < 0:
        raise ValueError('length2 must not be negative')

    remainder = length2 % BASE
    sum1 = adler1 & WORD
    sum2 = (remainder * sum1) % BASE
    sum1 += (adler2 & WORD) + BASE - 1
    sum2 += ((adler1 >> 16) & WORD) + ((adler2 >> 16) & WORD) + BASE - remainder
    if sum1 >= BASE:
        sum1 -= BASE
    if sum1 >= BASE:
        sum1 -= BASE
    if sum2 >= (BASE << 1):
        sum2 -= (BASE << 1)
    if sum2 >= BASE:
        sum2 -= BASE

    return (sum1 | (sum2 << 16))
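For anyone wondering where the formula comes from: Adler-32 keeps two sums modulo 65521. The low word is a = 1 + (sum of all bytes), and the high word is b = (sum of every running value of a). When a second block is appended, each of its running a values is shifted by a1 - 1, so b grows by length2 * (a1 - 1) in addition to b2. Below is a minimal sketch of that identity using plain modular arithmetic instead of the branch-based reductions of the C port. `combine_simple` is a name invented for this example, and the result is compared against `zlib.adler32` on the joined data.

```python
import zlib

BASE = 65521  # largest prime below 2**16, the Adler-32 modulus


def combine_simple(adler1: int, adler2: int, length2: int) -> int:
    """Adler-32 of A + B, given adler32(A), adler32(B) and len(B)."""
    a1, b1 = adler1 & 0xFFFF, adler1 >> 16
    a2, b2 = adler2 & 0xFFFF, adler2 >> 16
    # Both blocks start their low sum at 1, so one of the two 1s is dropped.
    a = (a1 + a2 - 1) % BASE
    # Every running sum in the second block is offset by (a1 - 1).
    b = (b1 + b2 + length2 * (a1 - 1)) % BASE
    return (b << 16) | a


first = b'Hello World! '
second = b'This is a test. '
combined = combine_simple(zlib.adler32(first), zlib.adler32(second), len(second))
print(combined == zlib.adler32(first + second))
```

Python's `%` always returns a non-negative result for a positive modulus, which is why the explicit `+ BASE` terms of the C version are not needed here.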

A quick example:

hello = prepare(b'Hello World! ')
test = prepare(b'This is a test. ')
fox = prepare(b'The quick brown fox jumped over the lazy dog. ')
dawn = prepare(b'We ride at dawn! ')

# each call prints the four sentences joined in the order given
print(zlib.decompress(concatenate(hello, test, fox, dawn)))
print(zlib.decompress(concatenate(dawn, fox, test, hello)))
print(zlib.decompress(concatenate(fox, hello, dawn, test)))
print(zlib.decompress(concatenate(test, dawn, hello, fox)))
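To see why the pieces can be glued together byte-wise, it may help to look at a single sync-flushed piece on its own. The sketch below repeats what prepare does for one input and then completes the stream by hand, which is the one-chunk case of concatenate. The variable names are only illustrative.

```python
import zlib

data = b'Hello World! '

deflate = zlib.compressobj()
piece = deflate.compress(data) + deflate.flush(zlib.Z_SYNC_FLUSH)

# A sync flush ends on a byte boundary with an empty stored block.
print(piece[-4:] == b'\x00\x00\xff\xff')

# The piece has no final block and no checksum yet, so the decompressor
# returns the data but does not report end-of-stream.
inflate = zlib.decompressobj()
print(inflate.decompress(piece) == data, inflate.eof)

# Appending an empty final block plus the big-endian Adler-32 completes it.
trailer = b'\x03\x00' + zlib.adler32(data).to_bytes(4, byteorder='big')
print(zlib.decompress(piece + trailer) == data)
```

Because every piece ends on a byte boundary and never marks a block as final, the deflate data of another piece can follow it directly once that piece's 2-byte header is removed.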
