If I have several binary strings with compressed zlib data, is there a way to efficiently combine them into a single compressed string without decompressing everything?
In addition to gzjoin, which requires decompressing the first deflate stream, you can take a look at gzlog.h and gzlog.c, which efficiently append short strings to a gzip file without having to decompress the deflate stream each time. (They can easily be modified to operate on zlib-wrapped deflate data instead of gzip-wrapped deflate data.) You would use this approach if you control the creation of the first deflate stream. If you do not create the first deflate stream, you have to use the gzjoin approach, which requires decompression.
Neither approach requires recompression.
Since you don't mind venturing into C, you can start by looking at the code for gzjoin.
Note, the gzjoin code has to decompress to find the parts that have to change when merged, but it doesn't have to recompress. That's not too bad because decompression is typically faster than compression.
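To make "the parts that have to change" concrete for the zlib wrapper from the question: a zlib stream is a 2-byte header, a raw deflate body, and a 4-byte big-endian Adler-32 of the uncompressed data. When two such streams are merged, the final-block flag inside the first body and the trailer are what must change. (gzjoin itself works on gzip files, which use a different header and a CRC-32 plus length trailer.) A minimal sketch using only Python's standard zlib module; the sample data and variable names are made up for illustration:

```python
import zlib

data = b'The quick brown fox jumped over the lazy dog. ' * 20
stream = zlib.compress(data)

# Split the stream into its three parts (no preset dictionary is assumed).
header, body, trailer = stream[:2], stream[2:-4], stream[-4:]

# RFC 1950: the two header bytes, read as a big-endian number, are a multiple of 31.
header_ok = int.from_bytes(header, 'big') % 31 == 0

# The trailer is the Adler-32 of the *uncompressed* data, stored big-endian.
trailer_ok = int.from_bytes(trailer, 'big') == zlib.adler32(data)

# The middle part is a raw deflate stream; wbits=-15 tells zlib to expect no wrapper.
body_ok = zlib.decompress(body, wbits=-15) == data

print(header_ok, trailer_ok, body_ok)
```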
I'm just turning @zorlak's comment into an answer and adding some code so I can find it later.
If you can control the initial compression of your streams, you can store the length of the uncompressed data, its Adler-32 checksum, and the compressed data somewhere. Later you can then concatenate the individual streams in an arbitrary order.
Note that I am not sure if the individual streams can have different compression levels, compression strategies, or window sizes since the concatenate function strips the zlib header of all but the first stream...
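from typing import Tuple
import zlib


def prepare(data: bytes) -> Tuple[int, bytes, int]:
    # Z_SYNC_FLUSH ends the output on a byte boundary but writes neither a
    # final block nor the Adler-32 trailer, so the stream stays "open".
    deflate = zlib.compressobj()
    result = deflate.compress(data)
    result += deflate.flush(zlib.Z_SYNC_FLUSH)
    return len(data), result, zlib.adler32(data)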


def concatenate(*chunks: Tuple[int, bytes, int]) -> bytes:
    if not chunks:
        return b''
    _, result, final_checksum = chunks[0]
    for length, chunk, checksum in chunks[1:]:
        result += chunk[2:]  # strip the zlib header
        final_checksum = adler32_combine(final_checksum, checksum, length)
    result += b'\x03\x00'  # insert a final empty block
    result += final_checksum.to_bytes(4, byteorder='big')
    return result


def adler32_combine(adler1: int, adler2: int, length2: int) -> int:
    # Python implementation of adler32_combine
    # The original C implementation is Copyright (C) 1995-2011, 2016 Mark Adler
    # see https://github.com/madler/zlib/blob/master/adler32.c#L143
    BASE = 65521
    WORD = 0xffff
    DWORD = 0xffffffff
    if adler1 < 0 or adler1 > DWORD:
        raise ValueError('adler1 must be between 0 and 2^32')
    if adler2 < 0 or adler2 > DWORD:
        raise ValueError('adler2 must be between 0 and 2^32')
    if length2 < 0:
        raise ValueError('length2 must not be negative')
    remainder = length2 % BASE
    sum1 = adler1 & WORD
    sum2 = (remainder * sum1) % BASE
    sum1 += (adler2 & WORD) + BASE - 1
    sum2 += ((adler1 >> 16) & WORD) + ((adler2 >> 16) & WORD) + BASE - remainder
    if sum1 >= BASE:
        sum1 -= BASE
    if sum1 >= BASE:
        sum1 -= BASE
    if sum2 >= (BASE << 1):
        sum2 -= (BASE << 1)
    if sum2 >= BASE:
        sum2 -= BASE
    return (sum1 | (sum2 << 16))
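The arithmetic in adler32_combine is easier to follow when written without the C-style overflow handling. Adler-32 keeps two 16-bit sums modulo 65521: A is 1 plus the sum of all bytes, and B is the sum of every intermediate value of A. The sketch below derives the combined checksum directly from that definition and checks it against zlib.adler32; `combine_simple` is a hypothetical name for illustration, not a zlib function.

```python
import zlib

BASE = 65521  # Adler-32 modulus (largest prime below 2**16)


def combine_simple(adler1: int, adler2: int, length2: int) -> int:
    # Hypothetical helper, not part of zlib: relies on Python's
    # arbitrary-precision integers and a plain modulo instead of the
    # conditional subtractions used in the C code.
    a1, b1 = adler1 & 0xffff, (adler1 >> 16) & 0xffff
    a2, b2 = adler2 & 0xffff, (adler2 >> 16) & 0xffff
    # A = 1 + sum of all bytes, so adding a1 and a2 counts the leading 1 twice.
    a = (a1 + a2 - 1) % BASE
    # B = sum of every running A. In the joined data, each of the length2
    # running values of the second part starts from a1 instead of 1,
    # so each one is larger by (a1 - 1).
    b = (b1 + b2 + length2 * (a1 - 1)) % BASE
    return (b << 16) | a


first = b'Hello World! '
second = b'This is a test. '
combined = combine_simple(zlib.adler32(first), zlib.adler32(second), len(second))
matches = combined == zlib.adler32(first + second)
print(matches)
```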
A quick example:
hello = prepare(b'Hello World! ')
test = prepare(b'This is a test. ')
fox = prepare(b'The quick brown fox jumped over the lazy dog. ')
dawn = prepare(b'We ride at dawn! ')
# these all print what you would expect
print(zlib.decompress(concatenate(hello, test, fox, dawn)))
print(zlib.decompress(concatenate(dawn, fox, test, hello)))
print(zlib.decompress(concatenate(fox, hello, dawn, test)))
print(zlib.decompress(concatenate(test, dawn, hello, fox)))
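On the open question above about mixing compression levels, a quick experiment is sketched below. It is self-contained, so it repeats the idea of prepare and concatenate under the hypothetical names `prepare_with_level` and `join`, and it uses a compact checksum combination that should give the same result as adler32_combine. It only exercises different levels on small inputs; it says nothing about differing compression strategies or window sizes.

```python
import zlib

BASE = 65521  # Adler-32 modulus


def prepare_with_level(data: bytes, level: int):
    # Hypothetical variant of prepare() with an explicit compression level.
    deflate = zlib.compressobj(level)
    body = deflate.compress(data) + deflate.flush(zlib.Z_SYNC_FLUSH)
    return len(data), body, zlib.adler32(data)


def combine(adler1: int, adler2: int, length2: int) -> int:
    # Compact Adler-32 combination using plain modular arithmetic.
    a1, b1 = adler1 & 0xffff, adler1 >> 16
    a2, b2 = adler2 & 0xffff, adler2 >> 16
    a = (a1 + a2 - 1) % BASE
    b = (b1 + b2 + length2 * (a1 - 1)) % BASE
    return (b << 16) | a


def join(*chunks) -> bytes:
    # Same idea as concatenate(): keep the first header, strip the others,
    # then close the stream with an empty final block and the checksum.
    _, result, checksum = chunks[0]
    for length, body, adler in chunks[1:]:
        result += body[2:]
        checksum = combine(checksum, adler, length)
    return result + b'\x03\x00' + checksum.to_bytes(4, byteorder='big')


parts = [b'Hello World! ' * 50, b'This is a test. ' * 50, b'We ride at dawn! ' * 50]
levels = [1, 6, 9]
joined = join(*(prepare_with_level(p, lvl) for p, lvl in zip(parts, levels)))
round_trip_ok = zlib.decompress(joined) == b''.join(parts)
print(round_trip_ok)
```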