Creating a gzip stream from separately compressed chunks

鱼传尺愫 2021-01-07 01:59

I'd like to be able to generate a gzip (.gz) file using concurrent CPU threads, i.e., deflating separate chunks of the input file with separately initialized deflate instances.

2 Answers
  • 2021-01-07 02:44

    The answer is in your question. Each thread has its own deflate instance to produce raw deflate data (see deflateInit2()), which compresses the chunk of the data fed to it, ending with Z_SYNC_FLUSH instead of Z_FINISH. Except for the last chunk of data, which you end with a Z_FINISH. Either way, this ends each resulting stream of compressed data on a byte boundary. Make sure that you get all of the generated data out of deflate(). Then you can concatenate all the compressed data streams. (In the correct order!) Precede that with a gzip header that you generate yourself. It is trivial to do that (see RFC 1952). It can just be a constant 10-byte sequence if you don't need any additional information included in the header (e.g. file name, modification date). The gzip header is not complex.

    You can also compute the CRC-32 of each uncompressed chunk in the same thread or a different thread, and combine those CRC-32's using crc32_combine(). You need that for the gzip trailer.

    After all of the compressed streams are written, ending with the compressed stream that was ended with a Z_FINISH, you append the gzip trailer. That consists of the four-byte CRC-32 and the low four bytes of the total uncompressed length, both in little-endian order. Eight bytes total.
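    Writing those eight bytes is straightforward; a sketch (the helper name `put_gzip_trailer` is invented for illustration):

```c
/* gzip trailer: CRC-32 of the uncompressed data, then the total
   uncompressed length modulo 2^32 (ISIZE), both little-endian. */
void put_gzip_trailer(unsigned char t[8],
                      unsigned long crc, unsigned long total_len)
{
    for (int i = 0; i < 4; i++) {
        t[i]     = (unsigned char)(crc >> (8 * i));
        t[4 + i] = (unsigned char)(total_len >> (8 * i));
    }
}
```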

    In each thread you can either use deflateEnd() when done with each chunk, or if you are reusing threads for more chunks, use deflateReset(). I found in pigz that it is much more efficient to leave threads open and deflate instances open in them when processing multiple chunks. Just make sure to use deflateEnd() for the last chunk that thread processes, before closing the thread. Yes, the error from deflateEnd() can be ignored. Just make sure that you've run deflate() until avail_out is not zero to get all of the compressed data.

    Doing this, each thread compresses its chunk with no reference to any other uncompressed data, where such references would normally improve the compression when doing it serially. If you want to get more advanced, you can feed each thread the chunk of uncompressed data to compress, and the last 32K of the previous chunk to provide history for the compressor. You do this with deflateSetDictionary().

    Still more advanced, you can reduce the number of bytes inserted between compressed streams by sometimes using Z_PARTIAL_FLUSH's until getting to a byte boundary. See pigz for the details of that.

    Even more advanced, but slower, you can append compressed streams at the bit level instead of the byte level. That would require shifting every byte of the compressed stream twice to build a new shifted stream. At least for seven out of every eight preceding compressed streams. This eliminates all of the extra bits inserted between compressed streams.

    A zlib stream can be generated in exactly the same way, using adler32_combine() for the checksum.

    Your question about zip implies a confusion. The zip format does not use the zlib header and trailer. zip has its own structure, within which are embedded raw deflate streams. You can use the above approach for those raw deflate streams as well.

  • 2021-01-07 02:52

    Sure:

    http://zlib.net/pigz/

    A parallel implementation of gzip for modern multi-processor, multi-core machines
