Question
I am encountering issues ungzipping chunks of bytes that I am reading from S3 using the iter_chunks() method from boto3. The strategy of ungzipping the file chunk by chunk originates from this issue.
The code is as follows:
import zlib

dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
for chunk in app.s3_client.get_object(Bucket=bucket, Key=key)["Body"].iter_chunks(2 ** 19):
    data = dec.decompress(chunk)
    print(len(chunk), len(data))
# 524288 65505
# 524288 0
# 524288 0
# ...
This code initially prints 65505, then 0 for every subsequent iteration. My understanding is that this code should ungzip each compressed chunk and then print the length of the uncompressed version.
Is there something I'm missing?
Answer 1:
It seems your input file is block gzip (bgzip, http://www.htslib.org/doc/bgzip.html), because you get a single 65 KB block of decoded data and nothing after it.
Gzip files can be concatenated together (see https://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage), and block gzip uses this to concatenate blocks of the same file, so that by using an associated index only the specific block containing the information of interest has to be decoded.
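To see this member-boundary behaviour in isolation, here is a minimal sketch, independent of S3; the two-member stream is built with the standard gzip module purely for illustration:

import gzip
import zlib

# Two concatenated gzip members, mimicking bgzip's block layout.
stream = gzip.compress(b"first block") + gzip.compress(b"second block")

dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
data = dec.decompress(stream)
print(data)                  # b'first block' -- decoding stops at the member boundary
print(len(dec.unused_data))  # > 0: the entire second member is left untouched

# A fresh decompressor is needed for the next member.
leftovers = dec.unused_data
dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
print(dec.decompress(leftovers))  # b'second block'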
So to stream-decode a block gzip file, you need to feed the leftover data from one block into a fresh decompressor for the next. E.g.
import zlib

# raw is an iterable of compressed chunks from a block gzip file,
# see http://www.htslib.org/doc/bgzip.html
dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
for chunk in raw:
    # decompress this chunk of data
    data = dec.decompress(chunk)
    # bgzip is a concatenation of gzip files; if there is data in this
    # chunk beyond the current block, it needs to be processed too
    while len(dec.unused_data):
        # end of one block: save whatever follows it
        leftovers = dec.unused_data
        # create a new decompressor for the next block
        dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
        # decompress the leftovers
        data = data + dec.decompress(leftovers)
    # TODO handle data
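Applied to the S3 stream from the question, the same pattern might look like the sketch below (bucket and key are the question's own variables; the boto3 client construction and the final print are assumptions for illustration):

import boto3
import zlib

s3 = boto3.client("s3")  # assumed setup; the question uses app.s3_client
body = s3.get_object(Bucket=bucket, Key=key)["Body"]

dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
for chunk in body.iter_chunks(2 ** 19):
    data = dec.decompress(chunk)
    # drain any complete blocks left over inside this chunk
    while len(dec.unused_data):
        leftovers = dec.unused_data
        dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
        data += dec.decompress(leftovers)
    print(len(chunk), len(data))  # every chunk should now yield decompressed bytes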
Source: https://stackoverflow.com/questions/61048597/ungzipping-chunks-of-bytes-from-from-s3-using-iter-chunks