Ungzipping chunks of bytes from from S3 using iter_chunks()

夙愿已清 提交于 2020-06-13 05:00:30

问题


I am encountering issues ungzipping chunks of bytes that I am reading from S3 using the iter_chunks() method from boto3. The strategy of ungzipping the file chunk-by-chunk originates from this issue.

The code is as follows:

dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
for chunk in app.s3_client.get_object(Bucket=bucket, Key=key)["Body"].iter_chunks(2 ** 19):
    data = dec.decompress(chunk)
    print(len(chunk), len(data))

# 524288 65505
# 524288 0
# 524288 0
# ...

This code initially prints out the value of 65505 followed thereafter by 0 for every subsequent iteration. My understanding is that this code should ungzip each compressed chunk, and then print the length of the uncompressed version.

Is there something I'm missing?


回答1:


It seems like your input file is block gzip (bgzip http://www.htslib.org/doc/bgzip.html ) because you have a 65k block of data decoded.

GZip files can be concatenated together ( see https://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage) and Block GZip uses this to concatenate blocks of the same file, so that by using an associated index only the specific block containing information of interest has to be decoded.

So to stream decode a block gzip file, you need to use the leftover data from one block to start a new one. E.g.

# source is a block gzip file see http://www.htslib.org/doc/bgzip.html
dec = zlib.decompressobj(32+zlib.MAX_WBITS)
for chunk in raw:
    # decompress this chunk of data
    data = dec.decompress(chunk)
    # bgzip is a concatenation of gzip files
    # if there is stuff in this chunk beyond the current block
    # it needs to be processed
    while len(dec.unused_data):
        # end of one block
        leftovers = dec.unused_data
        # create a new decompressor
        dec = zlib.decompressobj(32+zlib.MAX_WBITS)
        #decompress the leftovers
        data = data+dec.decompress(leftovers)
    # TODO handle data


来源:https://stackoverflow.com/questions/61048597/ungzipping-chunks-of-bytes-from-from-s3-using-iter-chunks

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!