Question
I am encountering issues ungzipping chunks of bytes that I am reading from S3 using the iter_chunks() method from boto3. The strategy of ungzipping the file chunk by chunk originates from this issue.
The code is as follows:
import zlib

dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
for chunk in app.s3_client.get_object(Bucket=bucket, Key=key)["Body"].iter_chunks(2 ** 19):
    data = dec.decompress(chunk)
    print(len(chunk), len(data))
# 524288 65505
# 524288 0
# 524288 0
# ...
This code initially prints 65505, then 0 for every subsequent iteration. My understanding is that this code should ungzip each compressed chunk and then print the length of the uncompressed version.
Is there something I'm missing?
Answer 1:
It seems your input file is block gzip (bgzip, http://www.htslib.org/doc/bgzip.html), because you get a single 65 KB block of decoded data and nothing after it.
Gzip files can be concatenated together (see https://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage), and block gzip uses this to concatenate blocks of the same file, so that by using an associated index only the specific block containing the information of interest has to be decoded.
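To see this member-boundary behaviour in isolation, here is a minimal sketch, independent of S3; the two-member stream is built with the standard gzip module purely for illustration:

import gzip
import zlib

# Two concatenated gzip members, mimicking bgzip's block layout.
stream = gzip.compress(b"first block") + gzip.compress(b"second block")

dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
data = dec.decompress(stream)
print(data)                  # b'first block' -- decoding stops at the member boundary
print(len(dec.unused_data))  # > 0: the entire second member is left untouched

# A fresh decompressor is needed for the next member.
leftovers = dec.unused_data
dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
print(dec.decompress(leftovers))  # b'second block'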
So to stream-decode a block gzip file, you need to feed the leftover data from one block into a fresh decompressor for the next. E.g.
import zlib

# raw is an iterable of compressed chunks from a block gzip file,
# see http://www.htslib.org/doc/bgzip.html
dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
for chunk in raw:
    # decompress this chunk of data
    data = dec.decompress(chunk)
    # bgzip is a concatenation of gzip files; if there is data in this
    # chunk beyond the current block, it needs to be processed too
    while len(dec.unused_data):
        # end of one block: save whatever follows it
        leftovers = dec.unused_data
        # create a new decompressor for the next block
        dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
        # decompress the leftovers
        data = data + dec.decompress(leftovers)
    # TODO handle data
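Applied to the S3 stream from the question, the same pattern might look like the sketch below (bucket and key are the question's own variables; the boto3 client construction and the final print are assumptions for illustration):

import boto3
import zlib

s3 = boto3.client("s3")  # assumed setup; the question uses app.s3_client
body = s3.get_object(Bucket=bucket, Key=key)["Body"]

dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
for chunk in body.iter_chunks(2 ** 19):
    data = dec.decompress(chunk)
    # drain any complete blocks left over inside this chunk
    while len(dec.unused_data):
        leftovers = dec.unused_data
        dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
        data += dec.decompress(leftovers)
    print(len(chunk), len(data))  # every chunk should now yield decompressed bytes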
Source: https://stackoverflow.com/questions/61048597/ungzipping-chunks-of-bytes-from-from-s3-using-iter-chunks