How to determine the Content-Length of a gzipped file in Python?

Posted by 女生的网名这么多〃 on 2019-12-12 13:55:25

Question


I have a large compressed file and I want to know the size of its content without uncompressing it. I've tried this:

import gzip
import os

with gzip.open(data_file) as f:
    f.seek(0, os.SEEK_END)
    size = f.tell()

but I get this error:

ValueError: Seek from end not supported 

How can I do that?

Thx.


Answer 1:


It is not possible in principle to definitively determine the size of the uncompressed data in a gzip file without decompressing it. You do not need to have the space to store the uncompressed data -- you can discard it as you go along. But you have to decompress it all.

If you control the source of the gzip file and can assure that a) there are no concatenated members in the gzip file, b) the uncompressed data is less than 4 GB in length, and c) there is no extraneous junk at the end of the gzip file, then and only then you can read the last four bytes of the gzip file to get a little-endian integer that has the length of the uncompressed data.
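As a rough illustration of why condition (a) matters, here is a small sketch that builds two gzip blobs in memory and reads the four-byte trailer of each. The helper name trailer_size is made up for this example, and the data is synthetic; it only shows that the trailer describes the last member, not the whole file.

```python
import gzip
import struct


def trailer_size(blob):
    """Return the last four bytes of blob as a little-endian unsigned 32-bit int."""
    return struct.unpack("<I", blob[-4:])[0]


# One gzip member holding 1000 bytes.
single = gzip.compress(b"a" * 1000)

# Two gzip members back to back: 1000 bytes, then 10 bytes.
concatenated = gzip.compress(b"a" * 1000) + gzip.compress(b"b" * 10)

print(trailer_size(single))                # trailer of the single member
print(trailer_size(concatenated))          # trailer of the LAST member only
print(len(gzip.decompress(concatenated)))  # real total, found by decompressing
```

With a single member the trailer matches the real length; with concatenated members it reports only the final member, so the only reliable total comes from decompressing.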

See this answer for more details.

Here is Python code to read a gzip file and print the uncompressed length, without having to store or save the uncompressed data. It limits the memory usage to small buffers. This requires Python 3.3 or greater:

#!/usr/local/bin/python3.4
import sys
import zlib
import warnings
f = open(sys.argv[1], "rb")
total = 0
buf = f.read(1024)
while True:             # loop through concatenated gzip streams
    z = zlib.decompressobj(15+16)
    while True:         # loop through one gzip stream
        while True:     # go through all output from one input buffer
            total += len(z.decompress(buf, 4096))
            buf = z.unconsumed_tail
            if buf == b"":
                break
        if z.eof:
            break       # end of a gzip stream found
        buf = f.read(1024)
        if buf == b"":
            warnings.warn("incomplete gzip stream")
            break
    buf = z.unused_data
    z = None
    if buf == b"":
        buf = f.read(1024)   # refill from the file to look for another gzip stream
        if buf == b"":
            break
print(total)
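If you would rather stay with the gzip module than drive zlib directly, the same "decompress and discard" idea can be sketched as below. This is only a minimal sketch: the function name uncompressed_size and the 64 KB chunk size are arbitrary choices for the example, not part of any library. It still decompresses everything, but only one chunk is held in memory at a time.

```python
import gzip


def uncompressed_size(path, chunk_size=64 * 1024):
    """Count the uncompressed bytes in a gzip file by reading it in chunks.

    Each chunk is discarded after its length is added, so memory use
    stays around chunk_size regardless of how large the data is.
    """
    total = 0
    with gzip.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:      # empty bytes object means end of data
                break
            total += len(chunk)
    return total
```

Usage would be something like size = uncompressed_size(data_file). The total is a plain Python integer, so the 4 GB limit of the trailer field does not apply.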



Answer 2:


Unfortunately, the Python 2.x gzip module doesn't appear to support any way of determining uncompressed file size.

However, gzip does store the uncompressed file size as a little-endian 32-bit unsigned integer at the very end of the file: http://www.abeel.be/content/determine-uncompressed-size-gzip-file

Unfortunately, this only works for files smaller than 4 GB, because the gzip format stores the size in only a 32-bit integer; see the manual.

import os
import struct

with open(data_file, "rb") as f:
    f.seek(-4, os.SEEK_END)                  # the size is in the last four bytes
    size, = struct.unpack("<I", f.read(4))   # little-endian unsigned 32-bit
    print(size)



Answer 3:


To summarize: I need to open huge compressed files (> 4 GB), so Dan's technique won't work, and I want the length of the file as a number of lines, so Mark Adler's technique is not appropriate.

Eventually, I found a solution for uncompressed files (not the most optimized, but it works!) that can easily be transposed to compressed files:

import gzip

def count_lines(data_file):
    """Count the lines in a gzip-compressed file."""
    size = 0
    with gzip.open(data_file) as f:
        for line in f:
            size += 1
    return size
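If both the line count and the byte count are wanted, one pass over the file can produce both. The following is only a sketch under my own assumptions: the name count_lines_and_bytes and the 1 MB chunk size are invented for the example, and lines are taken to be separated by b"\n". Counting newline bytes per chunk avoids creating one Python object per line.

```python
import gzip


def count_lines_and_bytes(path, chunk_size=1024 * 1024):
    """Return (line_count, byte_count) for the uncompressed content of a gzip file."""
    lines = 0
    size = 0
    last = b""
    with gzip.open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            lines += chunk.count(b"\n")
            size += len(chunk)
            last = chunk[-1:]          # remember the final byte seen so far
    # A final line without a trailing newline still counts as a line,
    # matching what iterating over the file line by line would report.
    if size and last != b"\n":
        lines += 1
    return lines, size
```

Usage would be something like lines, size = count_lines_and_bytes(data_file).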

Thank you all, people in this forum are very effective!



Source: https://stackoverflow.com/questions/24332295/how-to-determine-the-content-length-of-a-gzipped-file-in-python
