Question
I have a large gzip-compressed file and I want to know the size of its contents without decompressing it. I tried this:
import gzip
import os

with gzip.open(data_file) as f:
    f.seek(0, os.SEEK_END)
    size = f.tell()
but I get this error:
ValueError: Seek from end not supported
How can I do that?
Thanks.
Answer 1:
It is not possible in principle to definitively determine the size of the uncompressed data in a gzip file without decompressing it. You do not need to have the space to store the uncompressed data -- you can discard it as you go along. But you have to decompress it all.
If you control the source of the gzip file and can assure that a) there are no concatenated members in the gzip file, b) the uncompressed data is less than 4 GB in length, and c) there is no extraneous junk at the end of the gzip file, then and only then you can read the last four bytes of the gzip file to get a little-endian integer that has the length of the uncompressed data.
See this answer for more details.
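Condition (a) above can be illustrated with a small in-memory sketch. It assumes Python 3 and uses only the standard `gzip` and `struct` modules; `trailer_size` is a hypothetical helper name introduced here for illustration. With two concatenated members, the last four bytes describe only the final member, so the trailer undercounts the total.

```python
import gzip
import struct


def trailer_size(blob):
    """Hypothetical helper: read the last four bytes as a little-endian uint32."""
    return struct.unpack("<I", blob[-4:])[0]


# One gzip member holding 1000 bytes.
single = gzip.compress(b"a" * 1000)
# Two concatenated members: 1000 bytes followed by 10 bytes.
double = gzip.compress(b"a" * 1000) + gzip.compress(b"b" * 10)

print(trailer_size(single))          # 1000, matches the real size
print(len(gzip.decompress(single)))  # 1000
print(trailer_size(double))          # 10, only the last member
print(len(gzip.decompress(double)))  # 1010, the real total
```

The trailer is therefore only trustworthy when you know how the file was produced.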
Here is Python code to read a gzip file and print the uncompressed length, without having to store or save the uncompressed data. It limits the memory usage to small buffers. This requires Python 3.3 or greater:
#!/usr/local/bin/python3.4
import sys
import zlib
import warnings

f = open(sys.argv[1], "rb")
total = 0
buf = f.read(1024)
while True:             # loop through concatenated gzip streams
    z = zlib.decompressobj(15+16)   # 15+16: largest window, expect a gzip header
    while True:         # loop through one gzip stream
        while True:     # go through all output from one input buffer
            total += len(z.decompress(buf, 4096))
            buf = z.unconsumed_tail
            if buf == b"":
                break
        if z.eof:
            break       # end of a gzip stream found
        buf = f.read(1024)
        if buf == b"":
            warnings.warn("incomplete gzip stream")
            break
    buf = z.unused_data             # input left over after the end of this stream
    z = None
    if buf == b"":
        buf = f.read(1024)          # assignment, not comparison: refill the buffer
        if buf == b"":
            break                   # end of file
f.close()
print(total)
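If the lower-level `zlib` loop is more than you need, the same "decompress and discard" idea can be sketched with the `gzip` module on Python 3. This is a minimal sketch, not the answer's method: `uncompressed_size` and the `chunk_size` parameter are names assumed here for illustration. It still decompresses everything, but holds only one chunk in memory at a time.

```python
import gzip


def uncompressed_size(path, chunk_size=1 << 20):
    """Hypothetical helper: count decompressed bytes without keeping them."""
    total = 0
    with gzip.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)   # at most chunk_size decompressed bytes
            if not chunk:                # empty bytes object means end of data
                break
            total += len(chunk)          # count the chunk, then let it be discarded
    return total
```

Usage would look like `print(uncompressed_size("data.gz"))`, where `data.gz` is a placeholder path. The `gzip` module reads concatenated members as one continuous stream, so they are included in the count.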
Answer 2:
Unfortunately, the Python 2.x gzip module doesn't appear to support any way of determining uncompressed file size.
However, the gzip format does store the uncompressed size as a little-endian 32-bit unsigned integer in the last four bytes of the file: http://www.abeel.be/content/determine-uncompressed-size-gzip-file
Unfortunately, this only works for files smaller than 4 GB, because the gzip format uses only a 32-bit integer for this field (for larger inputs the stored value is the size modulo 2^32); see the manual.
import os
import struct

with open(data_file, "rb") as f:
    f.seek(-4, os.SEEK_END)                  # jump to the last four bytes
    size, = struct.unpack("<I", f.read(4))   # little-endian unsigned 32-bit integer
print(size)
Answer 3:
To summarize: I need to open huge compressed files (> 4 GB), so Dan's technique won't work, and I want the length of the file in lines, so Mark Adler's technique is not appropriate.
Eventually, I found a solution for uncompressed files (not the most optimized, but it works!) that can easily be adapted to compressed files:
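import gzip

def count_lines(data_file):   # function name added so that "return" is valid
    size = 0
    with gzip.open(data_file) as f:
        for line in f:        # decompresses on the fly, one line at a time
            size += 1
    return size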
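As a possible refinement of the loop above, newlines can be counted per decompressed chunk instead of iterating line by line. This is only a sketch, assuming Python 3 and `\n` line endings; `count_lines_chunked` and `chunk_size` are hypothetical names, and no speedup has been measured here.

```python
import gzip


def count_lines_chunked(path, chunk_size=1 << 20):
    """Hypothetical helper: count lines by counting newline bytes per chunk."""
    lines = 0
    last_byte = b""
    with gzip.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            lines += chunk.count(b"\n")
            last_byte = chunk[-1:]       # remember how the data ends
    if last_byte and last_byte != b"\n":
        lines += 1                       # final line without a trailing newline
    return lines
```

The last check keeps the result consistent with line iteration, which also counts a final line that lacks a trailing newline.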
Thank you all, people in this forum are very effective!
Source: https://stackoverflow.com/questions/24332295/how-to-determine-the-content-length-of-a-gzipped-file-in-python