I don't want to use OS commands as that makes it OS dependent.
This is available in tarfile: tarfile.is_tarfile(filename) checks whether a file is a valid tar archive.
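For reference, a minimal check with that function might look like this ('myfile.tar.gz' is a placeholder path; note it tests whether the file is a readable tar archive, not whether it is empty):

import tarfile

# True only for a readable tar archive (plain or compressed);
# this says nothing about whether the archive is empty.
if tarfile.is_tarfile('myfile.tar.gz'):
    print('valid tar archive')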
Unfortunately, any such attempt will likely have a fair bit of overhead; it would likely be cheaper to catch the exception, as users commented above. A gzip file defines a few fixed-size regions, as follows:
Fixed Regions
First, there are 2 bytes for the gzip magic number, 1 byte for the compression method, 1 byte for the flags, then 4 more bytes for the MTIME (the original file's modification time), 1 byte for extra flags, and one more byte for the operating system, giving us a total of 10 bytes so far.
This looks as follows (from the link above):
+---+---+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG| MTIME |XFL|OS | (more-->)
+---+---+---+---+---+---+---+---+---+---+
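As a hedged sketch (assuming a well-formed file; 'myfile.gz' is a placeholder path), these fixed 10 bytes can be unpacked with struct:

import struct

with open('myfile.gz', 'rb') as f:
    header = f.read(10)  # the fixed-size region described above
magic, cm, flg, mtime, xfl, os_byte = struct.unpack('<2sBBIBB', header)
assert magic == b'\x1f\x8b'  # ID1, ID2: the gzip magic number
assert cm == 8               # CM: 8 means DEFLATE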
Variable Regions
However, this is where things get tricky (and impossible to check without using the gzip module or another DEFLATE implementation); a sketch of walking these regions by hand follows the walkthrough below.
If extra fields were set, there is a variable region of XLEN bytes set afterwards, which looks as follows:
(if FLG.FEXTRA set)
+---+---+=================================+
| XLEN |...XLEN bytes of "extra field"...| (more-->)
+---+---+=================================+
After this, there is then a region of N bytes, with a zero-terminated string for the file name (which is, by default, stored):
(if FLG.FNAME set)
+=========================================+
|...original file name, zero-terminated...| (more-->)
+=========================================+
We then have comments:
(if FLG.FCOMMENT set)
+===================================+
|...file comment, zero-terminated...| (more-->)
+===================================+
And finally, if FLG.FHCRC is set, there is a CRC16 (a cyclic redundancy check) to verify the header's integrity, all before we get to the variable-length, compressed data.
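To make the layout above concrete, here is a hedged sketch of walking those optional regions by hand, assuming a well-formed file (this is the bookkeeping the gzip module does for you):

import struct

# Flag bits from RFC 1952.
FHCRC, FEXTRA, FNAME, FCOMMENT = 2, 4, 8, 16

def header_length(path):
    '''Return the total gzip header length in bytes, walking the
    optional regions according to the FLG byte.'''
    with open(path, 'rb') as f:
        fixed = f.read(10)         # the fixed region
        flg = fixed[3]             # the FLG byte
        length = 10
        if flg & FEXTRA:           # XLEN, then XLEN bytes of extra field
            xlen, = struct.unpack('<H', f.read(2))
            f.read(xlen)
            length += 2 + xlen
        for flag in (FNAME, FCOMMENT):  # zero-terminated strings
            if flg & flag:
                while True:
                    byte = f.read(1)
                    length += 1
                    if byte in (b'', b'\x00'):  # NUL terminator (or EOF)
                        break
        if flg & FHCRC:            # the 2-byte header CRC16
            length += 2
        return length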
Solution
So, any sort of fixed-size check will depend on whether the filename was stored (it is omitted when the data arrives via a pipe, e.g. echo "Compress this data" | gzip -c > myfile.gz), on any extra fields, and on comments, all of which can be present even for null files. So, how do we get around this? Simple: use the gzip module:
import gzip

def check_null(path):
    '''
    Returns empty bytes for a null file, which is falsey,
    and returns non-empty bytes otherwise (which is truthy).
    '''
    with gzip.GzipFile(path, 'rb') as f:
        return f.read(1)
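For example ('myfile.gz' being a placeholder path):

if check_null('myfile.gz'):
    print('file contains data')
else:
    print('null gzip file')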
This will check whether any data exists inside the file, while only reading a small section of it. However, since even this takes a while, it's easier to ask forgiveness than permission:
import contextlib  # contextlib.suppress is Python 3 only; for Py2, see the try/except sketch below
import pandas as pd

with contextlib.suppress(pd.errors.EmptyDataError):  # older pandas raised pd.parser.CParserError
    df = pd.read_csv(path, compression='gzip', names=['a', 'b', 'c'], header=None)
    # do something here
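As a sketch, the equivalent try/except block mentioned in the comment works on Python 2 as well (assuming a pandas version new enough to expose pd.errors):

import pandas as pd

try:
    df = pd.read_csv(path, compression='gzip', names=['a', 'b', 'c'], header=None)
    # do something here
except pd.errors.EmptyDataError:  # older pandas: pd.parser.CParserError
    pass  # the gzip file held no data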
I had a few hundred thousand gzip files, only a few of which were zero-sized, mounted on a network share, so I was forced to use the following optimization. It is brittle, but in the (very common) case in which you have a large number of files generated by the same method, the total size of everything other than the embedded file name is a constant.
Then you can check for a zero-sized payload by:
from os import stat
from os.path import basename

# YMMV with len_minus_file_name
def is_gzip_empty(file_name, len_minus_file_name=23):
    return stat(file_name).st_size - len(basename(file_name)) == len_minus_file_name
This could break in many ways. Caveat emptor. Only use it if other methods are not practical.
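If you do go this route, one way to make it slightly less brittle is to calibrate the constant from a file you know holds an empty payload, rather than hard-coding 23 (a hedged sketch; 'known_empty.gz' and 'myfile.gz' are placeholder paths):

from os import stat
from os.path import basename

def empty_payload_overhead(known_empty_path):
    # Size of everything except the embedded file name, measured once.
    return stat(known_empty_path).st_size - len(basename(known_empty_path))

threshold = empty_payload_overhead('known_empty.gz')
print(is_gzip_empty('myfile.gz', len_minus_file_name=threshold))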