How to check empty gzip file in Python

太阳男子 2021-01-12 01:46

I don't want to use OS commands, as that makes it OS-dependent.

This is available in tarfile, tarfile.is_tarfile(filename), to check if

8 answers
  • 2021-01-12 02:24

    Unfortunately, any such attempt will likely have a fair bit of overhead; it would likely be cheaper to catch the exception, as users commented above. A gzip file defines a few fixed-size regions, as follows:

    Fixed Regions

    First, there are 2 bytes for the gzip magic number, 1 byte for the compression method, 1 byte for the flags, then 4 more bytes for MTIME (the file modification time), 1 byte for extra flags, and 1 more byte for the operating system, giving us a total of 10 bytes so far.

    This looks as follows (from the gzip file format specification, RFC 1952):

    +---+---+---+---+---+---+---+---+---+---+
    |ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)
    +---+---+---+---+---+---+---+---+---+---+
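
As a rough illustration of the layout above, the ten fixed bytes can be unpacked with the standard struct module. This is only a sketch: parse_fixed_header is a hypothetical helper name, and the sample data comes from gzip.compress (its mtime argument assumes Python 3.8 or later).

```python
import gzip
import struct

def parse_fixed_header(data):
    """Unpack the 10 fixed header bytes of a gzip member (hypothetical helper)."""
    if len(data) < 10:
        raise ValueError("need at least 10 bytes")
    # '<' means little-endian without padding:
    # ID1, ID2, CM, FLG (1 byte each), MTIME (4 bytes), XFL, OS (1 byte each)
    id1, id2, cm, flg, mtime, xfl, os_code = struct.unpack("<BBBBIBB", data[:10])
    if (id1, id2) != (0x1F, 0x8B):
        raise ValueError("not a gzip stream")
    return {"cm": cm, "flg": flg, "mtime": mtime, "xfl": xfl, "os": os_code}

# Sample input: an empty payload compressed in memory (no file name stored).
header = parse_fixed_header(gzip.compress(b"", mtime=0))
print(header)
```

The format string adds up to exactly 10 bytes, which matches the diagram: four 1-byte fields, a 4-byte MTIME, then two more 1-byte fields.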
    

    Variable Regions

    However, this is where things get tricky: the rest of the header has no fixed length, and telling whether the compressed payload is empty is impossible without using a gzip module or another decompressor.

    If extra fields were set, a 2-byte XLEN field follows, and then a variable region of XLEN bytes, which looks as follows:

    (if FLG.FEXTRA set)
    +---+---+=================================+
    | XLEN  |...XLEN bytes of "extra field"...| (more-->)
    +---+---+=================================+
    

    After this, there is a region of N bytes holding the original file name as a zero-terminated string (the name is stored by default):

    (if FLG.FNAME set)
    +=========================================+
    |...original file name, zero-terminated...| (more-->)
    +=========================================+
    

    We then have comments:

    (if FLG.FCOMMENT set)
    +===================================+
    |...file comment, zero-terminated...| (more-->)
    +===================================+
    

    And finally, if FLG.FHCRC is set, a CRC16 (a cyclic redundancy check that verifies the header is intact), all before we get into the variable-length compressed data.
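
To make the variable regions concrete, here is a minimal sketch that walks the optional fields and returns the offset where the compressed data begins. The function name gzip_header_length and the sample file name data.txt are made up for illustration; the flag bit values are the ones defined in RFC 1952.

```python
import gzip
import io
import struct

# FLG bit values from RFC 1952
FHCRC, FEXTRA, FNAME, FCOMMENT = 0x02, 0x04, 0x08, 0x10

def gzip_header_length(data):
    """Return the offset of the compressed data (hypothetical helper)."""
    if data[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    flg = data[3]
    pos = 10  # end of the fixed region
    if flg & FEXTRA:
        # 2-byte little-endian XLEN, then XLEN bytes of extra field
        (xlen,) = struct.unpack_from("<H", data, pos)
        pos += 2 + xlen
    if flg & FNAME:
        # zero-terminated original file name
        pos = data.index(b"\x00", pos) + 1
    if flg & FCOMMENT:
        # zero-terminated comment
        pos = data.index(b"\x00", pos) + 1
    if flg & FHCRC:
        # 2-byte header CRC
        pos += 2
    return pos

# Same empty payload, with and without a stored file name.
unnamed = gzip.compress(b"")
buf = io.BytesIO()
with gzip.GzipFile(filename="data.txt", mode="wb", fileobj=buf):
    pass  # write nothing
named = buf.getvalue()

print(gzip_header_length(unnamed), gzip_header_length(named))
```

Both streams carry an empty payload, yet their headers differ in length, which is why a check based on a fixed file size is fragile.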

    Solution

    So, any sort of fixed-size check will depend on whether the file name was stored (it is not when the file is written via a pipe, e.g. echo "Compress this data" | gzip -c > myfile.gz), and on the other fields and comments, all of which can be defined for null files. So, how do we get around this? Simple, use the gzip module:

    import gzip
    
    def check_null(path):
        '''
        Returns an empty bytes object for a null file, which is falsy,
        and returns a non-empty bytes object otherwise (which is truthy)
        '''
    
        with gzip.GzipFile(path, 'rb') as f:
            return f.read(1)
    

    This checks whether any data exists inside the file while reading only a small part of it. However, checking every file up front still takes a while; it's easier to ask for forgiveness than permission:

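    import contextlib       # python3 only, use a try/except block for Py2
    import pandas as pd
    
    # pd.parser.CParserError from old pandas releases is now pd.errors.ParserError;
    # pd.errors.EmptyDataError is raised when read_csv finds nothing to parse.
    with contextlib.suppress(pd.errors.ParserError, pd.errors.EmptyDataError):
        df = pd.read_csv(path, compression='gzip', names=['a', 'b', 'c'], header=None)
        # do something here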
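
For a quick sanity check of the read(1) approach without pandas, the following sketch builds one empty and one non-empty gzip file in a temporary directory and tests both. The helper name gzip_payload_is_empty and the file names are placeholders, not part of any library.

```python
import gzip
import os
import tempfile

def gzip_payload_is_empty(path):
    """True if the gzip file at `path` decompresses to zero bytes."""
    with gzip.open(path, "rb") as f:
        # Asking for a single byte avoids decompressing the whole file.
        return f.read(1) == b""

with tempfile.TemporaryDirectory() as tmp:
    empty_path = os.path.join(tmp, "empty.gz")
    full_path = os.path.join(tmp, "full.gz")

    with gzip.open(empty_path, "wb"):
        pass  # nothing written: a valid gzip file with an empty payload
    with gzip.open(full_path, "wb") as f:
        f.write(b"a,b,c\n1,2,3\n")

    results = (gzip_payload_is_empty(empty_path), gzip_payload_is_empty(full_path))
    print(results)
```

Note that a file containing non-gzip data makes the read raise an OSError (gzip.BadGzipFile on Python 3.8 and later), so callers that may meet such files would still want a try/except around the call.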
    
  • 2021-01-12 02:28

    I had a few hundred thousand gzip files, only a few of which were zero-sized, mounted on a network share, and I was forced to use the following optimization. It is brittle, but in the (very frequent) case in which a large number of files are generated using the same method, the total number of bytes other than the name of the payload is a constant.

    Then you can check for a zero-sized payload by:

    1. Computing that constant over one file. You can code it up, but I find it simpler to just use command-line gzip (and this whole answer is an ugly hack anyway).
    2. Examining only the inode (via stat) for the rest of the files, instead of opening each file, which can be orders of magnitude faster:
    import os    # the function below calls os.stat, so import the module itself
    from os.path import basename
    
    # YMMV with len_minus_file_name
    def is_gzip_empty(file_name, len_minus_file_name=23): 
        return os.stat(file_name).st_size - len(basename(file_name)) == len_minus_file_name
    

    This could break in many ways. Caveat emptor. Only use it if other methods are not practical.
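
Step 1 can also be coded up. The sketch below derives the constant from one known-empty reference file and then applies the stat-only check to other files. It assumes every file is written by Python's gzip module, which stores the base name without the .gz suffix; files produced by another tool will likely yield a different constant. All helper and file names are hypothetical.

```python
import gzip
import os
import tempfile

def size_minus_name(path):
    """File size from stat() minus the length of the base name."""
    return os.stat(path).st_size - len(os.path.basename(path))

def looks_empty(path, constant):
    """Stat-only check: never opens or decompresses the file."""
    return size_minus_name(path) == constant

with tempfile.TemporaryDirectory() as tmp:
    def make(name, payload):
        # Assumption: all files are produced the same way, here via gzip.open.
        path = os.path.join(tmp, name)
        with gzip.open(path, "wb") as f:
            f.write(payload)
        return path

    # Calibrate once on a file known to have an empty payload.
    constant = size_minus_name(make("reference.gz", b""))

    other_empty = make("another-empty-file.gz", b"")
    non_empty = make("data.gz", b"some payload")
    checks = (looks_empty(other_empty, constant), looks_empty(non_empty, constant))
    print(constant, checks)
```

Calibrating on a real sample instead of hard-coding the number keeps the hack tied to however the files were actually generated.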
