How to check empty gzip file in Python

太阳男子

I don't want to use OS commands, as that makes it OS dependent.

This is available in tarfile, tarfile.is_tarfile(filename), to check if

8 answers

失恋的感觉

2021-01-12 02:24
Unfortunately, any such attempt will likely carry a fair bit of overhead; it would likely be cheaper to catch the exception, as users commented above. A gzip file defines a few fixed-size regions, as follows:

Fixed Regions

First, there are 2 bytes for the gzip magic number, 1 byte for the compression method, 1 byte for the flags, 4 bytes for MTIME (the file modification time), 1 byte for extra flags, and 1 byte for the operating system, giving us a total of 10 bytes so far.

This looks as follows (the layout given in the gzip format specification, RFC 1952):
```
+---+---+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)
+---+---+---+---+---+---+---+---+---+---+
```
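As a rough illustration of the fixed fields in the diagram above, the following sketch unpacks them with the struct module. The function name read_fixed_header and the in-memory sample are made up for this example; only the field layout comes from the gzip format.
```python
import gzip
import struct


def read_fixed_header(raw):
    """Unpack the fixed fields at the start of a gzip member (hypothetical helper)."""
    if len(raw) < 10:
        raise ValueError("need at least 10 bytes for the fixed header")
    # Little-endian: ID1, ID2, CM, FLG (1 byte each), MTIME (4 bytes), XFL, OS (1 byte each)
    id1, id2, cm, flg, mtime, xfl, os_code = struct.unpack("<BBBBIBB", raw[:10])
    if (id1, id2) != (0x1F, 0x8B):
        raise ValueError("magic number mismatch: not a gzip stream")
    return {"cm": cm, "flg": flg, "mtime": mtime, "xfl": xfl, "os": os_code}


# Build a small gzip stream in memory to try the helper on.
sample = gzip.compress(b"hello", mtime=0)
header = read_fixed_header(sample)
```
Here header["cm"] should be 8 (deflate), and header["flg"] tells you which of the optional regions are present.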
Variable Regions

However, this is where things get tricky (and impossible to check with a fixed-size comparison; you need a gzip module or another decompressor).

If extra fields were set, a 2-byte XLEN field follows, and then a variable region of XLEN bytes, which looks as follows:
```
(if FLG.FEXTRA set)
+---+---+=================================+
| XLEN  |...XLEN bytes of "extra field"...| (more-->)
+---+---+=================================+
```
After this comes a region of N bytes holding the original file name as a zero-terminated string (the command-line gzip stores it by default):
```
(if FLG.FNAME set)
+=========================================+
|...original file name, zero-terminated...| (more-->)
+=========================================+
```
We then have comments:
```
(if FLG.FCOMMENT set)
+===================================+
|...file comment, zero-terminated...| (more-->)
+===================================+
```
And finally, if FLG.FHCRC is set, a 2-byte CRC16 (a cyclic redundancy check over the header, to make sure the header is intact), all before we get into the variable-length compressed data.
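To make the variable regions concrete, here is a minimal sketch that walks the optional fields and reports where the header ends. The names gzip_header_length and empty_gzip are invented for this example, and it assumes the whole header is already in memory and well formed.
```python
import gzip
import io
import struct

# Flag bits in the FLG byte
FHCRC, FEXTRA, FNAME, FCOMMENT = 0x02, 0x04, 0x08, 0x10


def gzip_header_length(raw):
    """Return the number of header bytes before the compressed data (hypothetical helper)."""
    if raw[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    flg = raw[3]
    pos = 10  # end of the fixed region
    if flg & FEXTRA:
        (xlen,) = struct.unpack_from("<H", raw, pos)
        pos += 2 + xlen
    if flg & FNAME:
        pos = raw.index(b"\x00", pos) + 1  # skip the zero-terminated name
    if flg & FCOMMENT:
        pos = raw.index(b"\x00", pos) + 1  # skip the zero-terminated comment
    if flg & FHCRC:
        pos += 2
    return pos


def empty_gzip(stored_name):
    """Build a gzip stream with an empty payload and the given stored name."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename=stored_name, mode="wb", fileobj=buf, mtime=0):
        pass
    return buf.getvalue()


piped = empty_gzip("")            # no name stored, as when writing via a pipe
named = empty_gzip("report.csv")  # name stored in the header
lengths = (gzip_header_length(piped), gzip_header_length(named))
```
Both streams hold an empty payload, yet their headers (and so their total sizes) differ by the length of the stored name plus its terminating zero byte.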

Solution

So any fixed-size check depends on whether the file name was stored (it is not when the data was written via a pipe, e.g. echo "Compress this data" | gzip -c > myfile.gz), on the extra fields, and on the comment, all of which can be present in a null file. So, how do we get around this? Simple, use the gzip module:
```
import gzip

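def check_null(path):
    '''
    Returns an empty bytes object (b'' on Python 3, '' on Python 2)
    for a null file, which is falsy, and returns a non-empty one
    otherwise (which is truthy).
    '''

    # Decompress at most one byte; that is enough to know if a payload exists
    with gzip.GzipFile(path, 'rb') as f:
        return f.read(1)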
```
This checks whether any data exists inside the file while reading only a small section of it. However, this takes a while; it's easier to ask for forgiveness than permission.
```
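import contextlib       # Python 3 only; use a try/except block on Python 2
import pandas as pd

path = 'myfile.gz'      # placeholder path

# Which exception (if any) is raised for an empty file depends on the
# pandas version and the arguments used, so both are suppressed here.
with contextlib.suppress(pd.errors.EmptyDataError, pd.errors.ParserError):
    df = pd.read_csv(path, compression='gzip', names=['a', 'b', 'c'], header=None)
    # do something here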
```
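If pandas is not involved, the same ask-forgiveness idea can be sketched with the gzip module alone. The helper name read_gzip_or_none is made up for this example, and treating every failure as "nothing to read" is an assumption that may be too lenient for your case.
```python
import gzip
import zlib


def read_gzip_or_none(path):
    """Return the decompressed bytes, or None for an empty or unreadable gzip file."""
    try:
        with gzip.open(path, "rb") as f:
            data = f.read()
    except (OSError, EOFError, zlib.error):
        # OSError covers "not a gzip file" (and also missing files);
        # EOFError and zlib.error cover truncated or corrupt streams.
        return None
    return data or None  # an empty payload is reported as None too
```
Callers can then write something like: if read_gzip_or_none(path) is None, skip the file. Note that a missing file is silently skipped as well; narrow the except clause if you want that to fail loudly.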
囚心锁ツ

2021-01-12 02:28
I had a few hundred thousand gzip files, only a few of which had a zero-sized payload, mounted on a network share. I was forced to use the following optimization. It is brittle, but in the (very frequent) case in which you have a large number of files generated using the same method, the total number of bytes in an empty file, excluding the name of the payload, is a constant.

Then you can check for a zero-sized payload by:
1. Computing that constant over one file. You can code it up, but I find it simpler to just use command-line gzip (and this whole answer is an ugly hack anyway).
2. Examining only the inode for the rest of the files, instead of opening each file, which can be orders of magnitude faster:
```
import os
from os.path import basename

# YMMV with len_minus_file_name
def is_gzip_empty(file_name, len_minus_file_name=23):
    # Compare the on-disk size, minus the file name length, with the known constant
    return os.stat(file_name).st_size - len(basename(file_name)) == len_minus_file_name
```
This could break in many ways. Caveat emptor. Only use it if other methods are not practical.
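If you would rather code up step 1 than use command-line gzip, a sketch along these lines may work. The names size_minus_name and calibrate are invented here, the demo files are written with Python's gzip module rather than the command-line tool, and the whole thing assumes every file was produced by the same method as the reference file.
```python
import gzip
import os
import tempfile
from os.path import basename


def size_minus_name(path):
    """On-disk size minus the length of the file's base name."""
    return os.stat(path).st_size - len(basename(path))


def calibrate(reference_empty_gz):
    """Compute the constant from one file that is known to have an empty payload."""
    return size_minus_name(reference_empty_gz)


def is_gzip_empty(path, constant):
    return size_minus_name(path) == constant


# Demo on throwaway files: two empty payloads and one non-empty payload.
with tempfile.TemporaryDirectory() as tmp:
    ref = os.path.join(tmp, "reference.csv.gz")
    other = os.path.join(tmp, "x.csv.gz")
    full = os.path.join(tmp, "data.csv.gz")
    for p, payload in ((ref, b""), (other, b""), (full, b"a,b,c\n1,2,3\n")):
        with gzip.open(p, "wb") as f:
            f.write(payload)
    constant = calibrate(ref)
    results = (is_gzip_empty(other, constant), is_gzip_empty(full, constant))
```
Calibrating from a real reference file avoids hard-coding a number such as 23, which only holds for one particular way of producing the files.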