How to tell if a file is gzip compressed?

挽巷 2021-01-03 19:21

I have a Python program which is going to take text files as input. However, some of these files may be gzip compressed.

Is there a cross-platform, usable from Python way to determine if a file is gzip compressed or not?

6 Answers
  • 2021-01-03 19:48

    The magic number for gzip compressed files is 1f 8b. Testing for it is not 100% reliable, but it is highly unlikely that an ordinary text file starts with those two bytes; in UTF-8 the sequence is not even legal, because 8b is a continuation byte with no lead byte in front of it.

    Usually gzip compressed files sport the suffix .gz though. Even gzip(1) itself won't unpack files without it unless you --force it to. You could conceivably use that, but you'd still have to deal with a possible IOError (which you have to in any case).

    One problem with your approach is that gzip.GzipFile() will not throw an exception if you feed it an uncompressed file; only a later read() will. This means you would probably have to implement some of your program logic twice. Ugly.
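The late failure described above can be reproduced with a short experiment. This is only a sketch: the temporary file and the two flag names are made up for illustration, and the behavior shown is what CPython's gzip module does when opened for reading.

```python
import gzip
import os
import tempfile

# Write a small uncompressed text file to experiment with.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"plain text, not gzip\n")
    plain_path = tmp.name

opened_without_error = False
read_failed = False
try:
    # Constructing the object only opens the file; no header is parsed yet.
    with gzip.GzipFile(plain_path, "rb") as fh:
        opened_without_error = True
        try:
            fh.read(1)  # the gzip header is checked on the first read
        except OSError:
            read_failed = True
finally:
    os.remove(plain_path)

print(opened_without_error, read_failed)
```

Both flags end up set, so any code that relies on the exception has to wrap the first read, not the open call.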

  • 2021-01-03 19:52

    The mimetypes approach doesn't seem to work well in Python 3:

    import mimetypes
    filename = "./datasets/test"
    
    def file_type(filename):
        # guess_type() returns a (type, encoding) tuple
        return mimetypes.guess_type(filename)

    print(file_type(filename))
    

    This returns (None, None), but the Unix file command recognizes the same file:

    :~> file datasets/test
    datasets/test: gzip compressed data, was "iostat_collection", from Unix, last modified: Thu Jan 29 07:09:34 2015

  • 2021-01-03 19:55

    Import the mimetypes module. It can guess what kind of file you have, and whether it is compressed, from the file name. It looks only at the name and never reads the file's contents.

    For example:

    mimetypes.guess_type('blabla.txt.gz')
    

    returns:

    ('text/plain', 'gzip')
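Because the guess is derived from the name, the result changes with the suffix and not with the data. A minimal sketch, reusing the two file names that appear in this thread (neither file needs to exist, since nothing is opened):

```python
import mimetypes

# A name with recognizable suffixes yields a (type, encoding) pair.
named = mimetypes.guess_type("blabla.txt.gz")

# A suffix-less name gives the module nothing to go on, whatever the file holds.
unnamed = mimetypes.guess_type("./datasets/test")

print(named)
print(unnamed)
```

This is why the approach works for files that keep their .gz extension and returns (None, None) for a gzip file that was renamed without one.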

  • 2021-01-03 19:56

    gzip itself will raise an OSError if it's not a gzipped file.

    >>> with gzip.open('README.md', 'rb') as f:
    ...     f.read()
    ...
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
      File "/Users/dennis/.asdf/installs/python/3.6.6/lib/python3.6/gzip.py", line 276, in read
        return self._buffer.read(size)
      File "/Users/dennis/.asdf/installs/python/3.6.6/lib/python3.6/gzip.py", line 463, in read
        if not self._read_gzip_header():
      File "/Users/dennis/.asdf/installs/python/3.6.6/lib/python3.6/gzip.py", line 411, in _read_gzip_header
        raise OSError('Not a gzipped file (%r)' % magic)
    OSError: Not a gzipped file (b'# ')
    

    You can combine this approach with others to increase confidence, such as checking the MIME type, looking for the magic number in the file header (see the other answers for an example), and checking the extension.

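    import gzip
    import pathlib

    if '.gz' in pathlib.Path(filepath).suffixes:
        # some more inexpensive checks until confident we can attempt to decompress
        # ...
        try:
            with gzip.open(filepath, 'rb') as f:
                data = f.read()
        except OSError as e:
            # not a gzipped file after all; handle or report it here
            ...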
    
  • 2021-01-03 20:03

    As of Python 3.7, this works:

    import gzip
    with gzip.open(input_file, 'r') as fh:
        try:
            fh.read(1)
        except OSError:
            print('input_file is not a valid gzip file by OSError')
    

    As of Python 3.8, this also works (gzip.BadGzipFile is a subclass of OSError):

    import gzip
    with gzip.open(input_file, 'r') as fh:
        try:
            fh.read(1)
        except gzip.BadGzipFile:
            print('input_file is not a valid gzip file by BadGzipFile')
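The two snippets above can be folded into one reusable check. The sketch below is an assumption-laden illustration, not part of the original answer: the name looks_like_gzip is hypothetical, and catching EOFError and zlib.error in addition to OSError is an extra precaution for truncated or corrupt input.

```python
import gzip
import zlib

def looks_like_gzip(path):
    """Return True if the start of `path` decompresses as gzip (hypothetical helper)."""
    # Open the raw file outside the try block so that a missing or unreadable
    # file still raises instead of being reported as "not gzip".
    with open(path, "rb") as raw:
        try:
            with gzip.GzipFile(fileobj=raw) as fh:
                fh.read(1)  # forces the header to be parsed
        except (OSError, EOFError, zlib.error):
            # OSError also covers gzip.BadGzipFile on Python 3.8+.
            return False
    return True
```

One caveat: a zero-byte file appears to be read as empty data without an error, so this sketch would report it as gzip. Checking the magic number first avoids that.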
    
  • 2021-01-03 20:07

    "Is there a cross-platform, usable from Python way to determine if a file is gzip compressed or not?"

    The accepted answer got me 90% of the way to a pretty reliable solution (test whether the first two bytes are 1f 8b), but did not show how to actually do this in Python. Here is one possible way:

    def is_gz_file(filepath):
        with open(filepath, 'rb') as test_f:
            return test_f.read(2) == b'\x1f\x8b'
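For the original use case (text files that may or may not be compressed), the same two-byte test can pick the right opener. This is a rough sketch; open_text and GZIP_MAGIC are names invented here, and the UTF-8 default is an assumption about the input.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"

def open_text(filepath, encoding="utf-8"):
    """Open a plain or gzip-compressed text file for reading (hypothetical helper)."""
    # Peek at the first two bytes, then reopen with the matching opener.
    with open(filepath, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    if is_gzip:
        # "rt" makes gzip.open decode bytes to str, like the built-in open().
        return gzip.open(filepath, "rt", encoding=encoding)
    return open(filepath, "r", encoding=encoding)
```

Callers can then write `with open_text(path) as fh:` and iterate over lines without caring which kind of file they were given. A read can still fail later if the file merely starts with 1f 8b, so the OSError handling mentioned in the other answers remains useful.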
    