Extract zlib-compressed data from a binary file in Python

失恋的感觉 2020-12-17 02:12

My company uses a legacy file format for electromyography data, which is no longer in production. However, there is some interest in maintaining backward compatibility, so I am

2 answers
  • 2020-12-17 02:57

    The zlib format is a thin wrapper around data compressed with the DEFLATE algorithm and is defined in RFC 1950:

      A zlib stream has the following structure:
    
           0   1
         +---+---+
         |CMF|FLG|   (more-->)
         +---+---+
    
      (if FLG.FDICT set)
    
           0   1   2   3
         +---+---+---+---+
         |     DICTID    |   (more-->)
         +---+---+---+---+
    
         +=====================+---+---+---+---+
         |...compressed data...|    ADLER32    |
         +=====================+---+---+---+---+
    

    So it adds two bytes (six if a preset dictionary ID is present) before the raw DEFLATE compressed data and a four-byte ADLER32 checksum after it.

    The first byte is CMF (Compression Method and Flags). Its low-order four bits (bits 0 to 3) hold CM (compression method) and its high-order four bits (bits 4 to 7) hold CINFO (compression info).

    Unfortunately, this means that even the first two bytes of a zlib stream can vary a lot depending on the compression method and settings that were used.
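As a minimal sketch of how the two header bytes break down, the helper below splits them into the fields named in RFC 1950. The function name `parse_zlib_header` is made up for this illustration; the bit layout and the rule that CMF*256 + FLG must be a multiple of 31 come from the RFC.

```python
def parse_zlib_header(cmf, flg):
    """Split the two zlib header bytes into their RFC 1950 fields.

    Hypothetical helper for illustration, not part of the zlib module.
    """
    return {
        'CM': cmf & 0x0F,          # compression method, 8 means DEFLATE
        'CINFO': cmf >> 4,         # log2(window size) - 8 when CM is 8
        'FCHECK': flg & 0x1F,      # makes CMF*256 + FLG a multiple of 31
        'FDICT': (flg >> 5) & 1,   # 1 if a four-byte DICTID follows
        'FLEVEL': flg >> 6,        # hint about the compression level used
        'check_ok': (cmf * 256 + flg) % 31 == 0,
    }


fields = parse_zlib_header(0x78, 0x9C)
print(fields)
```

For the pair 78 9c this reports DEFLATE with a 32 KiB window, no preset dictionary, and a passing header check.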

    Luckily, I stumbled upon a post by Mark Adler, the author of the ADLER32 algorithm, where he lists the most common and less common combinations of those two starting bytes.
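The candidate byte pairs can also be derived from the header rules in RFC 1950 instead of being copied from a list. The sketch below assumes CM = 8 (DEFLATE) and no preset dictionary, and the name `candidate_zlib_markers` is hypothetical. It yields every pair that is valid under those assumptions, including pairs that are rarely or never seen in real files.

```python
def candidate_zlib_markers():
    """Return all two-byte zlib headers with CM = 8 and FDICT = 0.

    Hypothetical helper: the pairs are valid per RFC 1950, which does not
    mean that every one of them is produced by real compressors.
    """
    markers = []
    for cinfo in range(8):                 # window sizes from 256 B to 32 KiB
        cmf = (cinfo << 4) | 8             # CM = 8 in the low four bits
        for flevel in range(4):
            flg = flevel << 6              # FDICT bit left at 0
            # Choose FCHECK so that CMF*256 + FLG is a multiple of 31.
            flg += (31 - (cmf * 256 + flg) % 31) % 31
            markers.append(bytes([cmf, flg]))
    return markers


markers = candidate_zlib_markers()
print(len(markers), [m.hex() for m in markers if m[0] == 0x78])
```

A longer marker list finds more streams but also produces more false positives, so each hit still has to be confirmed by actually decompressing from that offset.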

    With that out of the way, let's look at how we can use Python to examine zlib:

    >>> import zlib
    >>> msg = b'foo'
    >>> [hex(b) for b in zlib.compress(msg)]
    ['0x78', '0x9c', '0x4b', '0xcb', '0xcf', '0x7', '0x0', '0x2', '0x82', '0x1', '0x45']
    

    So the zlib data created by Python's zlib module (using default options) starts with 78 9c. We'll use that to create a script that writes a custom file format containing a preamble, some zlib compressed data and a footer.

    We then write a second script that scans a file for that two byte pattern, starts decompressing everything that follows as a zlib stream and figures out where the stream ends and the footer starts.

    create.py

    import zlib
    
    # zlib works on bytes, so the message, header and footer are bytes literals.
    msg = b'foo'
    filename = 'foo.compressed'
    
    compressed_msg = zlib.compress(msg)
    data = b'HEADER' + compressed_msg + b'FOOTER'
    
    with open(filename, 'wb') as outfile:
        outfile.write(data)
    

    Here we take msg, compress it with zlib, and surround it with a header and footer before we write it out to a file.

    Header and footer are of fixed length in this example, but they could of course have arbitrary, unknown lengths.

    Now for the script that tries to find a zlib stream in such a file. Because we know exactly which marker to expect in this example, I'm using only one, but the list ZLIB_MARKERS could be filled with all the markers from the post mentioned above.

    ident.py

    import zlib
    
    # Two-byte zlib headers to look for; 78 9c is what zlib writes by default.
    ZLIB_MARKERS = [b'\x78\x9c']
    filename = 'foo.compressed'
    
    # Read in binary mode so that no byte is translated or decoded.
    with open(filename, 'rb') as infile:
        data = infile.read()
    
    pos = 0
    found = False
    
    while not found:
        window = data[pos:pos+2]
        for marker in ZLIB_MARKERS:
            if window == marker:
                found = True
                start = pos
                print("Start of zlib stream found at byte %s" % pos)
                break
        if pos == len(data):
            break
        pos += 1
    
    if found:
        header = data[:start]
    
        rest_of_data = data[start:]
        decomp_obj = zlib.decompressobj()
        uncompressed_msg = decomp_obj.decompress(rest_of_data)
    
        # Whatever follows the end of the zlib stream ends up in unused_data.
        footer = decomp_obj.unused_data
    
        # Decode only for display; latin-1 maps every byte to one character.
        print("Header: %s" % header.decode('latin-1'))
        print("Message: %s" % uncompressed_msg.decode('latin-1'))
        print("Footer: %s" % footer.decode('latin-1'))
    
    if not found:
        print("Sorry, no zlib streams starting with any of the markers found.")
    

    The idea is this:

    • Start at the beginning of the file and create a two byte search window.

    • Move the search window forward in one-byte increments.

    • For every window check if it matches any of the two byte markers we defined.

    • If a match is found, record the starting position, stop searching and try to decompress everything that follows.
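The search steps above can also be sketched with `bytes.find`, which does the byte-by-byte scan internally. The function name `find_first_marker` is made up for this example. A hit only means that the two bytes look like a zlib header, so it may still be a false positive.

```python
def find_first_marker(data, markers):
    """Return the lowest offset at which any marker occurs, or -1 if none does.

    Hypothetical helper for illustration.
    """
    hits = [data.find(marker) for marker in markers]
    hits = [hit for hit in hits if hit != -1]
    return min(hits) if hits else -1


sample = b'HEADER' + b'\x78\x9c' + b'payload bytes'
offset = find_first_marker(sample, [b'\x78\x9c', b'\x78\xda'])
print(offset)
```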

    Now, finding the end of the stream isn't as trivial as looking for two marker bytes. zlib streams are neither terminated by a fixed byte sequence nor is their length indicated in any of the header fields. Instead, the DEFLATE data is self-terminating: its last block is flagged as final and ends with an end-of-block code, and the four-byte ADLER32 checksum of the uncompressed data follows immediately after.

    The way it works is that the internal C function inflate() decompresses the stream as it reads it. When it reaches the end of the final block, it verifies the checksum and signals the end of the stream to its caller, indicating that the rest of the data isn't part of the zlib stream anymore.

    In Python this behavior is exposed when using decompression objects instead of simply calling zlib.decompress(). Calling decompress(string) on a Decompress object will decompress a zlib stream in string and return the decompressed data that was part of the stream. Everything that follows the stream will be stored in unused_data and can be retrieved afterwards.
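A minimal sketch of that behavior, assuming Python 3 and a known start offset. The function name `split_stream` is hypothetical; `decompressobj()`, `unused_data` and `eof` are attributes of the standard zlib module.

```python
import zlib


def split_stream(data, start):
    """Split data into (header, payload, footer) around the zlib stream at start.

    Hypothetical helper for illustration.
    """
    decomp_obj = zlib.decompressobj()
    payload = decomp_obj.decompress(data[start:])
    if not decomp_obj.eof:
        # The input ran out before the end of the stream was reached.
        raise ValueError('zlib stream is truncated')
    return data[:start], payload, decomp_obj.unused_data


blob = b'HEADER' + zlib.compress(b'foo') + b'FOOTER'
header, payload, footer = split_stream(blob, 6)
print(header, payload, footer)
```

The end offset of the stream follows from the same call: it is `len(data) - len(decomp_obj.unused_data)`.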

    This should produce the following output on a file created with the first script:

    Start of zlib stream found at byte 6
    Header: HEADER
    Message: foo
    Footer: FOOTER
    

    The example can easily be modified to write the uncompressed message to a file instead of printing it. Then you can further analyze the formerly zlib compressed data, and try to identify known fields in the metadata in the header and footer you separated out.

  • 2020-12-17 03:13

    To start, why not scan the files for all valid zlib streams (this is good enough for small files and for figuring out the format):

    import zlib
    from glob import glob
    
    def zipstreams(filename):
        """Return all zlib streams and their positions in file."""
        with open(filename, 'rb') as fh:
            data = fh.read()
        i = 0
        while i < len(data):
            try:
                # Try to decompress from offset i; zlib.error means no stream starts here.
                zo = zlib.decompressobj()
                yield i, zo.decompress(data[i:])
                # Skip past the bytes that the stream consumed.
                i += len(data[i:]) - len(zo.unused_data)
            except zlib.error:
                i += 1
    
    for filename in glob('*.mio'):
        print(filename)
        for i, data in zipstreams(filename):
            print(i, len(data))
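One caveat with a scan like this is that `decompress()` raises no error when the input simply runs out before the stream ends. A stricter variant, sketched here under the assumption of Python 3.3 or later (for the `eof` attribute) and with the hypothetical name `complete_zlib_streams`, reports only streams that decompress completely and also returns where each one ends.

```python
import zlib


def complete_zlib_streams(data):
    """Yield (offset, payload, end_offset) for each complete zlib stream in data.

    Hypothetical helper for illustration.
    """
    i = 0
    while i < len(data):
        zo = zlib.decompressobj()
        try:
            payload = zo.decompress(data[i:])
        except zlib.error:
            i += 1              # no valid stream starts at this offset
            continue
        if not zo.eof:
            i += 1              # plausible start, but the stream never finished
            continue
        end = len(data) - len(zo.unused_data)
        yield i, payload, end
        i = end                 # continue right after the stream


first = zlib.compress(b'foo')
second = zlib.compress(b'barbaz')
blob = b'AB' + first + b'CD' + second + b'E'
found = list(complete_zlib_streams(blob))
print([(offset, payload) for offset, payload, end in found])
```

Like the original, this re-slices the data at every offset, so it is only suitable for small files.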
    

    Looks like the data streams contain little-endian double precision floating point data:

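    import numpy
    from matplotlib import pyplot
    
    for filename in glob('*.mio'):
        for i, data in zipstreams(filename):
            if data:
                # Interpret the bytes as little-endian 64-bit floats.
                # frombuffer replaces the deprecated binary mode of fromstring.
                a = numpy.frombuffer(data, '<f8')
                pyplot.plot(a[1:])
                pyplot.title(filename + ' - %i' % i)
                pyplot.show()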
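To check the little-endian double interpretation without NumPy, the standard `struct` module can decode the same bytes. This is only a sketch: the name `decode_doubles` is made up, and dropping trailing bytes that do not fill a whole 8-byte value is an assumption of this example, not something the file format is known to require.

```python
import struct


def decode_doubles(payload):
    """Decode payload as consecutive little-endian 64-bit floats.

    Hypothetical helper; trailing bytes that do not fill a whole value are ignored.
    """
    usable = len(payload) - len(payload) % 8
    return [value for (value,) in struct.iter_unpack('<d', payload[:usable])]


raw = struct.pack('<3d', 1.0, 2.5, -3.0)
values = decode_doubles(raw)
print(values)
```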
    