My company uses a legacy file format for electromyography data, which is no longer in production. However, there is some interest in maintaining retro-compatibility, so I am
zlib is a thin wrapper around data compressed with the DEFLATE algorithm and is defined in RFC1950:
A zlib stream has the following structure:

      0   1
    +---+---+
    |CMF|FLG|   (more-->)
    +---+---+

    (if FLG.FDICT set)

      0   1   2   3
    +---+---+---+---+
    |     DICTID    |   (more-->)
    +---+---+---+---+

    +=====================+---+---+---+---+
    |...compressed data...|    ADLER32    |
    +=====================+---+---+---+---+
So it adds at least two, possibly six bytes before and 4 bytes with an ADLER32 checksum after the raw DEFLATE compressed data.
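To make that framing concrete, here is a small sketch that takes apart the bytes Python produces for `foo`: it strips the two header bytes and the four trailer bytes, inflates what is left as raw DEFLATE, and compares the trailer with `zlib.adler32()`. It assumes no preset dictionary (FDICT not set), so the header is exactly two bytes; the variable names are my own.

```python
import zlib

stream = zlib.compress(b'foo')

# Layout without a preset dictionary: 2 header bytes, raw DEFLATE, 4 trailer bytes.
header, raw_deflate, trailer = stream[:2], stream[2:-4], stream[-4:]

# A negative wbits value tells zlib to expect raw DEFLATE without header or trailer.
message = zlib.decompress(raw_deflate, -15)

# The trailer is the Adler-32 of the *uncompressed* data, stored big-endian.
stored_checksum = int.from_bytes(trailer, 'big')
computed_checksum = zlib.adler32(message)

print(header.hex(), message, hex(stored_checksum), stored_checksum == computed_checksum)
```

Note that the checksum covers the uncompressed message, which is why it can only be verified after decompressing.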
The first byte contains the CMF (Compression Method and Flags), which is split into CM (Compression Method, the low 4 bits) and CINFO (Compression Info, the high 4 bits). The second byte, FLG, carries a 5-bit check value (FCHECK), the FDICT flag, and a 2-bit compression level hint (FLEVEL).
From this it's quite clear that, unfortunately, even the first two bytes of a zlib stream can vary a lot depending on which compression method and settings were used.
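As an illustration, the two header bytes can be split into their fields with a few bit operations. This is only a sketch based on the field layout in RFC 1950; the function name `parse_zlib_header` is made up for this example. The RFC also requires CMF*256 + FLG to be a multiple of 31, which gives a cheap plausibility check.

```python
def parse_zlib_header(two_bytes):
    """Split the first two bytes of a zlib stream into their RFC 1950 fields."""
    cmf, flg = two_bytes[0], two_bytes[1]
    return {
        'CM': cmf & 0x0F,          # bits 0-3: compression method (8 = DEFLATE)
        'CINFO': cmf >> 4,         # bits 4-7: log2(window size) - 8
        'FCHECK': flg & 0x1F,      # bits 0-4: check bits
        'FDICT': (flg >> 5) & 1,   # bit 5: a preset dictionary ID follows
        'FLEVEL': flg >> 6,        # bits 6-7: compression level hint
        'valid': (cmf * 256 + flg) % 31 == 0,
    }

fields = parse_zlib_header(b'\x78\x9c')
print(fields)
```

For `78 9c` this gives CM = 8 (DEFLATE), CINFO = 7 (a 32 KiB window), FLEVEL = 2 and no preset dictionary.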
Luckily, I stumbled upon a post by Mark Adler, the author of the ADLER32 algorithm, where he lists the most common and less common combinations of those two starting bytes.
With that out of the way, let's look at how we can use Python to examine zlib:
>>> import zlib
>>> msg = b'foo'
>>> [hex(b) for b in zlib.compress(msg)]
['0x78', '0x9c', '0x4b', '0xcb', '0xcf', '0x7', '0x0', '0x2', '0x82', '0x1', '0x45']
So the zlib data created by Python's zlib module (using default options) starts with 78 9c. We'll use that to create a script that writes a custom file format containing a preamble, some zlib compressed data and a footer.
We then write a second script that scans a file for that two byte pattern, starts decompressing everything that follows as a zlib stream and figures out where the stream ends and the footer starts.
create.py
import zlib

msg = b'foo'
filename = 'foo.compressed'

# Compress the message and wrap it in a fixed header and footer.
compressed_msg = zlib.compress(msg)
data = b'HEADER' + compressed_msg + b'FOOTER'

with open(filename, 'wb') as outfile:
    outfile.write(data)
Here we take msg, compress it with zlib, and surround it with a header and footer before we write it out to a file.
Header and footer are of fixed length in this example, but they could of course have arbitrary, unknown lengths.
Now for the script that tries to find a zlib stream in such a file. Because for this example we know exactly what marker to expect, I'm using only one, but obviously the list ZLIB_MARKERS could be filled with all the markers from the post mentioned above.
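If the post is not at hand, a candidate list can also be derived from the header rules themselves. The sketch below is my own derivation from RFC 1950, not a copy of the list in that post: it enumerates every DEFLATE header without a preset dictionary and fills in FCHECK so that CMF*256 + FLG is a multiple of 31. The helper name `zlib_markers` is hypothetical.

```python
def zlib_markers(max_cinfo=7):
    """Enumerate every valid two-byte zlib header without a preset dictionary."""
    markers = []
    for cinfo in range(max_cinfo + 1):
        cmf = (cinfo << 4) | 8            # CM = 8 means DEFLATE
        for flevel in range(4):
            flg = flevel << 6             # FDICT = 0, FCHECK still empty
            # FCHECK makes the 16-bit value CMF*256 + FLG a multiple of 31.
            flg |= (31 - (cmf * 256 + flg) % 31) % 31
            markers.append(bytes([cmf, flg]))
    return markers

ZLIB_MARKERS = zlib_markers()
print([m.hex() for m in ZLIB_MARKERS if m[0] == 0x78])
# prints ['7801', '785e', '789c', '78da']
```

The four headers starting with 78 correspond to a 32 KiB window at the four FLEVEL settings. A longer marker list makes chance matches more likely, so each hit should still be confirmed by actually decompressing.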
ident.py
import zlib

# Two-byte zlib headers to look for; 78 9c is what Python's zlib writes by default.
ZLIB_MARKERS = [b'\x78\x9c']
filename = 'foo.compressed'

# Read in binary mode so the bytes come back exactly as stored.
with open(filename, 'rb') as infile:
    data = infile.read()

# Slide a two-byte window over the data, one byte at a time.
found = False
for pos in range(len(data) - 1):
    if data[pos:pos + 2] in ZLIB_MARKERS:
        found = True
        start = pos
        print("Start of zlib stream found at byte %s" % pos)
        break

if found:
    header = data[:start]
    rest_of_data = data[start:]
    # The decompression object stops at the end of the stream and keeps
    # everything after it in unused_data.
    decomp_obj = zlib.decompressobj()
    uncompressed_msg = decomp_obj.decompress(rest_of_data)
    footer = decomp_obj.unused_data
    print("Header: %s" % header.decode('latin-1'))
    print("Message: %s" % uncompressed_msg.decode('latin-1'))
    print("Footer: %s" % footer.decode('latin-1'))
else:
    print("Sorry, no zlib streams starting with any of the markers found.")
The idea is this:
Start at the beginning of the file and create a two byte search window.
Move the search window forward in one-byte increments.
For every window check if it matches any of the two byte markers we defined.
If a match is found, record the starting position, stop searching and try to decompress everything that follows.
Now, finding the end of the stream isn't as trivial as looking for two marker bytes. zlib streams are neither terminated by a fixed byte sequence, nor is their length indicated in any of the header fields. Instead, the end is only discovered by decoding: the last DEFLATE block is flagged as final and closes with an end-of-block code, and it is followed by a four-byte ADLER32 checksum that must match the uncompressed data up to that point.
The way it works is that the internal C function inflate() decompresses the stream as it reads it and, once it has decoded the final block and verified the checksum, signals the end of the stream to its caller, indicating that the rest of the data isn't part of the zlib stream anymore.
In Python this behavior is exposed when using decompression objects instead of simply calling zlib.decompress(). Calling decompress(string) on a Decompress object will decompress a zlib stream in string and return the decompressed data that was part of the stream. Everything that follows the stream will be stored in unused_data and can be retrieved afterwards.
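A minimal sketch of that behavior, assuming Python 3, where decompression objects also have an `eof` attribute that tells a finished stream apart from a truncated one:

```python
import zlib

stream = zlib.compress(b'foo')
blob = stream + b'FOOTER'

decomp_obj = zlib.decompressobj()
message = decomp_obj.decompress(blob)
footer = decomp_obj.unused_data            # everything after the stream
stream_length = len(blob) - len(footer)    # bytes the zlib stream occupied
print(message, footer, decomp_obj.eof, stream_length)

# A stream that is cut short raises no error here, but eof stays False.
truncated = zlib.decompressobj()
truncated.decompress(stream[:5])
print(truncated.eof)
```

The length of the compressed stream is never stored anywhere; it falls out as the input length minus the length of `unused_data`.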
Running ident.py on a file created with the first script should produce the following output:
Start of zlib stream found at byte 6
Header: HEADER
Message: foo
Footer: FOOTER
The example can easily be modified to write the uncompressed message to a file instead of printing it. Then you can further analyze the formerly zlib compressed data, and try to identify known fields in the metadata in the header and footer you separated out.
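One caveat with real files is that the marker bytes can occur by chance inside the header. A possible variation, sketched below under that assumption, tries every marker position in turn and only accepts one where decompression runs to the end of a stream. The function name `split_container` and the sample data are invented for this example.

```python
import zlib

COMMON_MARKERS = (b'\x78\x01', b'\x78\x5e', b'\x78\x9c', b'\x78\xda')

def split_container(data, markers=COMMON_MARKERS):
    """Return (header, message, footer) for the first complete zlib stream, or None."""
    for pos in range(len(data) - 1):
        if data[pos:pos + 2] not in markers:
            continue
        decomp_obj = zlib.decompressobj()
        try:
            message = decomp_obj.decompress(data[pos:])
        except zlib.error:
            continue              # the marker bytes occurred by chance
        if decomp_obj.eof:        # only accept streams that ran to completion
            return data[:pos], message, decomp_obj.unused_data
    return None

# The header deliberately contains a fake 78 9c marker.
container = b'HE\x78\x9cADER' + zlib.compress(b'foo') + b'FOOTER'
header, message, footer = split_container(container)
print(header, message, footer)
```

The returned `message` is a bytes object, so writing it out is just a matter of opening a file in `'wb'` mode and calling `write(message)`.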
To start, why not scan the files for all valid zlib streams (it's good enough for small files and to figure out the format):
import zlib
from glob import glob

def zipstreams(filename):
    """Return all zlib streams and their positions in file."""
    with open(filename, 'rb') as fh:
        data = fh.read()
    i = 0
    while i < len(data):
        try:
            zo = zlib.decompressobj()
            yield i, zo.decompress(data[i:])
            # Skip past the bytes the stream consumed.
            i += len(data[i:]) - len(zo.unused_data)
        except zlib.error:
            # No valid stream starts here; try the next byte.
            i += 1

for filename in glob('*.mio'):
    print(filename)
    for i, data in zipstreams(filename):
        print(i, len(data))
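One thing to watch for with this kind of scan: a stream that starts validly but is cut off does not raise `zlib.error`, so it would be reported like a complete one. A possible refinement, sketched here on in-memory bytes rather than on the .mio files, is to check the decompression object's `eof` attribute (Python 3) before accepting a hit. The name `complete_zlib_streams` and the test data are made up for illustration.

```python
import zlib

def complete_zlib_streams(data):
    """Yield (offset, payload, compressed_length) for each complete zlib stream."""
    i = 0
    while i < len(data) - 1:
        zo = zlib.decompressobj()
        try:
            payload = zo.decompress(data[i:])
        except zlib.error:
            i += 1                # nothing decodable starts here
            continue
        if not zo.eof:
            i += 1                # a stream started but never finished
            continue
        consumed = len(data) - i - len(zo.unused_data)
        yield i, payload, consumed
        i += consumed             # continue right after this stream

first = zlib.compress(b'first')
blob = b'junk' + first + b'gap' + zlib.compress(b'second')
found = list(complete_zlib_streams(blob))
print([(offset, payload) for offset, payload, _ in found])
```

Like the original, this copies `data[i:]` at every offset, so it is only meant for small files.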
Looks like the data streams contain little-endian double precision floating point data:
import numpy
from matplotlib import pyplot

for filename in glob('*.mio'):
    for i, data in zipstreams(filename):
        if data:
            # frombuffer replaces the deprecated binary mode of numpy.fromstring.
            a = numpy.frombuffer(data, '<f8')
            pyplot.plot(a[1:])
            pyplot.title(filename + ' - %i' % i)
            pyplot.show()
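For a quick look at the values without NumPy, the same interpretation can be sketched with the standard library's `struct` module. This assumes the payload really is a plain run of little-endian 8-byte floats, which is only a guess from how the plots look; the helper name `as_doubles` is hypothetical, and any trailing bytes that do not fill a whole value are ignored.

```python
import struct

def as_doubles(data):
    """Interpret bytes as little-endian float64 values, ignoring a trailing partial value."""
    usable = len(data) - len(data) % 8
    return [value for (value,) in struct.iter_unpack('<d', data[:usable])]

# Three packed doubles followed by two stray bytes.
sample = struct.pack('<3d', 1.0, 2.5, -3.0) + b'\x00\x01'
print(as_doubles(sample))
# prints [1.0, 2.5, -3.0]
```

If the decoded numbers come out as wildly varying magnitudes or NaNs, the data type or byte order guess is probably wrong.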