I'm writing a Python program to extract data from the middle of a 6 GB bz2 file. A bzip2 file is made up of independently decompressible blocks of data, so I only need to find
How come crc32("\x00") is not 0x00000000?
The basic CRC algorithm is to treat the input message as a polynomial over GF(2), divide it by the fixed CRC polynomial, and use the remainder as the resulting hash.
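As a rough illustration of that unmodified scheme, here is a minimal sketch of bitwise polynomial division. The function name `plain_crc_remainder` is made up for this example, and the default polynomial 0x04C11DB7 is the usual CRC-32 generator written most-significant-bit first (its leading x^32 term is implicit). Under these assumptions an all-zero message leaves a zero remainder, which is exactly the result the question expected.

```python
def plain_crc_remainder(data: bytes, poly: int = 0x04C11DB7, width: int = 32) -> int:
    """Remainder of (message * x^width) divided by the generator over GF(2).

    Plain division only: no bit reflection, no initial value, no final XOR.
    """
    top_bit = 1 << (width - 1)
    mask = (1 << width) - 1
    reg = 0
    for byte in data:
        # Bring the next 8 message bits into the top of the register.
        reg ^= byte << (width - 8)
        for _ in range(8):
            if reg & top_bit:
                # Leading coefficient is 1: subtract (XOR) the generator.
                reg = ((reg << 1) ^ poly) & mask
            else:
                reg = (reg << 1) & mask
    return reg


print(hex(plain_crc_remainder(b"\x00")))
print(hex(plain_crc_remainder(b"\x01")))
```

The second call shows a message consisting of a single 1 bit: x^32 reduced by the generator is the generator without its leading term.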
CRC-32 makes a number of modifications to the basic algorithm:
Let's work out the CRC-32 of the one-byte string 0x00:
And there you have it: The CRC-32 of 0x00 is 0xD202EF8D.
(You should verify this.)
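One way to verify it is a bit-at-a-time sketch checked against `zlib.crc32`. It assumes the commonly documented CRC-32 parameters: bits processed least-significant first (so the polynomial appears in its reflected form 0xEDB88320), a register that starts at 0xFFFFFFFF, and a final XOR with 0xFFFFFFFF. The name `crc32_bitwise` is invented for this example.

```python
import zlib


def crc32_bitwise(data: bytes) -> int:
    """Bit-at-a-time CRC-32 using the reflected polynomial 0xEDB88320."""
    crc = 0xFFFFFFFF                      # register starts with all bits set
    for byte in data:
        crc ^= byte                       # fold the next byte into the low bits
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF               # final inversion


print(hex(crc32_bitwise(b"\x00")))        # 0xd202ef8d
print(crc32_bitwise(b"\x00") == zlib.crc32(b"\x00"))
```

The initial all-ones register is what keeps a zero byte from hashing to zero: the division starts from a non-zero state, so the input bits never get the chance to produce an all-zero remainder on their own.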
In addition to the one-shot decompress function, the bz2 module also contains a class BZ2Decompressor that decompresses data as it is fed to the decompress method. It therefore does not care about the end-of-file checksum and provides the data needed once it reaches the end of the block.
To illustrate, assume I have located the block I wish to extract from the file and stored it in a bitarray.bitarray instance (other bit-twiddling modules will probably work as well). Then this function will decode it:
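from bz2 import BZ2Decompressor
from bitarray import bitarray

def bunzip2_block(block):
    # Prepend a bzip2 stream header ("BZ", "h" for Huffman coding,
    # "9" for the 900 kB block size) so the decompressor accepts the block.
    dummy_file = bitarray(endian="big")
    dummy_file.frombytes(b"BZh9")  # must be bytes, not str, on Python 3
    dummy_file += block
    decompressor = BZ2Decompressor()
    return decompressor.decompress(dummy_file.tobytes())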
Note that the frombytes and tobytes methods of bitarray were previously called fromstring and tostring.