Is there a way to check if a buffer is in Brotli compressed format?

问题

I'm an intern doing research into whether using Brotli compression in a piece of software provides a performance boost over the current release, which uses GZip.

My task is to change anything using GZip to use Brotli compression instead. One function I need to replace does a check to test if a buffer contains data that was compressed using GZip. It does this by checking the stream identifier at the beginning and end:

bool isGzipped() const
{
    // Gzip file signature (0x1f8b)
    return
        (_bufferEnd >= _bufferStart + 2) &&
        (static_cast<unsigned char>(_bufferStart[0]) == 0x1f) &&
        (static_cast<unsigned char>(_bufferStart[1]) == 0x8b);
}

I want to create similar function bool isBrotliEncoded(). I was wondering if there is a similar quick check that can can be done with Brotli encoded buffers? I've had a look at the byte values for some of the compressed files that brotli produces, but I can't find a rule that holds for all of them. Some start with 0x5B, some with 0x1B, compression of empty files results in 0x06, and files that have been compressed multiple times start with a range of different values. The end of each file is also inconsistent.

The only way I know of to test if it is in the correct format is to attempt decompression and wait for an error, which defeats the purpose of doing this test.

So my question is: Does anyone know how to check if a buffer has been compressed with Brotli without attempting decompression and waiting for failure?

回答1:

Unfortunately, the raw brotli format is not well suited to such detection, even when simply trying to decompress and waiting for an error.

I ran a trial of one million brotli decompressions of random data. About 5% of them checked out as good brotli streams. So you've already got a problem right there. 3.5% of the million are a single byte, since there are nine one-byte values that are each a valid brotli stream. The mean length of the random valid streams was almost a megabyte.

For those in which an error was detected (about 95% of the million cases), 3.5% went more than a megabyte before the error was detected. 1.4% went more than ten megabytes. The mean number of random bytes before finding an error was 309 KB. Another problem.

In short, the probability of a false positive is relatively high, and the number of bytes to process to find a negative can be quite large.

If you are writing this software, then you should put your own header before the brotli data to aid in detection. Or you can use the brotli framing format that I developed at their request, which has a unique four-byte header before the brotli compressed stream. That would reduce the probability of a false positive dramatically.

回答2:

Brotli is formally defined in RFC 7932. The format of the data stream is covered in Section 2: Compressed Representation Overview and Section 9: Compressed Data Format. Brotli does not employ leading/trailing identifiers like gzip does, but it does consist of a sequence of uncompressed headers and commands that describe the compressed data. They are not all aligned on byte boundaries, you have to parse them at the bit level instead (Brotli is processed as a stream of bits and bytes). Refer to Section 10: Decoding Algorithm for how to read these headers. If you parse out a few headers that follow the Brotli format without error then it is a good bet that you are dealing with a Brotli compressed buffer.

来源：https://stackoverflow.com/questions/39008957/is-there-a-way-to-check-if-a-buffer-is-in-brotli-compressed-format

标签

c++

compression

brotli