How to read from file containing multiple GzipStreams

Posted by 限于喜欢 on 2019-11-30 14:38:35

This is a problem with the way GZipStream handles gzip files with multiple gzip entries. It reads the first entry and treats all succeeding entries as garbage (interestingly, utilities like gzip and WinZip handle it correctly by extracting them all into one file). There are a couple of workarounds, or you can use a third-party library like DotNetZip (http://dotnetzip.codeplex.com/).

Perhaps the easiest is to scan the file for all of the gzip headers, then manually move the stream to each one and decompress the content. Candidate headers can be found by looking in the raw file bytes for ID1 (0x1F) and ID2 (0x8B) followed by 0x8 (the Deflate compression method; see the specification: http://www.gzip.org/zlib/rfc-gzip.html). Matching those three bytes isn't always enough to guarantee that you're looking at a gzip header, so you would want to read the rest of the header (or at least the first ten bytes) to verify:

    const int Id1 = 0x1F;
    const int Id2 = 0x8B;
    const int DeflateCompression = 0x8;
    const int GzipFooterLength = 8;
    const int MaxGzipFlag = 32; 

    /// <summary>
    /// Returns true if the stream could be a valid gzip header at the current position.
    /// </summary>
    /// <param name="stream">The stream to check.</param>
    /// <returns>Returns true if the stream could be a valid gzip header at the current position.</returns>
    public static bool IsHeaderCandidate(Stream stream)
    {
        // Read the first ten bytes of the stream
        byte[] header = new byte[10];

        int bytesRead = stream.Read(header, 0, header.Length);
        stream.Seek(-bytesRead, SeekOrigin.Current);

        if (bytesRead < header.Length)
        {
            return false;
        }

        // Check the id tokens and compression algorithm
        if (header[0] != Id1 || header[1] != Id2 || header[2] != DeflateCompression)
        {
            return false;
        }

        // Check the GZIP flags: only the low 5 bits are defined (2 pow. 5 = 32),
        // so any value of 32 or above has a reserved bit set and is invalid
        if (header[3] >= MaxGzipFlag)
        {
            return false;
        }

        // Check the extra flags (XFL): with the Deflate algorithm this is 0 (unset), 2 (maximum compression) or 4 (fastest)
        if (header[8] != 0x0 && header[8] != 0x2 && header[8] != 0x4)
        {
            return false;
        }

        return true;
    }

Note that GZipStream might move the stream to the end of the file if you use the file stream directly, because it reads its input in buffered chunks rather than stopping at the end of the entry. You may want to read each part into a MemoryStream and then decompress each part individually in memory.
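A minimal sketch of that per-part decompression, assuming the start offset of every entry is already known (for example from a header scan like the one above). The names SlicedGzipReader, DecompressSlice, DecompressAll and SliceDemo are hypothetical, and the demo builds its own two-entry input so that the offsets are known exactly rather than guessed:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class SlicedGzipReader
{
    // Decompresses the single gzip entry occupying data[start .. end).
    public static byte[] DecompressSlice(byte[] data, int start, int end)
    {
        using (var part = new MemoryStream(data, start, end - start, false))
        using (var gz = new GZipStream(part, CompressionMode.Decompress))
        using (var result = new MemoryStream())
        {
            gz.CopyTo(result);
            return result.ToArray();
        }
    }

    // offsets: start of each entry in ascending order (assumed to be correct).
    public static byte[] DecompressAll(byte[] data, IList<int> offsets)
    {
        using (var output = new MemoryStream())
        {
            for (int i = 0; i < offsets.Count; i++)
            {
                // An entry ends where the next one starts, or at the end of the data.
                int end = i + 1 < offsets.Count ? offsets[i + 1] : data.Length;
                byte[] chunk = DecompressSlice(data, offsets[i], end);
                output.Write(chunk, 0, chunk.Length);
            }
            return output.ToArray();
        }
    }
}

public static class SliceDemo
{
    // Compresses a string into one standalone gzip entry.
    private static byte[] Gzip(string text)
    {
        using (var buffer = new MemoryStream())
        {
            using (var gz = new GZipStream(buffer, CompressionMode.Compress))
            {
                byte[] raw = Encoding.UTF8.GetBytes(text);
                gz.Write(raw, 0, raw.Length);
            }
            return buffer.ToArray(); // ToArray still works after the stream is closed
        }
    }

    public static string Run()
    {
        byte[] first = Gzip("alpha ");
        byte[] second = Gzip("beta");
        byte[] file = new byte[first.Length + second.Length];
        Buffer.BlockCopy(first, 0, file, 0, first.Length);
        Buffer.BlockCopy(second, 0, file, first.Length, second.Length);

        var offsets = new List<int> { 0, first.Length };
        return Encoding.UTF8.GetString(SlicedGzipReader.DecompressAll(file, offsets));
    }

    public static void Main()
    {
        Console.WriteLine(Run());
    }
}
```

Because each slice holds exactly one entry, it does not matter how far GZipStream reads ahead inside the MemoryStream. The sketch is only as reliable as the offsets it is given.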

An alternate approach would be to modify the gzip headers to specify the length of the content so that you don't have to scan the file for headers (you could programmatically determine the offset of each), which would require diving a bit deeper into the gzip spec.

This is a bug in GZipStream. Per RFC 1952, the specification for the gzip format:

2.2. File format

A gzip file consists of a series of "members" (compressed data sets). The format of each member is specified in the following section. The members simply appear one after another in the file, with no additional information before, between, or after them.

So a compliant decompressor is required to look for another gzip member immediately after the previous gzip member.

You should be able to simply have a loop that uses the buggy GZipStream to read a single gzip member, and then uses GZipStream again to read the next gzip member, starting at the first input byte not consumed by the previous GZipStream. That would be completely reliable, as opposed to the other suggestion to attempt to search for the start of gzip members.

Compressed data can have any byte pattern at all, so it is possible to be fooled into thinking you have found a gzip header when it is actually part of the compressed data of a gzip member. In fact, one of the deflate methods is to store the data without compression. A gzip stream compressed within a gzip member would likely be stored that way, since most of its data is already compressed and therefore very likely cannot be compressed further. It would then present a fully valid faux gzip header in the middle of the compressed data of a gzip member.

The suggestion to use DotNetZip instead is an excellent one. There have been many bugs in GZipStream, some of which were fixed in .NET 4.5, and some that obviously have not been. It may take Microsoft a few more years to figure out how to get that class written correctly. DotNetZip just works.

I've had a similar problem with DeflateStream.

A simple approach is to wrap your underlying Stream in a Stream implementation which only ever returns a single byte when a call to Read(byte[] buffer, int offset, int count) is made. That thwarts the buffering of the DeflateStream/GZipStream, leaving your underlying stream at the correct position when the end of the first stream is reached. Of course, there's obvious inefficiency here due to the increased number of calls to Read, but that may not be an issue depending on your application.
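The wrapper described above could look roughly like the following sketch. SingleByteReadStream, MultiMemberGzip and SingleByteDemo are hypothetical names, and the loop assumes a seekable input so that it can tell when every gzip member has been consumed. On a runtime whose GZipStream already reads all members by itself, the loop should simply run once.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical wrapper: hands the decompressor at most one byte per Read call,
// so it cannot buffer past the end of the member it is currently decoding.
public sealed class SingleByteReadStream : Stream
{
    private readonly Stream _inner;

    public SingleByteReadStream(Stream inner) { _inner = inner; }

    public override int Read(byte[] buffer, int offset, int count)
    {
        return count == 0 ? 0 : _inner.Read(buffer, offset, 1);
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}

public static class MultiMemberGzip
{
    // Decompresses every member of a seekable input into output.
    public static void DecompressAll(Stream input, Stream output)
    {
        while (input.Position < input.Length)
        {
            // The wrapper does not dispose the inner stream, so input stays open
            // when this GZipStream is disposed.
            using (var gz = new GZipStream(new SingleByteReadStream(input), CompressionMode.Decompress))
            {
                gz.CopyTo(output);
            }
        }
    }
}

public static class SingleByteDemo
{
    public static string RoundTrip()
    {
        var file = new MemoryStream();
        foreach (string part in new[] { "first ", "second ", "third" })
        {
            // Each GZipStream writes one complete member; leaveOpen keeps 'file' usable.
            using (var gz = new GZipStream(file, CompressionMode.Compress, true))
            {
                byte[] raw = Encoding.UTF8.GetBytes(part);
                gz.Write(raw, 0, raw.Length);
            }
        }
        file.Position = 0;

        var output = new MemoryStream();
        MultiMemberGzip.DecompressAll(file, output);
        return Encoding.UTF8.GetString(output.ToArray());
    }

    public static void Main()
    {
        Console.WriteLine(RoundTrip());
    }
}
```

The cost is one underlying Read call per compressed byte, so wrapping a BufferedStream around the file stream may help keep the overhead down.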

Poking into the internals of DeflateStream, it might be possible to use reflection to reset the internal Inflater instance.

I've verified that SharpZipLib 0.86.0.518 can read multi-member gzip files:

    // GZipInputStream lives in the ICSharpCode.SharpZipLib.GZip namespace
    using (var fileStream = File.OpenRead(filePath))
    using (var gz = new GZipInputStream(fileStream))
    {
        // Read the decompressed data from gz here
    }

You can get it using NuGet.
