Question
I'm building a C# program to read the RDF data in the Google Freebase data dump. To start out, I've written a simple loop that reads the file and counts the triples. However, instead of the 1.9 billion triples stated on the documentation page (referenced above), my program counts only about 11.5 million and then exits. The relevant portion of the source code is given below (it takes about 30 seconds to run).
What am I missing here?
// Simple reading through the gz file
try
{
    using (FileStream fileToDecompress = File.Open(@"C:\Users\Krishna\Downloads\freebase-rdf-2014-02-16-00-00.gz", FileMode.Open))
    {
        int tupleCount = 0;
        string readLine = "";
        using (GZipStream decompressionStream = new GZipStream(fileToDecompress, CompressionMode.Decompress))
        {
            StreamReader sr = new StreamReader(decompressionStream, detectEncodingFromByteOrderMarks: true);
            while (true)
            {
                readLine = sr.ReadLine();
                if (readLine != null)
                {
                    tupleCount++;
                    if (tupleCount % 1000000 == 0)
                    {
                        Console.WriteLine(DateTime.Now.ToShortTimeString() + ": " + tupleCount.ToString());
                    }
                }
                else
                {
                    break;
                }
            }
            Console.WriteLine("Tuples: " + tupleCount.ToString());
        }
    }
}
catch (Exception ex)
{
    Console.WriteLine(ex.Message);
}
(I tried using GZippedNTriplesParser in dotNetRdf to read the data, building on this recommendation, but it seems to choke on an RdfParseException right at the beginning (tab delimiters? UTF-8?). So, for the moment, I'm trying to roll my own.)
Answer 1:
The Freebase RDF dumps are built by a map/reduce job that outputs 200 individual Gzip files. Those 200 files are then concatenated into one final Gzip file. According to the Gzip spec, concatenating the raw bytes of multiple Gzip files produces a valid Gzip file. A library that adheres to the spec should, when decompressing that file, produce the concatenated content of all the input files.
Based on the number of triples that you're seeing, I'm guessing that your code is only decompressing the first chunk of the file and ignoring the other 199. I'm not much of a C# programmer, but from reading another Stack Overflow answer it seems like switching to DotNetZip will solve this problem.
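To make the failure mode concrete, here is a small self-contained repro sketch (all names hypothetical, using System.IO.Compression). It gzips two strings separately, concatenates the raw bytes, and reads the result back with a single GZipStream. On the .NET Framework of the time this prints only "line-1", which is exactly the truncation the question describes; later versions of .NET handle concatenated members.

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class GzipConcatRepro
{
    // Compress a string into a standalone gzip member.
    static byte[] GzipBytes(string text)
    {
        using (var ms = new MemoryStream())
        {
            using (var gz = new GZipStream(ms, CompressionMode.Compress))
            {
                byte[] data = Encoding.UTF8.GetBytes(text);
                gz.Write(data, 0, data.Length);
            }
            return ms.ToArray();
        }
    }

    static void Main()
    {
        byte[] first = GzipBytes("line-1\n");
        byte[] second = GzipBytes("line-2\n");

        // Concatenate the raw bytes of the two members, as the dump build does.
        var combined = new MemoryStream();
        combined.Write(first, 0, first.Length);
        combined.Write(second, 0, second.Length);
        combined.Position = 0;

        using (var gz = new GZipStream(combined, CompressionMode.Decompress))
        using (var reader = new StreamReader(gz))
        {
            // A decoder that stops at the first member prints only "line-1";
            // a multi-member-aware reader prints both lines.
            string line;
            while ((line = reader.ReadLine()) != null)
                Console.WriteLine(line);
        }
    }
}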
Answer 2:
I used DotNetZip and created a decorator class, GzipDecorator, as a workaround for the "gzipped chunks" issue. (The GZipStream used below is DotNetZip's Ionic.Zlib.GZipStream, which exposes the TotalIn/TotalOut counters.)
using System;
using System.IO;
using Ionic.Zlib; // DotNetZip's GZipStream, which tracks TotalIn/TotalOut

// Decorator over the raw file stream: when one gzip member is exhausted,
// seek to the start of the next member and restart decompression there.
sealed class GzipDecorator : Stream
{
    private readonly Stream _readStream;
    private GZipStream _gzip;
    private long _totalIn;
    private long _totalOut;

    public GzipDecorator(Stream readStream)
    {
        if (readStream == null) throw new ArgumentNullException("readStream");
        _readStream = readStream;
        _gzip = new GZipStream(_readStream, CompressionMode.Decompress, true);
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        var bytesRead = _gzip.Read(buffer, offset, count);
        if (bytesRead <= 0 && _readStream.Position < _readStream.Length)
        {
            // TotalIn counts only the deflate payload, so add 18 bytes for the
            // fixed gzip header (10) and trailer (8) to land on the next member.
            _totalIn += _gzip.TotalIn + 18;
            _totalOut += _gzip.TotalOut;
            _gzip.Dispose();
            _readStream.Position = _totalIn;
            _gzip = new GZipStream(_readStream, CompressionMode.Decompress, true);
            bytesRead = _gzip.Read(buffer, offset, count);
        }
        return bytesRead;
    }

    // Remaining Stream members: this wrapper is a forward-only read stream.
    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { return _readStream.Length; } }
    public override long Position { get { return _totalOut + _gzip.TotalOut; } set { throw new NotSupportedException(); } }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}
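For completeness, a minimal sketch of how the decorator could be plugged into the counting loop from the question (same hypothetical path as above):

// Hypothetical usage: wrap the raw FileStream in the decorator, then read
// lines exactly as before.
using (var file = File.Open(@"C:\Users\Krishna\Downloads\freebase-rdf-2014-02-16-00-00.gz", FileMode.Open))
using (var reader = new StreamReader(new GzipDecorator(file)))
{
    long tupleCount = 0;
    while (reader.ReadLine() != null)
        tupleCount++;
    Console.WriteLine("Tuples: " + tupleCount);
}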
Answer 3:
I managed to solve the problem by repacking the dump with the 7-Zip archiver. Maybe that helps you.
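If you'd rather repack in-process instead of via 7-Zip, a sketch along the same lines (hypothetical file names, reusing the GzipDecorator from Answer 2) would decompress every member and rewrite the content as a single gzip member, which any decoder can then read in one pass:

// Hypothetical in-process repack: read all members via GzipDecorator,
// then recompress the whole content as one gzip member.
using (var input = File.Open(@"freebase-rdf-2014-02-16-00-00.gz", FileMode.Open))
using (var output = File.Create(@"freebase-repacked.gz"))
using (var gzOut = new GZipStream(output, CompressionMode.Compress))
{
    new GzipDecorator(input).CopyTo(gzOut);
}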
Source: https://stackoverflow.com/questions/21868658/c-sharp-parsing-of-freebase-rdf-dump-yields-only-11-5-million-n-triples-instead