Question
I have been running into OutOfMemoryExceptions while trying to load an 800MB text file into a DataTable via StreamReader. I was wondering if there is a way to load the DataTable from the stream in batches, i.e., read the first 10,000 rows of the text file from the StreamReader, create a DataTable, do something with the DataTable, then read the next 10,000 rows from the StreamReader, and so on.
My Google searches weren't very helpful here, but it seems like there should be an easy way to do this. Ultimately I will be writing the DataTables to an MS SQL db using SqlBulkCopy, so if there is an easier approach than what I have described, I would be thankful for a quick pointer in the right direction.
Edit - Here is the code that I am running:
public static DataTable PopulateDataTableFromText(DataTable dt, string txtSource)
{
    StreamReader sr = new StreamReader(txtSource);
    DataRow dr;
    int dtCount = dt.Columns.Count;
    string input;
    int i = 0;
    while ((input = sr.ReadLine()) != null)
    {
        try
        {
            string[] stringRows = input.Split(new char[] { '\t' });
            dr = dt.NewRow();
            for (int a = 0; a < dtCount; a++)
            {
                string dataType = dt.Columns[a].DataType.ToString();
                if (stringRows[a] == "" && (dataType == "System.Int32" || dataType == "System.Int64"))
                {
                    stringRows[a] = "0";
                }
                dr[a] = Convert.ChangeType(stringRows[a], dt.Columns[a].DataType);
            }
            dt.Rows.Add(dr);
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.ToString());
        }
        i++;
    }
    return dt;
}
And here is the error that is returned:
"System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
at System.String.Split(Char[] separator, Int32 count, StringSplitOptions options)
at System.String.Split(Char[] separator)
at Harvester.Config.PopulateDataTableFromText(DataTable dt, String txtSource) in C:...."
Regarding the suggestion to load the data directly into SQL - I'm a bit of a noob when it comes to C#, but I thought that's basically what I am doing? SqlBulkCopy.WriteToServer takes the DataTable that I create from the text file and imports it into SQL. Is there an even easier way to do this that I am missing?
Edit: Oh, I forgot to mention - this code will not be running on the same server as the SQL Server. The data text file is on Server B and needs to be written to a table on Server A. Does that preclude using bcp?
Answer 1:
Do you actually need to process the data in batches of rows, or could you process it row by row? In the latter case, I think Linq could be very helpful here, because it makes it easy to stream data across a "pipeline" of methods. That way you don't need to load a lot of data at once, only one row at a time.
First, you need to make your StreamReader enumerable. This is easily done with an extension method:
public static class TextReaderExtensions
{
    public static IEnumerable<string> Lines(this TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}
That way you can use the StreamReader as the source for a Linq query.
Then you need a method that takes a string and converts it to a DataRow:
DataRow ParseDataRow(string input)
{
    // Your parsing logic here
    ...
}
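As a rough illustration (not part of the original answer), the column-conversion logic from the question could be adapted into such a method. This is only a sketch: it assumes the target DataTable `dt` is available as a field or captured variable, and it keeps the original "empty string becomes 0 for integer columns" convention.
// Hypothetical sketch of ParseDataRow, adapted from the question's own conversion code.
// Assumes a DataTable `dt` with the destination schema is in scope.
DataRow ParseDataRow(string input)
{
    string[] fields = input.Split('\t');
    DataRow dr = dt.NewRow();
    for (int a = 0; a < dt.Columns.Count; a++)
    {
        Type columnType = dt.Columns[a].DataType;
        // Keep the original convention: empty numeric fields become zero.
        if (fields[a] == "" && (columnType == typeof(int) || columnType == typeof(long)))
        {
            fields[a] = "0";
        }
        dr[a] = Convert.ChangeType(fields[a], columnType);
    }
    return dr;
}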
With those elements, you can easily project each line from the file to a DataRow, and do whatever you need with it:
using (var reader = new StreamReader(fileName))
{
    var rows = reader.Lines().Select(ParseDataRow);
    foreach (DataRow row in rows)
    {
        // Do something with the DataRow
    }
}
(note that you could do something similar with a simple loop, without using Linq, but I think Linq makes the code more readable...)
Answer 2:
Have you considered loading the data directly into SQL Server and then manipulating it in the database? The database engine is already designed to perform manipulation of large volumes of data in an efficient manner. This may yield better results overall and allows you to leverage the capabilities of the database and SQL language to do the heavy lifting. It's the old "work smarter not harder" principle.
There are a number of different methods to load data into SQL Server, so you may want to examine these to see if any are a good fit. If you are using SQL Server 2005 or later and you really need to do some manipulation of the data in C#, you can always use a managed stored procedure.
Something to realize here is that the OutOfMemoryException is a bit misleading. Memory is more than just the amount of physical RAM you have. What you are likely running out of is addressable memory. This is a very different thing.
When you load a large file into memory and transform it into a DataTable, it likely requires a lot more than just 800MB to represent the same data. Since 32-bit .NET processes are limited to just under 2GB of addressable memory, you will likely never be able to process this quantity of data in a single batch.
What you will likely need to do is to process the data in a streaming manner. In other words, don't try to load it all into a DataTable and then bulk insert to SQL Server. Rather, process the file in chunks, clearing out the prior set of rows once you're done with them.
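A minimal sketch of that chunked approach (not from the original answer): the connection string, the destination table name "MyTable", and the method name are placeholder assumptions, and the per-line parsing mirrors the question's code.
// Sketch only, not the original poster's code. "connectionString" and "MyTable" are placeholders.
// Requires: using System; using System.Data; using System.Data.SqlClient; using System.IO;
public static void BulkCopyInBatches(string txtSource, DataTable dt, string connectionString)
{
    const int BatchSize = 10000;
    using (var reader = new StreamReader(txtSource))
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();
        using (var bulkCopy = new SqlBulkCopy(connection) { DestinationTableName = "MyTable" })
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Same per-line parsing as the question's code (empty-field handling omitted here).
                string[] fields = line.Split('\t');
                DataRow row = dt.NewRow();
                for (int i = 0; i < dt.Columns.Count; i++)
                {
                    row[i] = Convert.ChangeType(fields[i], dt.Columns[i].DataType);
                }
                dt.Rows.Add(row);

                if (dt.Rows.Count >= BatchSize)
                {
                    bulkCopy.WriteToServer(dt); // push the current batch to SQL Server
                    dt.Clear();                 // drop rows already written so memory stays bounded
                }
            }
            if (dt.Rows.Count > 0)
            {
                bulkCopy.WriteToServer(dt); // flush the final partial batch
            }
        }
    }
}
The key point is the dt.Clear() after each WriteToServer call, which keeps the DataTable from ever holding more than one batch of rows.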
Now, if you have access to a 64-bit machine with lots of memory (to avoid VM thrashing) and a copy of the 64-bit .NET runtime, you could probably get away with running the code unchanged. But I would suggest making the necessary changes anyway, since it will likely improve the performance of this even in that environment.
Answer 3:
SqlBulkCopy.WriteToServer has an overload that accepts an IDataReader. You can implement your own IDataReader as a wrapper around the StreamReader where the Read() method will consume a single line from the StreamReader. This way the data will be "streamed" into the database instead of trying to build it up in memory as a DataTable first. Hope that helps.
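A minimal sketch of such a wrapper (not from the original answer): it implements only the members SqlBulkCopy actually needs here (FieldCount, Read, GetValue, IsDBNull) and throws for the rest. The class name, the tab delimiter, and the explicit field count are assumptions; if you map columns by name, GetName/GetOrdinal would also need real implementations.
// Sketch only: a minimal IDataReader over a tab-delimited text file.
// Requires: using System; using System.Data; using System.IO;
public class TabDelimitedFileReader : IDataReader
{
    private readonly StreamReader _reader;
    private readonly int _fieldCount;
    private string[] _current;

    public TabDelimitedFileReader(string path, int fieldCount)
    {
        _reader = new StreamReader(path);
        _fieldCount = fieldCount;
    }

    public int FieldCount { get { return _fieldCount; } }

    // SqlBulkCopy calls Read() once per row until it returns false.
    public bool Read()
    {
        string line = _reader.ReadLine();
        if (line == null) return false;
        _current = line.Split('\t');
        return true;
    }

    // SqlBulkCopy pulls each column of the current row through GetValue.
    // Empty fields are returned as DBNull here; adjust to your own null/zero convention.
    public object GetValue(int i)
    {
        return i < _current.Length && _current[i] != "" ? (object)_current[i] : DBNull.Value;
    }

    public bool IsDBNull(int i) { return GetValue(i) == DBNull.Value; }

    public void Close() { _reader.Dispose(); }
    public void Dispose() { _reader.Dispose(); }
    public bool IsClosed { get { return false; } }
    public int Depth { get { return 0; } }
    public int RecordsAffected { get { return -1; } }
    public bool NextResult() { return false; }
    public DataTable GetSchemaTable() { throw new NotSupportedException(); }

    // The remaining IDataRecord members are not needed for this bulk copy scenario.
    public object this[int i] { get { return GetValue(i); } }
    public object this[string name] { get { throw new NotSupportedException(); } }
    public string GetName(int i) { throw new NotSupportedException(); }
    public int GetOrdinal(string name) { throw new NotSupportedException(); }
    public string GetDataTypeName(int i) { throw new NotSupportedException(); }
    public Type GetFieldType(int i) { return typeof(string); }
    public int GetValues(object[] values) { throw new NotSupportedException(); }
    public bool GetBoolean(int i) { throw new NotSupportedException(); }
    public byte GetByte(int i) { throw new NotSupportedException(); }
    public long GetBytes(int i, long fieldOffset, byte[] buffer, int bufferOffset, int length) { throw new NotSupportedException(); }
    public char GetChar(int i) { throw new NotSupportedException(); }
    public long GetChars(int i, long fieldOffset, char[] buffer, int bufferOffset, int length) { throw new NotSupportedException(); }
    public IDataReader GetData(int i) { throw new NotSupportedException(); }
    public DateTime GetDateTime(int i) { throw new NotSupportedException(); }
    public decimal GetDecimal(int i) { throw new NotSupportedException(); }
    public double GetDouble(int i) { throw new NotSupportedException(); }
    public float GetFloat(int i) { throw new NotSupportedException(); }
    public Guid GetGuid(int i) { throw new NotSupportedException(); }
    public short GetInt16(int i) { throw new NotSupportedException(); }
    public int GetInt32(int i) { throw new NotSupportedException(); }
    public long GetInt64(int i) { throw new NotSupportedException(); }
    public string GetString(int i) { return (string)GetValue(i); }
}
Usage would look roughly like bulkCopy.WriteToServer(new TabDelimitedFileReader(txtSource, columnCount)); SqlBulkCopy then pulls one line at a time, so the whole file is never materialized in memory.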
Answer 4:
As an update to the other answers here, I was researching this too and came across this page, which provides a great C# example of reading a text file in chunks, processing them in parallel, and then bulk inserting into a database.
The crux of the code is within this loop:
//Of note: it's faster to read all the lines we are going to act on and
//then process them in parallel instead of reading and processing line by line.
//Code source: http://cc.davelozinski.com/code/c-sharp-code/read-lines-in-batches-process-in-parallel
while (blnFileHasMoreLines)
{
    batchStartTime = DateTime.Now; //Reset the timer

    //Read in all the lines up to the BatchCopy size or
    //until there's no more lines in the file
    while (intLineReadCounter < BatchSize && !tfp.EndOfData)
    {
        CurrentLines[intLineReadCounter] = tfp.ReadFields();
        intLineReadCounter += 1;
        BatchCount += 1;
        RecordCount += 1;
    }

    batchEndTime = DateTime.Now; //record the end time of the current batch
    batchTimeSpan = batchEndTime - batchStartTime; //get the timespan for stats

    //Now process each line in parallel.
    Parallel.For(0, intLineReadCounter, x =>
    //for (int x = 0; x < intLineReadCounter; x++) //Or the slower single threaded version for debugging
    {
        List<object> values = null; //so each thread gets its own copy.

        if (tfp.TextFieldType == FieldType.Delimited)
        {
            if (CurrentLines[x].Length != CurrentRecords.Columns.Count)
            {
                //Do what you need to if the number of columns in the current line
                //don't match the number of expected columns
                return; //stop now and don't add this record to the current collection of valid records.
            }

            //Number of columns match so copy over the values into the datatable
            //for later upload into a database
            values = new List<object>(CurrentRecords.Columns.Count);
            for (int i = 0; i < CurrentLines[x].Length; i++)
                values.Add(CurrentLines[x][i].ToString());

            //OR do your own custom processing here if not using a database.
        }
        else if (tfp.TextFieldType == FieldType.FixedWidth)
        {
            //Implement your own processing if the file columns are fixed width.
        }

        //Now lock the data table before saving the results so there's no thread bashing on the datatable
        lock (oSyncLock)
        {
            CurrentRecords.LoadDataRow(values.ToArray(), true);
        }

        values.Clear();
    }
    ); //Parallel.For

    //If you're not using a database, you obviously won't need this next piece of code.
    if (BatchCount >= BatchSize)
    {
        //Do the SQL bulk copy and save the info into the database
        sbc.BatchSize = CurrentRecords.Rows.Count;
        sbc.WriteToServer(CurrentRecords);

        BatchCount = 0; //Reset these values
        CurrentRecords.Clear(); // "
    }

    if (CurrentLines[intLineReadCounter] == null)
        blnFileHasMoreLines = false; //we're all done, so signal while loop to stop

    intLineReadCounter = 0; //reset for next pass
    Array.Clear(CurrentLines, 0, CurrentLines.Length);
} //while blnFileHasMoreLines
Source: https://stackoverflow.com/questions/3816789/read-from-streamreader-in-batches