Question
So my requirement is to read multiple CSV files (each having a minimum of a million rows) and then parse each line. Currently, I have broken up my pipeline so that I first create a separate pipeline just to read a CSV file into a string[], and I plan to create the parsing pipeline later.
But seeing the results of my file-reading pipeline, I am dumbfounded: it is considerably slower than just looping through the CSV file and then looping through its rows.
public static IPropagatorBlock<string, string[]> CreatePipeline(int batchSize)
{
    var lineBufferBlock = new BufferBlock<string>(new DataflowBlockOptions { BoundedCapacity = batchSize });

    var fileReadingBlock = new ActionBlock<string>(async (filePath) =>
    {
        using (var fileStream = File.OpenRead(filePath))
        using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, batchSize))
        {
            string line;
            while ((line = streamReader.ReadLine()) != null)
            {
                var isCompleted = await lineBufferBlock.SendAsync(line);
                while (!isCompleted)
                {
                    isCompleted = await lineBufferBlock.SendAsync(line);
                }
            }
        }
    }, new ExecutionDataflowBlockOptions { EnsureOrdered = true, MaxDegreeOfParallelism = Environment.ProcessorCount });

    var fileParsingBlock = new TransformBlock<string, string[]>((line) =>
    {
        return line.Split(",");
    }, new ExecutionDataflowBlockOptions { EnsureOrdered = true, MaxDegreeOfParallelism = Environment.ProcessorCount });

    lineBufferBlock.LinkTo(fileParsingBlock, new DataflowLinkOptions { PropagateCompletion = true });

    fileReadingBlock.Completion.ContinueWith((task) =>
    {
        lineBufferBlock.Complete();
    });

    return DataflowBlock.Encapsulate(fileReadingBlock, fileParsingBlock);
}
And then I finally consume it as follows:
for (int i = 1; i < 5; i++)
{
    var filePath = $"C:\\Users\\File{i}.csv";
    fileReadingPipeline.SendAsync(filePath);
}
fileReadingPipeline.Complete();

while (true)
{
    try
    {
        var outputRows = fileReadingPipeline.Receive();
        foreach (string word in outputRows)
        {
        }
    }
    catch (InvalidOperationException e)
    {
        break;
    }
}
Whereas my straight loop code is the following:
for (int i = 1; i < 5; i++)
{
    var filePath = $"C:\\Users\\File{i}.csv";
    foreach (string row in File.ReadLines(filePath))
    {
        foreach (string word in row.Split(","))
        {
        }
    }
}
The difference in performance comes down to ~15 seconds for the TPL Dataflow version, whereas it is ~5 seconds for the looping code.
EDIT
Following advice from the comments, I have removed the unnecessary lineBufferBlock from the pipeline; this is my code now. However, the performance remains the same.
var fileReadingBlock = new TransformManyBlock<string, string>((filePath) =>
{
    return File.ReadLines(filePath);
}, new ExecutionDataflowBlockOptions { EnsureOrdered = true, MaxDegreeOfParallelism = Environment.ProcessorCount });

var fileParsingBlock = new TransformBlock<string, string[]>((line) =>
{
    return line.Split(",");
}, new ExecutionDataflowBlockOptions { EnsureOrdered = true, MaxDegreeOfParallelism = Environment.ProcessorCount });

fileReadingBlock.LinkTo(fileParsingBlock, new DataflowLinkOptions { PropagateCompletion = true });
return DataflowBlock.Encapsulate(fileReadingBlock, fileParsingBlock);
Answer 1:
When you configure a pipeline, you should keep in mind the capabilities of the hardware that is going to do the job. The TPL Dataflow is not doing the job by itself; it delegates it to the CPU, the HDD/SSD, the network card, etc. For example, when reading files from a hard disk, it is probably futile to instruct the TPL to read data from 8 files concurrently, because the head of the mechanical arm of the HDD cannot be physically located in 8 places at the same time. This boils down to the fact that reading files from filesystems is not particularly parallel-friendly. It is slightly better in the case of SSDs, but you'll have to test it on a case-by-case basis.
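To see how your own drive behaves, a rough benchmark sketch along the lines below could help before you settle on a MaxDegreeOfParallelism for the reading block. This is not code from the original answer; the file paths are the hypothetical ones from the question, and it assumes the System.Diagnostics, System.IO, System.Linq, and System.Threading.Tasks namespaces are imported.

// Compare sequential vs concurrent reads of the same files on your own hardware.
var files = Enumerable.Range(1, 4)
    .Select(i => $"C:\\Users\\File{i}.csv") // hypothetical paths from the question
    .ToArray();

var stopwatch = Stopwatch.StartNew();
foreach (var file in files)
{
    foreach (var line in File.ReadLines(file)) { } // read-only pass, no processing
}
Console.WriteLine($"Sequential reads: {stopwatch.ElapsedMilliseconds} ms");

stopwatch.Restart();
Parallel.ForEach(files, file =>
{
    foreach (var line in File.ReadLines(file)) { } // same pass, one file per worker
});
Console.WriteLine($"Concurrent reads: {stopwatch.ElapsedMilliseconds} ms");

If the concurrent run is not clearly faster, there is little point in letting the reading block use more than one worker.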
Another issue with parallelization is granularity. You want the workload to be chunky, not granular. Otherwise the cost of passing messages from buffer to buffer, and of putting memory barriers around each transfer to ensure cross-thread visibility, may negate any benefits you would expect from employing parallelism. Tip: splitting a single string into parts is a highly granular operation.
Here is a way to do it:
using static MoreLinq.Extensions.BatchExtension;
var reader = new TransformManyBlock<string, string[]>(filePath =>
{
    return File.ReadLines(filePath).Batch(100, r => r.ToArray());
}, new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 1
});

var parser = new TransformBlock<string[], string[][]>(lines =>
{
    return lines.Select(line => line.Split(",")).ToArray();
}, new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = Environment.ProcessorCount
});

reader.LinkTo(parser, new DataflowLinkOptions { PropagateCompletion = true });
This example uses the Batch operator from the MoreLinq package in order to pass the lines around in batches of 100, instead of passing them one by one. You can find other batching options here.
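As an aside (not part of the original answer), if you are on .NET 6 or later and would rather avoid the MoreLinq dependency, the built-in Enumerable.Chunk operator should produce the same string[] batches, so the reader block could presumably be written as:

// Assumes .NET 6+: Chunk yields arrays of up to 100 lines each.
var reader = new TransformManyBlock<string, string[]>(filePath =>
{
    return File.ReadLines(filePath).Chunk(100);
}, new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 1
});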
Update: One more suggestion is to boost the minimum number of threads that the ThreadPool creates on demand (SetMinThreads). Otherwise the ThreadPool will be immediately saturated by the MaxDegreeOfParallelism = Environment.ProcessorCount configuration, which will cause small but noticeable (~500 msec) delays, because of the intentional laziness of the ThreadPool's thread-injection algorithm.
ThreadPool.SetMinThreads(Environment.ProcessorCount * 2,
    Environment.ProcessorCount * 2);
It is enough to call this method once at the start of the program.
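For completeness, here is a rough sketch of how the reader/parser pair above could be fed with file paths and drained. It is my own assumption of the wiring, not code from the original answer: the consuming sink block is hypothetical, the file paths are the ones from the question, and the snippet is meant to run inside an async method.

ThreadPool.SetMinThreads(Environment.ProcessorCount * 2,
    Environment.ProcessorCount * 2);

// Hypothetical sink for the parsed batches produced by the parser block.
var sink = new ActionBlock<string[][]>(parsedBatch =>
{
    // Placeholder for whatever processing each batch of parsed rows needs.
});

parser.LinkTo(sink, new DataflowLinkOptions { PropagateCompletion = true });

for (int i = 1; i < 5; i++)
{
    await reader.SendAsync($"C:\\Users\\File{i}.csv"); // paths from the question
}
reader.Complete();
await sink.Completion; // completion propagates reader -> parser -> sink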
Source: https://stackoverflow.com/questions/65006462/why-is-my-tpl-dataflow-pipeline-slower-in-reading-huge-csv-files-compared-to-jus