Question
Looking for some help with best practices on creating a potentially multi-threaded asynchronous application. This application will look through several directories for a certain pattern (configurable per directory). For all of the files it finds in that directory, it will kick off an asynchronous operation for each file (read/write, db operations, api calls, etc). The directories themselves should be processed concurrently as they are un-related to each other.
It's my understanding that Task may not always execute on a separate thread. Because this application may have to handle dozens to hundreds of files at any one time, I want to make sure I am maximizing the throughput of the application. It's also worth noting that there may or may not be files in the directory when this application runs.

Is simply using Task enough to accomplish this and achieve maximum throughput, or is there some combination of Parallel.ForEach with an asynchronous function that would be better? Below is what I have created so far, just as a test, and it looks like it's processing one directory at a time on the same thread.
Main
class Program {
    static IEnumerable<DirectoryConfig> GetDirectoryConfigs() {
        return new DirectoryConfig[] {
            new DirectoryConfig {
                DirectoryPath = @"PATH_1",
                Token = "*",
                FileProcessor = new FileProcessor()
            },
            new DirectoryConfig {
                DirectoryPath = @"PATH_2",
                Token = "*",
                FileProcessor = new FileProcessor()
            }
        };
    }

    static async Task Main(string[] args) {
        IEnumerable<DirectoryConfig> directoryConfigs = GetDirectoryConfigs();
        List<Task> tasks = new List<Task>();
        foreach (DirectoryConfig config in directoryConfigs) {
            Console.WriteLine("Processing directory {0}", config.DirectoryPath);
            tasks.Add(new DirectoryMonitor().ProcessDirectoryAsync(config));
        }
        await Task.WhenAll(tasks);
    }
}
DirectoryMonitor
class DirectoryMonitor {
    public Task ProcessDirectoryAsync(DirectoryConfig config) {
        List<Task> tasks = new List<Task>();
        foreach (string file in Directory.GetFiles(config.DirectoryPath, config.Token)) {
            tasks.Add(config.FileProcessor.ProcessAsync(file));
        }
        return Task.WhenAll(tasks);
    }
}
FileProcessor
class FileProcessor : IFileProcessor {
    public async Task ProcessAsync(string file) {
        string fileName = Path.GetFileName(file);
        Console.WriteLine("Processing file {0} on thread {1}", fileName, Thread.CurrentThread.ManagedThreadId);
        using (StreamReader reader = new StreamReader(file)) {
            int lineNumber = 0;
            while (!reader.EndOfStream) {
                Console.WriteLine("Reading line {0} of file {1}", ++lineNumber, fileName);
                string line = await reader.ReadLineAsync();
                await DoAsyncWork(line);
            }
        }
    }

    private Task DoAsyncWork(string line) {
        return Task.Delay(1000);
    }
}
Answer 1:
For this kind of job a powerful tool you could use is the TPL Dataflow library. With this tool you can create a processing pipeline consisting of many linked blocks, with the data flowing from the first block to the last (circles and meshes are also possible).
The advantages of this approach are:
- You get data-parallelism on top of task-parallelism. All blocks work concurrently and independently from each other.
- You can optimally configure the level of concurrency (a.k.a. degree of parallelism) of each heterogeneous operation. For example, making API calls may be highly parallelizable, while reading from the hard disk may not be parallelizable at all.
- You get advanced options out of the box (BoundedCapacity, CancellationToken and others).
- You get built-in support for both synchronous and asynchronous delegates.
Below is how you could rewrite your original code in TPL Dataflow terms. Three blocks are used: two TransformManyBlocks and one ActionBlock.
var directoryBlock = new TransformManyBlock<DirectoryConfig, string>(config =>
{
    return Directory.GetFiles(config.DirectoryPath, config.Token);
});

var fileBlock = new TransformManyBlock<string, string>(filePath =>
{
    return File.ReadLines(filePath);
});

var lineBlock = new ActionBlock<string>(async line =>
{
    await Task.Delay(1000);
}, new ExecutionDataflowBlockOptions()
{
    MaxDegreeOfParallelism = 4
});

directoryBlock.LinkTo(fileBlock, new DataflowLinkOptions { PropagateCompletion = true });
fileBlock.LinkTo(lineBlock, new DataflowLinkOptions { PropagateCompletion = true });

foreach (DirectoryConfig config in GetDirectoryConfigs()) {
    await directoryBlock.SendAsync(config);
}
directoryBlock.Complete();
await lineBlock.Completion;
This example is not very good, since all the work is done by the last block (the lineBlock), and the first two blocks do essentially nothing. It is also not memory-efficient, since all lines of all files of all directories will soon be queued in the input buffer of the ActionBlock, unless processing the lines happens to be faster than reading them from the disk. You'll need to configure the blocks with BoundedCapacity to solve this problem.
This example also fails to demonstrate how you could have different blocks for different types of files, and link the directoryBlock to all of them, using a different filtering predicate for each link (note that Path.GetExtension returns the extension including the leading dot):

directoryBlock.LinkTo(csvBlock, filePath => Path.GetExtension(filePath) == ".csv");
directoryBlock.LinkTo(xlsBlock, filePath => Path.GetExtension(filePath) == ".xls");
directoryBlock.LinkTo(generalFileBlock); // Anything that is neither csv nor xls
There are also other types of blocks you could use, like the TransformBlock and the BatchBlock. The TPL Dataflow library is based on the Task Parallel Library (TPL), and it is essentially a high-level task generator that creates and controls the lifecycle of the tasks needed to process a workload of a given type, based on declarative configuration. It is built into .NET Core, and available as a package for .NET Framework.
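For illustration, here is a minimal sketch of those two block types working together (the parsing delegate and the batch size of 10 are arbitrary choices, not part of the original pipeline):

```csharp
using System.Threading.Tasks.Dataflow;

// A TransformBlock maps each input message to exactly one output message;
// here it parses a text line into an int.
var parseBlock = new TransformBlock<string, int>(line => int.Parse(line));

// A BatchBlock groups incoming messages into arrays of up to 10 items.
var batchBlock = new BatchBlock<int>(10);

parseBlock.LinkTo(batchBlock, new DataflowLinkOptions { PropagateCompletion = true });

// Each message emitted by batchBlock is an int[] of up to 10 parsed values.
```

When the upstream block completes, a BatchBlock emits any remaining items as a final, smaller batch, so no data is lost at shutdown.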
Answer 2:
Tasks do indeed not necessarily run on different threads, but the default scheduler and thread pool will do a pretty good job of processing work as fast as possible. You can tweak the defaults (see the second link below), but it's unlikely (though possible) that you will improve the outcome that way. There's usually little point in, e.g., running hundreds of concurrent threads if your CPU can only execute, say, 4 operations truly concurrently.
Then there could be contention for the storage device, especially if it’s a spinning disk, which might affect things significantly. Depending on the size of the files it might be more performant to read the whole file in one go rather than stream it in line by line.
Lastly, as no doubt others will also say: try different options and measure. There are a lot of variables (hardware, directory structure, file sizes, the type/complexity/duration of processing you are doing on each file) that could all affect performance, and it’s likely only measuring will determine the best option.
Some reading for further pointers:
https://docs.microsoft.com/en-us/dotnet/standard/threading/managed-threading-best-practices
https://docs.microsoft.com/en-us/dotnet/standard/threading/the-managed-thread-pool#thread-pool-characteristics
Source: https://stackoverflow.com/questions/62602684/c-sharp-process-files-concurrently-and-asynchronously