How to properly parallelise a job heavily relying on I/O

耶瑟儿~ 2020-12-02 15:04

I'm building a console application that has to process a bunch of data.

Basically, the application grabs references from a DB. For each reference, parse the conten

5 Answers
  • 2020-12-02 15:10

    The good news is your logic could be easily separated into steps that go into a producer-consumer pipeline.

    • Step 1: Read file
    • Step 2: Parse file
    • Step 3: Write file
    • Step 4: SendToWs

    If you are using .NET 4.0 you can use the BlockingCollection data structure as the backbone for each step's producer-consumer queue. The main thread enqueues each work item into step 1's queue, where it is picked up, processed, and then forwarded on to step 2's queue, and so on.

    If you are willing to move on to the Async CTP then you can take advantage of the new TPL Dataflow structures for this as well. There is the BufferBlock<T> data structure, among others, that behaves in a similar manner to BlockingCollection and integrates well with the new async and await keywords.

    Because your algorithm is I/O bound, the producer-consumer strategies may not get you the performance boost you are looking for, but at least you will have a very elegant solution that would scale well if you could increase the I/O throughput. I am afraid steps 1 and 3 will be the bottlenecks and the pipeline will not balance well, but it is worth experimenting with.
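
    As a rough illustration of the BlockingCollection approach, here is a minimal sketch that connects stages with bounded queues. It collapses the four steps into a producer, a pool of parsing workers, and a final collector; the names PipelineSketch, Run, parse, and parseWorkers, as well as the queue capacity of 100, are assumptions made for the example and do not come from the question. Error handling is deliberately left out.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class PipelineSketch
{
    // Pushes every input through a "parse" stage using bounded queues.
    // 'parse' is a stand-in for the real per-file work (hypothetical).
    public static List<string> Run(IEnumerable<string> inputs,
                                   Func<string, string> parse,
                                   int parseWorkers)
    {
        // Bounded queues make a fast stage block instead of flooding memory.
        var toParse = new BlockingCollection<string>(100);
        var parsed = new BlockingCollection<string>(100);

        // Stage 1: enqueue the work items, then mark the queue as complete.
        Task producer = Task.Factory.StartNew(() =>
        {
            try
            {
                foreach (string item in inputs) toParse.Add(item);
            }
            finally
            {
                toParse.CompleteAdding();
            }
        }, TaskCreationOptions.LongRunning);

        // Stage 2: several workers drain the first queue and feed the second.
        Task[] workers = Enumerable.Range(0, parseWorkers)
            .Select(i => Task.Factory.StartNew(() =>
            {
                foreach (string item in toParse.GetConsumingEnumerable())
                    parsed.Add(parse(item));
            }, TaskCreationOptions.LongRunning))
            .ToArray();

        // Once every worker has finished, close the second queue.
        Task closer = Task.Factory.ContinueWhenAll(
            workers, done => parsed.CompleteAdding());

        // Stage 3: the calling thread consumes the final queue.
        var results = new List<string>();
        foreach (string item in parsed.GetConsumingEnumerable())
            results.Add(item);

        Task.WaitAll(producer, closer);
        Task.WaitAll(workers); // surfaces any worker exception
        return results;
    }
}
```

    A call such as PipelineSketch.Run(paths, ParseHtml, 4) would run four parsing workers; with more than one worker, results come back in no particular order. A real pipeline would add one queue per step (read, parse, write, send) in the same way.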

  • 2020-12-02 15:10

    I think your approach of splitting up the list of files and processing each file in one batch is OK. My feeling is that you might gain more performance by tuning the degree of parallelism. See: var refs = GetReferencesFromDB().AsParallel().WithDegreeOfParallelism(16); this would process up to 16 files at the same time. Currently you are probably processing 2 or 4 files at a time, depending on the number of cores you have. That default is only efficient for pure computation without I/O. For I/O-intensive tasks, raising it can bring large performance improvements by reducing processor idle time.

    If you are going to split up and join tasks back using producer-consumer, look at this sample: "Using Parallel Linq Extensions to union two sequences, how can one yield the fastest results first?"
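
    To make the degree-of-parallelism suggestion concrete, here is a small sketch that wraps the PLINQ call in a helper. The names DegreeOfParallelismSketch, ProcessAll, and processOne are made up for the example; processOne stands in for whatever per-reference work the application does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DegreeOfParallelismSketch
{
    // Runs 'processOne' over every item, letting PLINQ use up to 'degree'
    // concurrent tasks instead of its default of one per core.
    public static List<TResult> ProcessAll<T, TResult>(
        IEnumerable<T> items, Func<T, TResult> processOne, int degree)
    {
        return items
            .AsParallel()
            .WithDegreeOfParallelism(degree) // upper bound on concurrent tasks
            .Select(processOne)
            .ToList(); // forces execution; result order is not guaranteed
    }
}
```

    The right value for degree depends on how long each item spends waiting on I/O, so it is worth measuring a few settings (for example 4, 16, and 32) against the real workload rather than assuming 16 is best.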

  • 2020-12-02 15:19

    Your best bet in this kind of scenario is definitely the producer-consumer model: one thread to pull the data and a bunch of workers to process it. There's no easy way around the I/O, so you might as well focus on optimizing the computation itself.

    I will now try to sketch a model:

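    // shared state, visible to the producer and to every consumer
    var queue = new Queue<DataRow>();
    var itemAvailable = new AutoResetEvent(false);

    // producer thread
    var refs = GetReferencesFromDB(); // ~5000 DataRows returned

    // "ref" and "event" are C# keywords, so the variables use other names
    foreach (DataRow reference in refs)
    {
        lock (queue)
        {
            queue.Enqueue(reference);
            itemAvailable.Set(); // signal that an item was placed in the queue
        }

        // if the queue is limited, test if the queue is full and wait.
    }

    // consumer threads (each worker runs this loop)
    while (true)
    {
        DataRow value = null;
        lock (queue)
        {
            if (queue.Count > 0)
            {
                value = queue.Dequeue();
            }
        }

        if (value != null)
        {
            // process value
        }
        else
        {
            itemAvailable.WaitOne(); // sleep until the producer signals a new item
        }
    }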
    

    You can find more details about producer/consumer in part 4 of Threading in C#: http://www.albahari.com/threading/part4.aspx

  • 2020-12-02 15:20

    You're not leveraging any async I/O APIs anywhere in your code. Everything you're doing is treated as CPU-bound work, so all your I/O operations waste CPU resources by blocking threads. AsParallel is for compute-bound tasks; if you want to take advantage of async I/O, you need to use the Asynchronous Programming Model (APM) based APIs available today in .NET 4.0 and earlier. You do this by looking for BeginXXX/EndXXX methods on the I/O classes you're using and using them whenever they are available.

    Read this post for starters: TPL TaskFactory.FromAsync vs Tasks with blocking methods

    Next, you don't want to use AsParallel in this case anyway. AsParallel enables streaming, which results in immediately scheduling a new Task per item, but you don't need or want that here. You'd be much better served by partitioning the work using Parallel.ForEach.

    Let's see how you can use this knowledge to achieve max concurrency in your specific case:

    var refs = GetReferencesFromDB();

    // Using Parallel.ForEach here will partition and process your data on separate worker threads
    Parallel.ForEach(
        refs,
        reference => // "ref" is a C# keyword, so the parameter needs another name
    {
        string filePath = GetFilePath(reference);

        // Assumption: the original sample never defined destinationFilePath;
        // GetDestinationFilePath is a hypothetical helper standing in for that logic
        string destinationFilePath = GetDestinationFilePath(reference);

        byte[] fileDataBuffer = new byte[1048576];

        // Need to use FileStream API directly so we can enable async I/O
        FileStream sourceFileStream = new FileStream(
                                          filePath,
                                          FileMode.Open,
                                          FileAccess.Read,
                                          FileShare.Read,
                                          8192,
                                          true);

        // Use FromAsync to read the data from the file
        Task<int> readSourceFileStreamTask = Task<int>.Factory.FromAsync(
                                                 sourceFileStream.BeginRead,
                                                 sourceFileStream.EndRead,
                                                 fileDataBuffer,
                                                 0,
                                                 fileDataBuffer.Length,
                                                 null);

        // Add a continuation that will fire when the async read is completed
        readSourceFileStreamTask.ContinueWith(readSourceFileStreamAntecedent =>
        {
            int sourceFileStreamBytesRead;

            try
            {
                // Determine exactly how many bytes were read
                // NOTE: this will propagate any potential exception that may have occurred in EndRead
                sourceFileStreamBytesRead = readSourceFileStreamAntecedent.Result;
            }
            finally
            {
                // Always clean up the source stream
                sourceFileStream.Close();
                sourceFileStream = null;
            }

            // This is here to make sure you don't end up trying to read files larger than this sample code can handle
            if(sourceFileStreamBytesRead == fileDataBuffer.Length)
            {
                throw new NotSupportedException("You need to implement reading files larger than 1MB. :P");
            }

            // Convert the file data to a string
            string html = Encoding.UTF8.GetString(fileDataBuffer, 0, sourceFileStreamBytesRead);

            // Parse the HTML
            string convertedHtml = ParseHtml(html);

            // This is here to make sure you don't end up trying to write files larger than this sample code can handle
            if(Encoding.UTF8.GetByteCount(convertedHtml) > fileDataBuffer.Length)
            {
                throw new NotSupportedException("You need to implement writing files larger than 1MB. :P");
            }

            // Convert the file data back to bytes for writing, remembering how many bytes were produced
            int bytesToWrite = Encoding.UTF8.GetBytes(convertedHtml, 0, convertedHtml.Length, fileDataBuffer, 0);

            // Need to use FileStream API directly so we can enable async I/O
            // FileMode.Create truncates an existing file so no stale bytes are left at the end
            FileStream destinationFileStream = new FileStream(
                                                   destinationFilePath,
                                                   FileMode.Create,
                                                   FileAccess.Write,
                                                   FileShare.None,
                                                   8192,
                                                   true);

            // Use FromAsync to write the data to the file (only the bytes that hold converted HTML)
            Task destinationFileStreamWriteTask = Task.Factory.FromAsync(
                                                      destinationFileStream.BeginWrite,
                                                      destinationFileStream.EndWrite,
                                                      fileDataBuffer,
                                                      0,
                                                      bytesToWrite,
                                                      null);

            // Add a continuation that will fire when the async write is completed
            destinationFileStreamWriteTask.ContinueWith(destinationFileStreamWriteAntecedent =>
            {
                try
                {
                    // NOTE: we call wait here to observe any potential exceptions that might have occurred in EndWrite
                    destinationFileStreamWriteAntecedent.Wait();
                }
                finally
                {
                    // Always close the destination file stream
                    destinationFileStream.Close();
                    destinationFileStream = null;
                }
            },
            TaskContinuationOptions.AttachedToParent);

            // Send to external system **concurrent** to writing to destination file system above
            SendToWs(reference, convertedHtml);
        },
        TaskContinuationOptions.AttachedToParent);
    });
    

    Now, here are a few notes:

    1. This is sample code, so I'm using a 1MB buffer to read/write files. This is excessive for HTML files and wasteful of system resources. You can either lower it to suit your max needs or implement chained reads/writes into a StringBuilder, which is an exercise I leave up to you since I'd be writing ~500 more lines of code to do async chained reads/writes. :P
    2. You'll note that on the continuations for the read/write tasks I have TaskContinuationOptions.AttachedToParent. This is very important, as it prevents the worker thread that Parallel.ForEach starts the work on from completing until all the underlying async calls have completed. If this was not here you would kick off work for all 5000 items concurrently, which would pollute the TPL subsystem with thousands of scheduled Tasks and not scale properly at all.
    3. I call SendToWs concurrently with writing the file to the file share here. I don't know what underlies the implementation of SendToWs, but it too sounds like a good candidate for making async. Right now it's assumed to be pure compute work and, as such, it is going to burn a CPU thread while executing. I leave it as an exercise for you to figure out how best to use what I've shown you to improve throughput there.
    4. This is all typed free form; my brain was the only compiler here, and SO's syntax highlighting is all I used to make sure the syntax was good. So please forgive any syntax errors, and let me know if I screwed up anything so badly that you can't make heads or tails of it, and I'll follow up.
  • 2020-12-02 15:20

    Just a suggestion, but have you looked into the producer/consumer pattern? A certain number of threads would read your files from disk and feed the content to a queue. Then another set of threads, known as the consumers, would "consume" the queue as it's filled. http://zone.ni.com/devzone/cda/tut/p/id/3023
