TPL Dataflow vs plain Semaphore

故里飘歌 2021-02-10 23:09

I have a requirement to build a scalable process. The process consists mainly of I/O operations, with some minor CPU operations (mainly deserializing strings). The process queries the datab

2 Answers
  •  走了就别回头了
    2021-02-10 23:30

    Here are the selling points of the Semaphore approach:

    1. Simplicity

    And here are the selling points of the TPL Dataflow approach:

    1. Task-parallelism on top of data-parallelism
    2. Optimal utilization of resources (bandwidth, CPU, database connections)
    3. Configurable degree of parallelism for each of the heterogeneous operations
    4. Reduced memory footprint

    As an example, let's review the following Semaphore implementation:

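    string[] urls = FetchUrlsFromDB();
    var cts = new CancellationTokenSource();
    var semaphore = new SemaphoreSlim(10); // Degree of parallelism (DOP)
    // One task is created per URL up front; the semaphore only limits
    // how many of them run the workflow at the same time.
    Task[] tasks = urls.Select(url => Task.Run(async () =>
    {
        await semaphore.WaitAsync(cts.Token);
        try
        {
            string rawData = DownloadData(url); // I/O: network
            var data = Deserialize(rawData);    // CPU
            PersistToCRM(data);                 // I/O: CRM
            MarkAsCompleted(url);
        }
        finally
        {
            semaphore.Release(); // Free the slot even if a step throws
        }
    })).ToArray();
    Task.WaitAll(tasks); // Block until every URL has been processed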
    

    The above implementation ensures that at most 10 URLs are processed concurrently at any given moment. There is no coordination between these parallel workflows, though. For example, it is entirely possible that at one moment all 10 workflows are downloading data, at another moment all 10 are deserializing raw data, and at yet another moment all 10 are persisting data to the CRM. This is far from ideal. Ideally, the bottleneck of the whole operation, whether that is the network adapter, the CPU, or the database server, should work non-stop, and not be underutilized (or completely idle) at random moments.

    Another consideration is how much parallelization is optimal for each of the heterogeneous operations. A DOP of 10 may be optimal for communicating with the web, but too low or too high for communicating with the database. The Semaphore approach does not allow for that level of fine-tuning. Your only option is to compromise by selecting a DOP value somewhere between these optimal values.

    If the number of URLs is very large, let's say 1,000,000, then the Semaphore approach above also raises serious memory usage concerns. A URL may have a size of 50 bytes on average, while a Task that is connected to a CancellationToken may be 10 times heavier or more. Of course you could change the implementation and use the SemaphoreSlim in a more clever way that doesn't generate so many tasks, but this would go against the primary (and only) selling point of this approach: its simplicity.
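
To illustrate what a "more clever" use of SemaphoreSlim might look like, here is a minimal sketch that acquires a slot before creating each task, so that only about `dop` Task objects are alive at a time. The `ThrottledLoop` class, the `ForEachThrottledAsync` method, and the `body` delegate are hypothetical names introduced for this illustration; they are not part of the original code or of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not a library API.
public static class ThrottledLoop
{
    // Runs `body` for every item with at most `dop` items in flight.
    public static async Task ForEachThrottledAsync<T>(
        IEnumerable<T> source, int dop, Func<T, Task> body,
        CancellationToken token = default)
    {
        var semaphore = new SemaphoreSlim(dop);
        var inFlight = new List<Task>();

        foreach (T item in source)
        {
            // Acquire a slot BEFORE creating the task, so that tasks are
            // created lazily instead of all at once.
            await semaphore.WaitAsync(token);

            // Forget finished tasks; awaiting them surfaces failures early.
            for (int i = inFlight.Count - 1; i >= 0; i--)
            {
                if (!inFlight[i].IsCompleted) continue;
                await inFlight[i];
                inFlight.RemoveAt(i);
            }

            inFlight.Add(RunOneAsync(item));
        }

        await Task.WhenAll(inFlight);

        async Task RunOneAsync(T item)
        {
            // Task.Run keeps synchronous work in `body` off the feeding loop.
            try { await Task.Run(() => body(item)); }
            finally { semaphore.Release(); }
        }
    }
}
```

This keeps memory bounded, but it is already noticeably more code than the original snippet, and it still applies a single DOP to the whole download/deserialize/persist workflow.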

    The TPL Dataflow library solves all of these problems, at the cost of the (smallish) learning curve required to tame this powerful tool.
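
As a rough sketch of how such a pipeline could look, the code below gives each stage its own block with its own `MaxDegreeOfParallelism`, and uses `BoundedCapacity` together with `SendAsync` so that only a limited number of items are buffered at any time. The `CrmPipeline` class and the four delegates are hypothetical stand-ins for `DownloadData`, `Deserialize`, `PersistToCRM`, and `MarkAsCompleted` (their signatures are assumed), and the DOP and capacity values are arbitrary placeholders that would need tuning.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical sketch; delegate signatures and option values are assumptions.
public static class CrmPipeline
{
    public static async Task RunAsync<TData>(
        IEnumerable<string> urls,
        Func<string, Task<string>> downloadAsync,
        Func<string, TData> deserialize,
        Func<TData, Task> persistAsync,
        Func<string, Task> markCompletedAsync)
    {
        // Stage 1: network-bound download.
        var download = new TransformBlock<string, (string Url, string Raw)>(
            async url => (url, await downloadAsync(url)),
            new ExecutionDataflowBlockOptions
            { MaxDegreeOfParallelism = 10, BoundedCapacity = 100 });

        // Stage 2: CPU-bound deserialization.
        var parse = new TransformBlock<(string Url, string Raw), (string Url, TData Data)>(
            x => (x.Url, deserialize(x.Raw)),
            new ExecutionDataflowBlockOptions
            { MaxDegreeOfParallelism = Environment.ProcessorCount, BoundedCapacity = 100 });

        // Stage 3: persist to the CRM, then mark the URL as completed.
        var persist = new ActionBlock<(string Url, TData Data)>(
            async x => { await persistAsync(x.Data); await markCompletedAsync(x.Url); },
            new ExecutionDataflowBlockOptions
            { MaxDegreeOfParallelism = 4, BoundedCapacity = 100 });

        var link = new DataflowLinkOptions { PropagateCompletion = true };
        download.LinkTo(parse, link);
        parse.LinkTo(persist, link);

        // Completion only propagates forward. If a later stage fails, fault
        // the first block too, so the feeding loop below does not wait forever.
        _ = persist.Completion.ContinueWith(
            t => ((IDataflowBlock)download).Fault(t.Exception),
            TaskContinuationOptions.OnlyOnFaulted);

        foreach (string url in urls)
        {
            // SendAsync waits while the first block is full (backpressure),
            // and returns false once the block no longer accepts input.
            if (!await download.SendAsync(url)) break;
        }

        download.Complete();
        await persist.Completion; // Rethrows if any stage failed.
    }
}
```

Because every block is bounded, a slow stage eventually makes the feeding loop wait, so the number of buffered items stays limited no matter how many URLs there are, while the three stages run concurrently with each other.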
