How to limit the amount of concurrent async I/O operations?

遇见更好的自我 2020-11-22 01:27
// let's say there is a list of 1000+ URLs
string[] urls = { "http://google.com", "http://yahoo.com", ... };

// now let's send HTTP requests to each of these

14 Answers
  • 2020-11-22 01:59

    Theo Yaung's example is nice, but here is a variant that does not keep a list of pending tasks.

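    using System;
    using System.Collections.Generic;
    using System.Net;
    using System.Threading;
    using System.Threading.Tasks;

    class SomeChecker
    {
        private const int ThreadCount = 20; // maximum number of concurrent downloads
        private CountdownEvent _countdownEvent;
        private SemaphoreSlim _throttler;

        public Task Check(IList<string> urls)
        {
            _countdownEvent = new CountdownEvent(urls.Count);
            _throttler = new SemaphoreSlim(ThreadCount);

            return Task.Run( // prevent UI thread lock
                async () =>
                {
                    foreach (var url in urls)
                    {
                        // do an async wait until we can schedule again
                        await _throttler.WaitAsync();
                        // deliberately NOT awaited; note that exceptions
                        // thrown inside ProcessUrl therefore go unobserved
                        ProcessUrl(url);
                    }
                    // instead of await Task.WhenAll(allTasks);
                    // note: this blocks a thread-pool thread until every URL is done
                    _countdownEvent.Wait();
                });
        }

        private async Task ProcessUrl(string url)
        {
            try
            {
                using (var client = new WebClient())
                {
                    var page = await client.DownloadStringTaskAsync(new Uri(url));
                    ProcessResult(page);
                }
            }
            finally
            {
                // always free the slot and count the URL as done, even on failure
                _throttler.Release();
                _countdownEvent.Signal();
            }
        }

        private void ProcessResult(string page) { /*....*/ }
    }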
    
  • 2020-11-22 02:00

    Here is a solution that takes advantage of the lazy nature of LINQ. It is functionally equivalent to the accepted answer, but uses worker-tasks instead of a SemaphoreSlim, which reduces the memory footprint of the whole operation. First, let's make it work without throttling. The first step is to convert our URLs to an enumerable of tasks.

    string[] urls =
    {
        "https://stackoverflow.com",
        "https://superuser.com",
        "https://serverfault.com",
        "https://meta.stackexchange.com",
        // ...
    };
    var httpClient = new HttpClient();
    var tasks = urls.Select(async (url) =>
    {
        return (Url: url, Html: await httpClient.GetStringAsync(url));
    });
    

    The second step is to await all tasks concurrently using the Task.WhenAll method:

    var results = await Task.WhenAll(tasks);
    foreach (var result in results)
    {
        Console.WriteLine($"Url: {result.Url}, {result.Html.Length:#,0} chars");
    }
    

    Output:

    Url: https://stackoverflow.com, 105.574 chars
    Url: https://superuser.com, 126.953 chars
    Url: https://serverfault.com, 125.963 chars
    Url: https://meta.stackexchange.com, 185.276 chars
    ...

    Microsoft's implementation of Task.WhenAll immediately materializes the supplied enumerable into an array, causing all tasks to start at once. We don't want that, because we want to limit the number of concurrent asynchronous operations. So we need to implement an alternative WhenAll that enumerates our enumerable gently and slowly. We do that by creating a number of worker-tasks (equal to the desired level of concurrency); each worker-task enumerates our enumerable one task at a time, using a lock to ensure that each url-task is processed by only one worker-task. Then we await all worker-tasks, and finally we return the results. Here is the implementation:

    public static async Task<T[]> WhenAll<T>(IEnumerable<Task<T>> tasks,
        int concurrencyLevel)
    {
        if (tasks is ICollection<Task<T>>) throw new ArgumentException(
            "The enumerable should not be materialized.", nameof(tasks));
        var locker = new object();
        var results = new List<T>();
        var failed = false;
        using (var enumerator = tasks.GetEnumerator())
        {
            var workerTasks = Enumerable.Range(0, concurrencyLevel)
            .Select(async _ =>
            {
                try
                {
                    while (true)
                    {
                        Task<T> task;
                        int index;
                        lock (locker)
                        {
                            if (failed) break;
                            if (!enumerator.MoveNext()) break;
                            task = enumerator.Current;
                            index = results.Count;
                            results.Add(default); // Reserve space in the list
                        }
                        var result = await task.ConfigureAwait(false);
                        lock (locker) results[index] = result;
                    }
                }
                catch (Exception)
                {
                    lock (locker) failed = true;
                    throw;
                }
            }).ToArray();
            await Task.WhenAll(workerTasks).ConfigureAwait(false);
        }
        lock (locker) return results.ToArray();
    }
    

    ...and here is what we must change in our initial code, to achieve the desired throttling:

    var results = await WhenAll(tasks, concurrencyLevel: 2);
    

    There is a difference regarding the handling of the exceptions. The native Task.WhenAll waits for all tasks to complete and aggregates all the exceptions. The implementation above terminates promptly after the completion of the first faulted task.
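
    The aggregation behavior of the built-in Task.WhenAll can be seen in a small sketch. The class and method names below (WhenAllExceptionsDemo, CountFailuresAsync) are made up for illustration and are not part of the answer's code; the sketch only shows that awaiting surfaces a single exception while the returned task keeps all of them.

    ```csharp
    using System;
    using System.Threading.Tasks;

    public static class WhenAllExceptionsDemo
    {
        // Hypothetical helper: combines two already-failed tasks and one
        // completed task, then reports how many exceptions WhenAll collected.
        public static async Task<int> CountFailuresAsync()
        {
            Task all = Task.WhenAll(
                Task.FromException(new InvalidOperationException("first")),
                Task.FromException(new InvalidOperationException("second")),
                Task.CompletedTask);
            try
            {
                await all.ConfigureAwait(false);
            }
            catch (InvalidOperationException)
            {
                // await rethrows only one of the stored exceptions
            }
            // the combined task still carries every exception
            return all.Exception.InnerExceptions.Count;
        }
    }
    ```

    With the custom WhenAll above, by contrast, the workers stop pulling new url-tasks once one of them has failed, so later URLs may never be requested at all.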

  • 2020-11-22 02:00

    This is my second answer, with a possibly improved version of Theo Yaung's solution (the accepted answer). It is also based on a SemaphoreSlim and enumerates the URLs lazily, but it does not rely on Task.WhenAll to await the completion of the tasks; the SemaphoreSlim is used for that purpose too. This can be an advantage, because the completed tasks need not be referenced during the whole operation. Instead, each task becomes eligible for garbage collection immediately after its completion.

    Two overloads of the ForEachAsync extension method are provided (the name is borrowed from Dogu Arslan's answer, the next most popular answer). One is for tasks that return a result, and one for tasks that do not. A nice extra feature is the onErrorContinue parameter, which controls the behavior in case of exceptions. The default is false, which mimics the behavior of Parallel.ForEach (which stops processing shortly after an exception), not the behavior of Task.WhenAll (which waits for all tasks to complete).

    public static async Task<TResult[]> ForEachAsync<TSource, TResult>(
        this IEnumerable<TSource> source,
        Func<TSource, Task<TResult>> taskFactory,
        int concurrencyLevel = 1,
        bool onErrorContinue = false)
    {
        // Arguments validation omitted
        var throttler = new SemaphoreSlim(concurrencyLevel);
        var results = new List<TResult>();
        var exceptions = new ConcurrentQueue<Exception>();
        int index = 0;
        foreach (var item in source)
        {
            var localIndex = index++;
            lock (results) results.Add(default); // Reserve space in the list
            await throttler.WaitAsync(); // continue on captured context
            if (!onErrorContinue && !exceptions.IsEmpty) { throttler.Release(); break; }
    
            Task<TResult> task;
            try { task = taskFactory(item); } // or Task.Run(() => taskFactory(item))
            catch (Exception ex)
            {
                exceptions.Enqueue(ex); throttler.Release();
                if (onErrorContinue) continue; else break;
            }
    
            _ = task.ContinueWith(t =>
            {
                try { lock (results) results[localIndex] = t.GetAwaiter().GetResult(); }
                catch (Exception ex) { exceptions.Enqueue(ex); }
                finally { throttler.Release(); }
            }, default, TaskContinuationOptions.ExecuteSynchronously,
                TaskScheduler.Default);
        }
    
        // Wait for the last operations to complete
        for (int i = 0; i < concurrencyLevel; i++)
        {
            await throttler.WaitAsync().ConfigureAwait(false);
        }
        if (!exceptions.IsEmpty) throw new AggregateException(exceptions);
        lock (results) return results.ToArray();
    }
    
    public static Task ForEachAsync<TSource>(
        this IEnumerable<TSource> source,
        Func<TSource, Task> taskFactory,
        int concurrencyLevel = 1,
        bool onErrorContinue = false)
    {
        // Arguments validation omitted
        return ForEachAsync<TSource, object>(source, async item =>
        {
            await taskFactory(item).ConfigureAwait(false); return null;
        }, concurrencyLevel, onErrorContinue);
    }
    

    The taskFactory is invoked on the context of the caller. This can be desirable because it allows (for example) UI elements to be accessed inside the lambda. If it is preferable to invoke it on the ThreadPool instead, just replace taskFactory(item) with Task.Run(() => taskFactory(item)).

    To keep things simple, the non-generic Task ForEachAsync overload is implemented suboptimally, by calling the generic Task<TResult[]> overload.

    Usage example:

    await urls.ForEachAsync(async url =>
    {
        var html = await httpClient.GetStringAsync(url);
        TextBox1.AppendText($"Url: {url}, {html.Length:#,0} chars\r\n");
    }, concurrencyLevel: 10, onErrorContinue: true);
    
  • 2020-11-22 02:03

    There are a lot of pitfalls, and direct use of a semaphore can be tricky in error cases, so I would suggest using the AsyncEnumerator NuGet package instead of reinventing the wheel:

    // let's say there is a list of 1000+ URLs
    string[] urls = { "http://google.com", "http://yahoo.com", ... };
    
    // now let's send HTTP requests to each of these URLs in parallel
    await urls.ParallelForEachAsync(async (url) => {
        var client = new HttpClient();
        var html = await client.GetStringAsync(url);
    }, maxDegreeOfParalellism: 20);
    
  • 2020-11-22 02:08

    You can definitely do this in the latest versions of async for .NET, using .NET 4.5 Beta. The previous post from 'usr' points to a good article written by Stephen Toub, but the less-announced news is that the async semaphore actually made it into the Beta release of .NET 4.5.

    If you look at our beloved SemaphoreSlim class (which you should be using since it's more performant than the original Semaphore), it now boasts the WaitAsync(...) series of overloads, with all of the expected arguments - timeout intervals, cancellation tokens, all of your usual scheduling friends :)
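
    As a minimal sketch of those overloads, here is WaitAsync(TimeSpan, CancellationToken), which completes with true when a slot was acquired and with false when the timeout elapsed first. The class and method names (WaitAsyncOverloadsDemo, RunAsync) and the 100 ms timeout are illustrative assumptions, not part of the original answer.

    ```csharp
    using System;
    using System.Threading;
    using System.Threading.Tasks;

    public static class WaitAsyncOverloadsDemo
    {
        // Hypothetical demo: a semaphore with a single slot is entered twice.
        public static async Task<(bool First, bool Second)> RunAsync()
        {
            using (var throttler = new SemaphoreSlim(initialCount: 1))
            using (var cts = new CancellationTokenSource())
            {
                var timeout = TimeSpan.FromMilliseconds(100);

                // A slot is free, so this completes right away with true.
                bool first = await throttler.WaitAsync(timeout, cts.Token);

                // No slot is left and nobody releases one, so this gives up
                // after the timeout and completes with false.
                bool second = await throttler.WaitAsync(timeout, cts.Token);

                // Release only what was actually acquired.
                if (first) throttler.Release();
                return (first, second);
            }
        }
    }
    ```

    Cancelling the token while a wait is pending makes that wait throw an OperationCanceledException instead of returning false.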

    Stephen has also written a more recent blog post about the new .NET 4.5 goodies that came out with the Beta; see What’s New for Parallelism in .NET 4.5 Beta.

    Here's some sample code showing how to use SemaphoreSlim for async method throttling:

    public async Task MyOuterMethod()
    {
        // let's say there is a list of 1000+ URLs
        string[] urls = { "http://google.com", "http://yahoo.com", ... };
    
        // now let's send HTTP requests to each of these URLs in parallel
        var allTasks = new List<Task>();
        var throttler = new SemaphoreSlim(initialCount: 20);
        foreach (var url in urls)
        {
            // do an async wait until we can schedule again
            await throttler.WaitAsync();
    
            // using Task.Run(...) to run the lambda in its own parallel
            // flow on the threadpool
            allTasks.Add(
                Task.Run(async () =>
                {
                    try
                    {
                        var client = new HttpClient();
                        var html = await client.GetStringAsync(url);
                    }
                    finally
                    {
                        throttler.Release();
                    }
                }));
        }
    
        // won't get here until all urls have been put into tasks
        await Task.WhenAll(allTasks);
    
        // won't get here until all tasks have completed in some way
        // (either success or exception)
    }
    

    Last, but probably worth a mention, is a solution that uses TPL-based scheduling. You can create delegate-bound tasks on the TPL that have not yet been started, and let a custom task scheduler limit the concurrency. In fact, there's an MSDN sample for it; see also TaskScheduler.
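
    A minimal sketch of the scheduler idea, assuming the built-in ConcurrentExclusiveSchedulerPair is an acceptable stand-in for the custom scheduler from the MSDN sample (which is not reproduced here). The class name LimitedSchedulerDemo and the Thread.Sleep workload are illustrative only.

    ```csharp
    using System;
    using System.Linq;
    using System.Threading;
    using System.Threading.Tasks;

    public static class LimitedSchedulerDemo
    {
        // Runs `count` synchronous work items on a scheduler capped at
        // `maxConcurrency` and returns the highest number of items that
        // were observed running at the same time.
        public static int Run(int count, int maxConcurrency)
        {
            TaskScheduler scheduler = new ConcurrentExclusiveSchedulerPair(
                TaskScheduler.Default, maxConcurrency).ConcurrentScheduler;
            var gate = new object();
            int running = 0, peak = 0;

            // Delegate-bound tasks that are created but not yet started.
            Task[] tasks = Enumerable.Range(0, count).Select(_ => new Task(() =>
            {
                lock (gate) { running++; peak = Math.Max(peak, running); }
                Thread.Sleep(50); // stand-in for synchronous work
                lock (gate) { running--; }
            })).ToArray();

            // Starting them on the capped scheduler limits the concurrency.
            foreach (Task task in tasks) task.Start(scheduler);
            Task.WaitAll(tasks);
            return peak;
        }
    }
    ```

    Keep in mind that a scheduler only limits how many delegates execute at once. An async delegate gives its slot back at its first await, so this approach fits synchronous work better than throttling outstanding async I/O, for which the SemaphoreSlim version above is the more direct tool.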

  • 2020-11-22 02:08

    If you have an IEnumerable (e.g. strings of URLs) and you want to do an I/O-bound operation with each of them (e.g. make an async HTTP request) concurrently, AND optionally you also want to set the maximum number of concurrent I/O requests at run time, here is how you can do that. This approach does not tie up thread-pool threads; it uses a SemaphoreSlim to control the maximum number of concurrent I/O requests, similar to a sliding-window pattern: one request completes and leaves the semaphore, and the next one gets in.

    usage: await ForEachAsync(urlStrings, YourAsyncFunc, optionalMaxDegreeOfConcurrency);

    // DefaultMaxDegreeOfParallelism is assumed to be an int constant defined elsewhere in the class.
    public static Task ForEachAsync<TIn>(
        IEnumerable<TIn> inputEnumerable,
        Func<TIn, Task> asyncProcessor,
        int? maxDegreeOfParallelism = null)
    {
        int maxAsyncThreadCount = maxDegreeOfParallelism ?? DefaultMaxDegreeOfParallelism;
        SemaphoreSlim throttler = new SemaphoreSlim(maxAsyncThreadCount, maxAsyncThreadCount);

        IEnumerable<Task> tasks = inputEnumerable.Select(async input =>
        {
            // wait for a free slot before starting the operation
            await throttler.WaitAsync().ConfigureAwait(false);
            try
            {
                await asyncProcessor(input).ConfigureAwait(false);
            }
            finally
            {
                // free the slot so the next pending operation can start
                throttler.Release();
            }
        });

        return Task.WhenAll(tasks);
    }
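
    To check the sliding-window behavior described above, here is a rough sketch that applies the same SemaphoreSlim pattern to simulated requests and records how many were in flight at once. The names SlidingWindowDemo and MeasurePeakAsync are hypothetical, and Task.Delay stands in for a real HTTP call.

    ```csharp
    using System;
    using System.Linq;
    using System.Threading;
    using System.Threading.Tasks;

    public static class SlidingWindowDemo
    {
        // Starts `itemCount` simulated I/O operations, lets at most
        // `maxConcurrency` run at a time, and returns the peak number
        // that was observed in flight.
        public static async Task<int> MeasurePeakAsync(int itemCount, int maxConcurrency)
        {
            var throttler = new SemaphoreSlim(maxConcurrency, maxConcurrency);
            var gate = new object();
            int inFlight = 0, peak = 0;

            var tasks = Enumerable.Range(0, itemCount).Select(async _ =>
            {
                await throttler.WaitAsync().ConfigureAwait(false);
                try
                {
                    lock (gate) { inFlight++; peak = Math.Max(peak, inFlight); }
                    await Task.Delay(20).ConfigureAwait(false); // simulated request
                    lock (gate) { inFlight--; }
                }
                finally
                {
                    throttler.Release();
                }
            });

            await Task.WhenAll(tasks).ConfigureAwait(false);
            return peak;
        }
    }
    ```

    Calling await SlidingWindowDemo.MeasurePeakAsync(10, 3) should never report more than 3, because an operation is only counted after it has entered the semaphore and is uncounted before it leaves.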
    