The company I work for runs a few hundred very dynamic web sites. It has decided to build a search engine and I was tasked with writing the scraper. Some of the sites run on old
TPL Dataflow and async-await are indeed powerful, and simple enough to do just what you need:
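    // Needs usings for System.Collections.Concurrent, System.Net.Http
    // and System.Threading.Tasks.Dataflow.
    async Task<IEnumerable<string>> GetAllStringsAsync(IEnumerable<string> urls)
    {
        var client = new HttpClient();
        var bag = new ConcurrentBag<string>();
        // At most 5 downloads run at once; results land in the bag in completion order.
        var block = new ActionBlock<string>(
            async url => bag.Add(await client.GetStringAsync(url)),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5 });
        foreach (var url in urls)
        {
            block.Post(url);
        }
        // Signal that no more URLs are coming, then wait for the queued ones to finish.
        block.Complete();
        await block.Completion;
        return bag;
    }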
I recommend you use HttpClient with Task.WhenAll, and SemaphoreSlim for simple throttling:
    // Allows at most 5 downloads in flight at any time.
    private readonly SemaphoreSlim _mutex = new SemaphoreSlim(5);
    private readonly HttpClient _client = new HttpClient();
    private async Task<string> DownloadStringAsync(string url)
    {
        // Wait asynchronously for a free slot.
        await _mutex.WaitAsync();
        try
        {
            return await _client.GetStringAsync(url);
        }
        finally
        {
            // Always free the slot, even if the download throws.
            _mutex.Release();
        }
    }
    IEnumerable<string> urls = ...;
    var data = await Task.WhenAll(urls.Select(url => DownloadStringAsync(url)));
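If you want to convince yourself that the semaphore really caps concurrency before pointing this at live sites, here is a minimal, self-contained sketch. It is not a real scraper: the download is simulated with a hypothetical FakeDownloadAsync that only waits 20 ms, and the ThrottleDemo class, the MeasureAsync method, and the example.invalid URLs are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottleDemo
{
    // Runs `requestCount` simulated downloads behind a SemaphoreSlim(limit)
    // and reports how many completed and the peak number seen in flight.
    public static async Task<(int Completed, int Peak)> MeasureAsync(int requestCount, int limit)
    {
        var throttle = new SemaphoreSlim(limit);
        var gate = new object();
        int inFlight = 0;
        int peak = 0;

        // Hypothetical stand-in for _client.GetStringAsync(url).
        async Task<string> FakeDownloadAsync(string url)
        {
            await throttle.WaitAsync();
            try
            {
                lock (gate)
                {
                    inFlight++;
                    if (inFlight > peak) peak = inFlight;
                }
                await Task.Delay(20); // simulated network latency
                return "body of " + url;
            }
            finally
            {
                lock (gate) { inFlight--; }
                throttle.Release();
            }
        }

        var urls = Enumerable.Range(1, requestCount)
                             .Select(i => "http://example.invalid/page" + i);
        string[] bodies = await Task.WhenAll(urls.Select(url => FakeDownloadAsync(url)));
        return (bodies.Length, peak);
    }
}
```

Calling ThrottleDemo.MeasureAsync(20, 5) should complete all 20 simulated downloads while the reported peak never goes above 5, which is the same guarantee the throttle gives the real DownloadStringAsync.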
Alternatively, you could use TPL Dataflow and set MaxDegreeOfParallelism for the throttling.