The company I work for runs a few hundred very dynamic web sites. It has decided to build a search engine and I was tasked with writing the scraper. Some of the sites run on old
I recommend you use HttpClient with Task.WhenAll, with SemaphoreSlim for simple throttling:
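// Allows at most 5 downloads to be in flight at once.
private readonly SemaphoreSlim _mutex = new SemaphoreSlim(5);
// A single HttpClient instance is shared across all requests.
private readonly HttpClient _client = new HttpClient();

private async Task<string> DownloadStringAsync(string url)
{
    // Asynchronously wait for a free slot before starting the request.
    await _mutex.WaitAsync();
    try
    {
        return await _client.GetStringAsync(url);
    }
    finally
    {
        // Always free the slot, even if the request throws.
        _mutex.Release();
    }
}

IEnumerable<string> urls = ...;
var data = await Task.WhenAll(urls.Select(url => DownloadStringAsync(url)));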
Alternatively, you could use TPL Dataflow and set MaxDegreeOfParallelism for the throttling.
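A minimal sketch of the Dataflow variant, assuming the System.Threading.Tasks.Dataflow NuGet package is referenced. The class name DataflowDownloader, the method name DownloadAllAsync, and the default limit of 5 are hypothetical, chosen only to mirror the SemaphoreSlim example:

```csharp
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // from the System.Threading.Tasks.Dataflow package

public static class DataflowDownloader // hypothetical name
{
    private static readonly HttpClient Client = new HttpClient();

    public static async Task<List<string>> DownloadAllAsync(
        IEnumerable<string> urls, int maxParallel = 5) // 5 is an assumed limit
    {
        // The block runs at most maxParallel downloads concurrently.
        var downloader = new TransformBlock<string, string>(
            async url => await Client.GetStringAsync(url),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = maxParallel });

        // Queue every URL, then signal that no more input is coming.
        foreach (var url in urls)
        {
            downloader.Post(url);
        }
        downloader.Complete();

        // Drain results as they become available.
        var results = new List<string>();
        while (await downloader.OutputAvailableAsync())
        {
            results.Add(await downloader.ReceiveAsync());
        }

        // Awaiting Completion rethrows any exception raised by a download.
        await downloader.Completion;
        return results;
    }
}
```

Usage would look like var pages = await DataflowDownloader.DownloadAllAsync(urls);. Compared with the SemaphoreSlim approach, this does not create one task per URL up front, which may matter when the URL list is very large.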