Parallel scraping in .NET

攒了一身酷 2021-02-10 04:45

The company I work for runs a few hundred very dynamic web sites. It has decided to build a search engine and I was tasked with writing the scraper. Some of the sites run on old

2 Answers

Happy的楠姐 (OP)

2021-02-10 05:34

I recommend you use HttpClient with Task.WhenAll, with SemaphoreSlim for simple throttling:

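using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Allow at most 5 downloads in flight at any time.
private readonly SemaphoreSlim _mutex = new SemaphoreSlim(5);
// Share one HttpClient across all requests.
private readonly HttpClient _client = new HttpClient();

private async Task<string> DownloadStringAsync(string url)
{
  // Asynchronously wait for a free slot.
  await _mutex.WaitAsync();
  try
  {
    return await _client.GetStringAsync(url);
  }
  finally
  {
    // Always free the slot, even if the request throws.
    _mutex.Release();
  }
}

IEnumerable<string> urls = ...;
// Start every download; the semaphore limits how many actually run at once.
string[] data = await Task.WhenAll(urls.Select(url => DownloadStringAsync(url)));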

Alternatively, you could use TPL Dataflow and set MaxDegreeOfParallelism for the throttling.
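
For reference, a minimal sketch of what the Dataflow variant might look like. The DataflowScraper class, the DownloadAllAsync helper and its download delegate parameter are hypothetical names made up for this illustration. It assumes the System.Threading.Tasks.Dataflow package is available, and it is only one possible arrangement:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class DataflowScraper
{
    // Hypothetical helper: runs `download` for every URL, with at most
    // `maxParallel` downloads in flight at the same time.
    public static async Task<IList<string>> DownloadAllAsync(
        IEnumerable<string> urls,
        Func<string, Task<string>> download,
        int maxParallel = 5)
    {
        // MaxDegreeOfParallelism is the throttle; it plays the role that
        // SemaphoreSlim(5) plays in the Task.WhenAll version.
        var block = new TransformBlock<string, string>(
            download,
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = maxParallel });

        // Queue all the work, then signal that no more input is coming.
        foreach (var url in urls)
            block.Post(url);
        block.Complete();

        // Drain the output until the block has finished.
        var results = new List<string>();
        while (await block.OutputAvailableAsync())
        {
            while (block.TryReceive(out var page))
                results.Add(page);
        }

        // Surface any exception thrown by a download.
        await block.Completion;
        return results;
    }
}
```

With a shared HttpClient named client, a call could look like `var pages = await DataflowScraper.DownloadAllAsync(urls, url => client.GetStringAsync(url));`. Taking the download step as a delegate keeps the throttling logic independent of HttpClient, so it can be tried out without touching the network.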
