Parallel scraping in .NET

攒了一身酷 2021-02-10 04:45

The company I work for runs a few hundred very dynamic web sites. It has decided to build a search engine and I was tasked with writing the scraper. Some of the sites run on old

2 answers
  •  Happy的楠姐
    2021-02-10 05:34

    I recommend you use HttpClient with Task.WhenAll, with SemaphoreSlim for simple throttling:

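    // Allows at most 5 downloads to run at the same time.
    private readonly SemaphoreSlim _mutex = new SemaphoreSlim(5);
    // One shared HttpClient instance is reused for every request.
    private readonly HttpClient _client = new HttpClient();
    private async Task<string> DownloadStringAsync(string url)
    {
      // Asynchronously wait for a free slot before starting the request.
      await _mutex.WaitAsync();
      try
      {
        return await _client.GetStringAsync(url);
      }
      finally
      {
        // Always free the slot, even if the request throws.
        _mutex.Release();
      }
    }
    
    IEnumerable<string> urls = ...;
    // Start all downloads; the semaphore limits how many actually run concurrently.
    var data = await Task.WhenAll(urls.Select(url => DownloadStringAsync(url)));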
    

    Alternatively, you could use TPL Dataflow and set MaxDegreeOfParallelism for the throttling.
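
    As a rough illustration of the Dataflow option, here is a minimal sketch, not a definitive implementation. The names DataflowDownloader, DownloadAllAsync, fetch and maxParallel are hypothetical and chosen only for this example. It assumes the System.Threading.Tasks.Dataflow library is available (on .NET Framework it is a separate NuGet package). The fetch delegate is passed in so the sketch can be tried without network access; in real use it could be something like _client.GetStringAsync.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class DataflowDownloader
{
    // Runs `fetch` for every URL with at most `maxParallel` calls in flight.
    // Results come back in input order, which is TransformBlock's default.
    public static async Task<IReadOnlyList<string>> DownloadAllAsync(
        IEnumerable<string> urls,
        Func<string, Task<string>> fetch,
        int maxParallel = 5)
    {
        var block = new TransformBlock<string, string>(
            fetch,
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = maxParallel });

        // The block's input queue is unbounded by default, so Post accepts every item.
        foreach (var url in urls)
        {
            block.Post(url);
        }
        block.Complete(); // signal that no more input will arrive

        // Drain the output; a TransformBlock does not complete until its
        // output buffer has been emptied.
        var results = new List<string>();
        while (await block.OutputAvailableAsync())
        {
            while (block.TryReceive(out var item))
            {
                results.Add(item);
            }
        }

        await block.Completion; // rethrows if any fetch call failed
        return results;
    }
}
```

    A call such as await DataflowDownloader.DownloadAllAsync(urls, _client.GetStringAsync, 5) would play the same role as the Task.WhenAll line above. The trade-off is a little more setup in exchange for a throttle that is a block option rather than a hand-managed semaphore.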
