Parallel scraping in .NET

The company I work for runs a few hundred very dynamic web sites. It has decided to build a search engine and I was tasked with writing the scraper. Some of the sites run on old

2 Answers

不知归路

2021-02-10 05:29

TPL Dataflow and async-await are indeed powerful and simple enough to do just what you need:

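using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // from the System.Threading.Tasks.Dataflow NuGet package

async Task<IEnumerable<string>> GetAllStringsAsync(IEnumerable<string> urls)
{
    var client = new HttpClient();
    // The bag is unordered, so results do not come back in the order of urls.
    var bag = new ConcurrentBag<string>();
    // Download at most 5 pages concurrently.
    var block = new ActionBlock<string>(
        async url => bag.Add(await client.GetStringAsync(url)),
        new ExecutionDataflowBlockOptions {MaxDegreeOfParallelism = 5});
    foreach (var url in urls)
    {
        block.Post(url);
    }
    // Signal that no more URLs will be posted, then wait for the queued downloads.
    block.Complete();
    await block.Completion;
    return bag;
}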


Happy的楠姐

2021-02-10 05:34

I recommend you use HttpClient with Task.WhenAll, with SemaphoreSlim for simple throttling:

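using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Allows at most 5 downloads to run at the same time.
private readonly SemaphoreSlim _mutex = new SemaphoreSlim(5);
private readonly HttpClient _client = new HttpClient();
private async Task<string> DownloadStringAsync(string url)
{
  await _mutex.WaitAsync();
  try
  {
    return await _client.GetStringAsync(url);
  }
  finally
  {
    // Always free the slot, even if the download throws.
    _mutex.Release();
  }
}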

IEnumerable<string> urls = ...;
var data = await Task.WhenAll(urls.Select(url => DownloadStringAsync(url)));

Alternatively, you could use TPL Dataflow and set MaxDegreeOfParallelism for the throttling.
