Parallel scraping in .NET

攒了一身酷 2021-02-10 04:45

The company I work for runs a few hundred very dynamic web sites. It has decided to build a search engine and I was tasked with writing the scraper. Some of the sites run on old

2 answers
  • 2021-02-10 05:29

    TPL Dataflow and async-await are powerful and simple enough to do just what you need:

    async Task<IEnumerable<string>> GetAllStringsAsync(IEnumerable<string> urls)
    {
        var client = new HttpClient();
        var bag = new ConcurrentBag<string>();
        var block = new ActionBlock<string>(
            async url => bag.Add(await client.GetStringAsync(url)),
            new ExecutionDataflowBlockOptions {MaxDegreeOfParallelism = 5});
        foreach (var url in urls)
        {
            block.Post(url);
        }
        block.Complete();
        await block.Completion;
        return bag;
    }
    
  • 2021-02-10 05:34

    I recommend you use HttpClient with Task.WhenAll, with SemaphoreSlim for simple throttling:

    private SemaphoreSlim _mutex = new SemaphoreSlim(5);
    private HttpClient _client = new HttpClient();
    private async Task<string> DownloadStringAsync(string url)
    {
      await _mutex.WaitAsync();
      try
      {
        return await _client.GetStringAsync(url);
      }
      finally
      {
        _mutex.Release();
      }
    }
    
    IEnumerable<string> urls = ...;
    var data = await Task.WhenAll(urls.Select(url => DownloadStringAsync(url)));
    

    Alternatively, you could use TPL Dataflow and set MaxDegreeOfParallelism for the throttling.
