Parallel scraping in .NET

攒了一身酷 2021-02-10 04:45

The company I work for runs a few hundred very dynamic web sites. It has decided to build a search engine and I was tasked with writing the scraper. Some of the sites run on old

2 answers
  •  不知归路
    2021-02-10 05:29

    TPL Dataflow and async-await are indeed powerful and simple enough to do just what you need:

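    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Net.Http;
    using System.Threading.Tasks;
    using System.Threading.Tasks.Dataflow;

    async Task<IEnumerable<string>> GetAllStringsAsync(IEnumerable<string> urls)
    {
        var client = new HttpClient();
        // Thread-safe collection; results are not in the same order as the input URLs.
        var bag = new ConcurrentBag<string>();
        // Download at most 5 pages concurrently.
        var block = new ActionBlock<string>(
            async url => bag.Add(await client.GetStringAsync(url)),
            new ExecutionDataflowBlockOptions {MaxDegreeOfParallelism = 5});
        foreach (var url in urls)
        {
            block.Post(url);
        }
        // Signal that no more URLs will be posted, then wait for queued downloads to finish.
        block.Complete();
        await block.Completion;
        return bag;
    }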
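
    One caveat for a scraper that hits a few hundred sites: if a single download throws, the ActionBlock faults, discards the URLs still queued, and await block.Completion rethrows. A minimal sketch of one way around that is to catch failures per URL inside the delegate. The names ScraperSketch, FetchAllAsync, fetch and maxParallelism are hypothetical, and the fetch function is passed in only so the sketch can be tried without network access:

    ```csharp
    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using System.Threading.Tasks.Dataflow;

    static class ScraperSketch
    {
        // Hypothetical helper: returns the pages that downloaded successfully
        // plus the exception for each URL that failed, both keyed by URL.
        public static async Task<(IReadOnlyDictionary<string, string> Pages,
                                  IReadOnlyDictionary<string, Exception> Failures)>
            FetchAllAsync(
                IEnumerable<string> urls,
                Func<string, Task<string>> fetch,
                int maxParallelism = 5)
        {
            var pages = new ConcurrentDictionary<string, string>();
            var failures = new ConcurrentDictionary<string, Exception>();

            var block = new ActionBlock<string>(
                async url =>
                {
                    try
                    {
                        pages[url] = await fetch(url);
                    }
                    catch (Exception ex)
                    {
                        // Record the failure instead of letting it fault the whole block.
                        failures[url] = ex;
                    }
                },
                new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = maxParallelism });

            foreach (var url in urls)
            {
                block.Post(url);
            }

            block.Complete();
            await block.Completion;
            return (pages, failures);
        }
    }
    ```

    With a shared HttpClient it could be called as await ScraperSketch.FetchAllAsync(urls, url => client.GetStringAsync(url)), after which the failed URLs can be logged or retried.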
    
