C# Download data from huge list of urls [duplicate]

Posted by 你。 on 2019-11-26 16:48:00

Question


I have a huge list of web pages that each display a status I need to check. Some of the URLs are on the same site; another set is on a different site.

Right now I'm trying to do this in parallel with code like the following, but I have the feeling that I'm causing too much overhead.

while(ListOfUrls.Count > 0){
  Parallel.ForEach(ListOfUrls, url =>
  {
    WebClient webClient = new WebClient();
    webClient.DownloadString(url);
    // ... run my checks here ...
  });

  ListOfUrls = GetNewUrls.....
}

Can this be done with less overhead, and with more control over how many WebClient instances and connections I use or reuse, so that in the end the job finishes faster?


Answer 1:


Parallel.ForEach is good for CPU-bound computational tasks, but here it unnecessarily blocks thread-pool threads on synchronous IO-bound calls such as DownloadString. You can improve the scalability of your code, and reduce the number of threads it uses, by switching to DownloadStringTaskAsync and tasks:

// non-blocking async method
async Task<string> ProcessUrlAsync(string url)
{
    using (var webClient = new WebClient())
    {
        string data = await webClient.DownloadStringTaskAsync(new Uri(url));
        // run checks here.. 
        return data;
    }
}

// ...

if (ListOfUrls.Count > 0) {
    var tasks = new List<Task>();
    foreach (var url in ListOfUrls)
    {
        tasks.Add(ProcessUrlAsync(url));
    }

    Task.WaitAll(tasks.ToArray()); // blocking wait

    // could use await here and make this method async:
    // await Task.WhenAll(tasks.ToArray());
}
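
The loop above starts every download at once. To get the control over the number of simultaneous downloads that the question asks about, the tasks can be gated with a SemaphoreSlim. This is a minimal sketch and not part of the original answer: ThrottledDownloader, DownloadAllAsync, the fetch delegate and maxConcurrency are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not from the original answer.
static class ThrottledDownloader
{
    // Calls fetch for every URL, but lets at most maxConcurrency calls
    // run at the same time. Results come back in input order.
    public static async Task<string[]> DownloadAllAsync(
        IEnumerable<string> urls,
        Func<string, Task<string>> fetch,
        int maxConcurrency)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = urls.Select(async url =>
            {
                await gate.WaitAsync();   // wait for a free slot without blocking a thread
                try
                {
                    return await fetch(url);
                }
                finally
                {
                    gate.Release();       // free the slot even if fetch throws
                }
            }).ToList();                  // ToList starts all the gated tasks

            return await Task.WhenAll(tasks);
        }
    }
}
```

Assuming ListOfUrls is a collection of strings, a call such as await ThrottledDownloader.DownloadAllAsync(ListOfUrls, ProcessUrlAsync, 10) would run the method from this answer with at most 10 downloads in flight. The value 10 is an arbitrary starting point to tune, not a recommendation.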



Answer 2:


You can try HttpClient, which was added in .NET 4.5. It is considered to be faster, and it might improve your performance a little.

using (HttpClient client = new HttpClient())
using (HttpResponseMessage response = await client.GetAsync(url))
using (HttpContent content = response.Content)
{
    string result = await content.ReadAsStringAsync();
}
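
The snippet above creates and disposes an HttpClient for a single request. With a long list of URLs, the usual recommendation is to create one instance and reuse it, so that connections to the same host can be reused as well. The following is a sketch under that assumption: StatusChecker and FetchAsync are hypothetical names, and the 30-second timeout is an arbitrary illustrative value.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper, not from the original answer.
static class StatusChecker
{
    // One shared client for all requests. Request methods such as
    // GetAsync may be called concurrently on the same instance.
    private static readonly HttpClient Client = new HttpClient
    {
        Timeout = TimeSpan.FromSeconds(30) // assumed value, tune as needed
    };

    public static async Task<string> FetchAsync(string url)
    {
        // Dispose each response, but keep the client alive.
        using (HttpResponseMessage response = await Client.GetAsync(url))
        {
            response.EnsureSuccessStatusCode(); // throws on non-2xx status codes
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

Whether this is measurably faster than WebClient depends on the workload; the main gain is avoiding the cost of setting up a new client and new connections for every URL.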



Answer 3:


An oft-overlooked element in the web.config or app.config file of your application is the connectionManagement element. In particular, .NET limits the number of simultaneous connections to a single domain to 2 by default. See the documentation for the connectionManagement element for details.

If I understood your question correctly, it stands to reason that web clients created in parallel against 2 domains will be limited to 4 simultaneous connections by default (2 per domain), giving less speedup than you would otherwise expect.

If you are connecting to many domains, however, the other answers are likely to yield more speedup, since waiting on the response is probably a large part of the cost of each loop iteration. If you are on .NET 4.5, the HttpClient.GetStringAsync method is probably your friend.
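
The per-domain limit described in this answer can also be raised in code rather than in the configuration file, through ServicePointManager.DefaultConnectionLimit. This is a sketch assuming the classic .NET Framework, which is what this answer describes: ConnectionSetup and RaiseConnectionLimit are hypothetical names, and HttpClient on .NET Core and later generally does not honor this setting.

```csharp
using System.Net;

// Hypothetical helper, not from the original answer.
static class ConnectionSetup
{
    // Raises the default number of simultaneous connections per host.
    // Assumption: .NET Framework. Call this once at startup, before the
    // first request, because connections to hosts that were already
    // contacted may keep the limit they were created with.
    public static void RaiseConnectionLimit(int perHostLimit)
    {
        ServicePointManager.DefaultConnectionLimit = perHostLimit;
    }
}
```

For example, ConnectionSetup.RaiseConnectionLimit(10) would allow up to 10 simultaneous connections per host. The right value depends on what the target sites tolerate, so treat any number as something to test.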




Answer 4:


Have you thought about executing your code asynchronously? I don't think there is a faster way to fetch each page from the Internet, but you can fetch many of them simultaneously.



Source: https://stackoverflow.com/questions/19389938/c-sharp-download-data-from-huge-list-of-urls
