I'm using Parallel LINQ (PLINQ), and I'm trying to download many URLs concurrently using essentially code like this:
int threads = 10;
Dictionary
Do the URLs refer to the same server? If so, it could be that you are hitting the HTTP connection limit instead of the threading limit. There's an easy way to tell - change your code to:
int threads = 10;
Dictionary<string, string> results = urls.AsParallel(threads)
    .ToDictionary(url => url,
                  url => {
                      Console.WriteLine("On thread {0}",
                                        Thread.CurrentThread.ManagedThreadId);
                      return GetPage(url);
                  });
EDIT: Hmm. I can't get ToDictionary() to parallelise at all with a bit of sample code. It works fine for Select(url => GetPage(url)) but not ToDictionary. Will search around a bit.
EDIT: Okay, I still can't get ToDictionary to parallelise, but you can work around that. Here's a short but complete program:
using System;
using System.Collections.Generic;
using System.Threading;
using System.Linq;
using System.Linq.Parallel;

public class Test
{
    static void Main()
    {
        var urls = Enumerable.Range(0, 100).Select(i => i.ToString());
        int threads = 10;
        Dictionary<string, string> results = urls.AsParallel(threads)
            .Select(url => new { Url=url, Page=GetPage(url) })
            .ToDictionary(x => x.Url, x => x.Page);
    }

    static string GetPage(string x)
    {
        Console.WriteLine("On thread {0} getting {1}",
                          Thread.CurrentThread.ManagedThreadId, x);
        Thread.Sleep(2000);
        return x;
    }
}
So, how many threads does this use? 5. Why? Goodness knows. I've got 2 processors, so that's not it - and we've specified 10 threads, so that's not it. It still uses 5 even if I change GetPage to hammer the CPU.
If you only need to use this for one particular task - and you don't mind slightly smelly code - you might be best off implementing it yourself, to be honest.
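For readers on the released version of PLINQ (.NET 4 and later), where AsParallel() takes no thread count, the same Select-then-ToDictionary workaround could be sketched roughly as below. This is a sketch under assumptions, not the code from the answer above: the PlinqDownload class, the FetchAll method, and the getPage delegate are hypothetical names introduced for illustration, and WithDegreeOfParallelism should be read as an upper bound on concurrency rather than a guarantee of how many threads actually run.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PlinqDownload
{
    // Hypothetical helper: runs getPage over the URLs in parallel and
    // collects the results into a dictionary keyed by URL.
    public static Dictionary<string, string> FetchAll(
        IEnumerable<string> urls, Func<string, string> getPage, int threads)
    {
        return urls.AsParallel()
                   // Caps the number of concurrently executing tasks.
                   .WithDegreeOfParallelism(threads)
                   // Do the slow work in Select so it is part of the parallel query...
                   .Select(url => new { Url = url, Page = getPage(url) })
                   // ...and only build the dictionary from the finished pairs.
                   .ToDictionary(x => x.Url, x => x.Page);
    }
}
```

A call might look like PlinqDownload.FetchAll(urls, GetPage, 10), with GetPage being the blocking download method from the question.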
Monitor your network traffic. If the URLs are all on the same domain, the server may be limiting bandwidth, so more connections might not actually provide any speed-up.
By default, .NET has a limit of 2 concurrent connections to a single service point (IP:port). That's why you would not see a difference if all the URLs point to one and the same server.
The limit can be changed using the ServicePointManager.DefaultConnectionLimit property (DefaultPersistentConnectionLimit is only the constant that holds the default value of 2).
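The settable member on ServicePointManager is the static DefaultConnectionLimit property. A minimal sketch of raising the limit follows, assuming the classic HttpWebRequest/WebClient stack on .NET Framework; the ConnectionLimit class and Raise method are hypothetical names used only for this example. The value generally needs to be set before the first request to a host, since it is picked up when the service point for that host is created.

```csharp
using System.Net;

public static class ConnectionLimit
{
    // Hypothetical helper: raises the per-service-point connection cap
    // for service points created after this call.
    public static void Raise(int limit)
    {
        ServicePointManager.DefaultConnectionLimit = limit;
    }
}
```

Calling ConnectionLimit.Raise(10) at startup would match the 10 threads used in the question. On newer runtimes, where HttpClient does not go through ServicePointManager, the corresponding knob is HttpClientHandler.MaxConnectionsPerServer.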
I think there are already good answers to the question, but I'd like to make one important point. Using PLINQ for tasks that are not CPU-bound is, in principle, the wrong design. That's not to say it won't work - it will - but using multiple threads when they are unnecessary can cause trouble.
Unfortunately, there is no good way to solve this problem in C#. In F# you could use asynchronous workflows that run in parallel but don't block the thread while performing asynchronous calls (under the covers, they use the BeginOperation and EndOperation methods). You can find more information here:
The same idea can, to some extent, be used in C#; it looks a bit weird, but it is more efficient. I wrote an article about that and there is also a library that should be slightly more evolved than my original idea:
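Separately from the article and library mentioned above, later versions of C# (5 and up) added async/await, which expresses the same non-blocking idea directly. The following is a minimal sketch under that assumption, not the approach from the linked material: AsyncDownloader, FetchAllAsync, and the fetch delegate are hypothetical names, and the SemaphoreSlim is just one way to cap the number of requests in flight.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncDownloader
{
    // Hypothetical helper: starts a fetch for every distinct URL, allowing at
    // most maxConcurrency fetches in flight, without blocking a thread per request.
    public static async Task<Dictionary<string, string>> FetchAllAsync(
        IEnumerable<string> urls, Func<string, Task<string>> fetch, int maxConcurrency)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = urls.Distinct().Select(async url =>
            {
                await gate.WaitAsync();          // wait for a free slot asynchronously
                try
                {
                    return (Url: url, Page: await fetch(url));
                }
                finally
                {
                    gate.Release();              // free the slot even if fetch throws
                }
            }).ToList();                         // ToList starts all the tasks now

            var pages = await Task.WhenAll(tasks);
            return pages.ToDictionary(p => p.Url, p => p.Page);
        }
    }
}
```

With a shared HttpClient instance named client, a call might look like await AsyncDownloader.FetchAllAsync(urls, url => client.GetStringAsync(url), 10). Passing the fetch step as a delegate keeps the sketch independent of any particular HTTP API.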