I am trying to parallelize my web parsing tool, but the speed gains seem minimal. I have an i7-2600K (4 cores, 8 threads with Hyper-Threading).
Here is some code to show you the idea:
A brief perusal of HtmlAgilityPack.HtmlWeb confirms that it is using the synchronous WebRequest API. You are therefore placing long-running tasks into the ThreadPool (via Parallel). The ThreadPool is designed for short-lived operations that yield the thread back to the pool quickly; blocking on I/O is a big no-no. Given the ThreadPool's reluctance to start new threads (because it is not designed for this kind of usage), you're going to be constrained by this behaviour.
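To make that concrete, the pattern in play looks roughly like this (the URL list and parsing step are illustrative, not your actual code):

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using HtmlAgilityPack;

class Scraper
{
    static void Run(List<string> urls)
    {
        // Each iteration blocks a ThreadPool thread on synchronous I/O
        // inside HtmlWeb.Load, so most "workers" sit idle waiting on the
        // network while the pool is slow to inject replacement threads.
        Parallel.ForEach(urls, url =>
        {
            var doc = new HtmlWeb().Load(url); // synchronous WebRequest under the hood
            // ... parse doc ...
        });
    }
}
```

The parallelism you see is bounded by how quickly the ThreadPool grows, not by your core count, which is why adding cores barely helps.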
Fetch your web content asynchronously (see here and here for the correct API to use, you'll have to investigate further yourself...) so that you are not tying up the ThreadPool with blocking tasks. You can then feed the decoded response to the HtmlAgilityPack for parsing.
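A sketch of the shape this takes, assuming .NET 4.5+ and HttpClient (if you're on an older framework, WebClient.DownloadStringTaskAsync or BeginGetResponse fills the same role):

```csharp
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class AsyncScraper
{
    static async Task<HtmlDocument[]> FetchAllAsync(string[] urls)
    {
        using (var client = new HttpClient())
        {
            // Downloads overlap without holding ThreadPool threads:
            // each thread returns to the pool while its request is in flight.
            var tasks = urls.Select(async url =>
            {
                string html = await client.GetStringAsync(url);
                var doc = new HtmlDocument();
                doc.LoadHtml(html); // parsing is CPU-bound work, fine on the pool
                return doc;
            });

            return await Task.WhenAll(tasks);
        }
    }
}
```

Note the split: the network wait is asynchronous, and only the actual parsing occupies a thread.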
If you really want to jazz up performance, you'll also need to consider that WebRequest is incapable of performing asynchronous DNS lookup. IMO this is a terrible flaw in the design of WebRequest. From the BeginGetResponse documentation:
The BeginGetResponse method requires some synchronous setup tasks to complete (DNS resolution, proxy detection, and TCP socket connection, for example) before this method becomes asynchronous.
It makes high-performance downloading a real PITA. It's at about this time that you might consider writing your own HTTP library so that everything can execute without blocking (and therefore starving the ThreadPool).
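Short of writing your own library, one mitigation is to warm the resolver before issuing the request, so that BeginGetResponse's synchronous setup hits the OS DNS cache instead of waiting on a network round trip. A sketch, assuming .NET 4.5+ and that the OS resolver caches the result (the fiddlier alternative of connecting straight to the IP and setting the Host header yourself gets awkward, especially over HTTPS):

```csharp
using System;
using System.Net;
using System.Threading.Tasks;

class DnsWarmer
{
    // Resolve the host off the request path. On Windows the result lands
    // in the OS resolver cache, so the subsequent WebRequest setup should
    // not block on DNS.
    static async Task WarmDnsAsync(Uri uri)
    {
        IPAddress[] addresses = await Dns.GetHostAddressesAsync(uri.Host);
        Console.WriteLine("{0} resolved to {1} address(es)", uri.Host, addresses.Length);
    }
}
```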
As an aside, getting maximum throughput when churning through web pages is a tricky affair. In my experience, you get the code right and are then let down by the routing equipment it has to go through. Many domestic routers simply aren't up to the job.