Recently I started working on trying to mass-scrape a website for archiving purposes and I thought it would be a good idea to have multiple web requests working asynchronous
Tasks run on the thread pool by default, which is just what it sounds like: a pool of threads. The thread pool is optimized for a lot of situations, but throwing Thread.Sleep in there probably throws a wrench into most of them. Also, Task.Factory.StartNew is generally a bad idea to use, because many people don't understand how it works. Try this instead:
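// Requires: using System; using System.Diagnostics; using System.Threading.Tasks;
static void Main(string[] args)
{
    for (int i = 0; i < 10; i++) {
        int i2 = i + 1; // per-iteration copy, so each lambda captures its own value
        Stopwatch t = new Stopwatch();
        t.Start();
        Task.Run(async () => {
            t.Stop();
            Console.ForegroundColor = ConsoleColor.Green; //Note that the other tasks might manage to write their lines between these colour changes messing up the colours.
            // TotalSeconds avoids printing 5 ms as "0.5s", which Seconds + "." + Milliseconds would do.
            Console.WriteLine("Task " + i2 + " started after " + t.Elapsed.TotalSeconds.ToString("F3") + "s");
            await Task.Delay(5000); // non-blocking wait: the pool thread is free in the meantime
            Console.ForegroundColor = ConsoleColor.Yellow;
            Console.WriteLine("Task " + i2 + " finished");
        });
    }
    Console.ReadKey();
}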
The thread pool has a limited number of threads at its disposal. That number changes depending on certain conditions, but the general point holds. For this reason, you should never do anything blocking on the thread pool (if you want to achieve parallelism, that is). Thread.Sleep is a perfect example of a blocking API, but so are most web request APIs, unless you use the newer async versions.
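To make the difference concrete, here is a minimal sketch that times the same batch of pool tasks twice, once pausing with Thread.Sleep and once with await Task.Delay. The class name BlockingVsAsync, the Measure helper and the task counts are made up for illustration; the actual timings depend on your machine and on how many threads the pool currently has.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class BlockingVsAsync
{
    // Starts `count` pool tasks that each pause for `ms` milliseconds and
    // returns how long it took for all of them to finish.
    public static TimeSpan Measure(int count, int ms, bool block)
    {
        var sw = Stopwatch.StartNew();
        Task[] tasks = Enumerable.Range(0, count)
            .Select(_ => Task.Run(async () =>
            {
                if (block)
                    Thread.Sleep(ms);      // holds a pool thread for the whole pause
                else
                    await Task.Delay(ms);  // hands the thread back while waiting
            }))
            .ToArray();
        Task.WaitAll(tasks);
        return sw.Elapsed;
    }

    static void Main()
    {
        Console.WriteLine("Thread.Sleep: " + Measure(64, 500, block: true));
        Console.WriteLine("Task.Delay:   " + Measure(64, 500, block: false));
    }
}
```

On a machine with only a few cores, the Thread.Sleep run would typically take noticeably longer than the Task.Delay run, which should stay close to the length of a single pause.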
So the problem in your original crawling program is probably the same as in the sample you posted: you are blocking all the thread pool threads, so the pool is forced to spin up new threads, which it does only gradually, and the queued work backs up behind them.
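For the crawler itself, a rough sketch of what the non-blocking version could look like is shown below, using HttpClient.GetStringAsync as one of the newer async APIs. The class AsyncFetchSketch, the method FetchLengthsAsync and the two URLs are placeholders, not anything from your program; the download function is passed in as a delegate only so the same code can be tried without a network.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

static class AsyncFetchSketch
{
    // Downloads every distinct URL concurrently and returns the length of each body.
    // `fetch` is injected so a real HttpClient or a stand-in can be used.
    public static async Task<Dictionary<string, int>> FetchLengthsAsync(
        IEnumerable<string> urls, Func<string, Task<string>> fetch)
    {
        // Calling fetch starts each request; no thread is blocked while it is in flight.
        var pending = urls.Distinct().ToDictionary(u => u, u => fetch(u));
        await Task.WhenAll(pending.Values);
        return pending.ToDictionary(p => p.Key, p => p.Value.Result.Length);
    }

    static async Task Main()
    {
        using (var client = new HttpClient())
        {
            var urls = new[] { "https://example.com/", "https://example.org/" }; // placeholder URLs
            var lengths = await FetchLengthsAsync(urls, u => client.GetStringAsync(u));
            foreach (var pair in lengths)
                Console.WriteLine(pair.Key + ": " + pair.Value + " characters");
        }
    }
}
```

This starts every request at once, which is fine for a handful of pages; a real archiving job would probably want some cap on how many requests are in flight.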
Coincidentally, using Task.Run in this way also makes it easy to rewrite the code so that you know when it is complete. By storing a reference to each started task and waiting on them all at the end (this does not prevent parallelism), you can reliably know when all the tasks have completed. The following shows how to achieve that:
// Requires: using System; using System.Collections.Generic; using System.Diagnostics; using System.Threading.Tasks;
static void Main(string[] args)
{
    var tasks = new List<Task>();
    for (int i = 0; i < 10; i++) {
        int i2 = i + 1; // per-iteration copy, so each lambda captures its own value
        Stopwatch t = new Stopwatch();
        t.Start();
        tasks.Add(Task.Run(async () => {
            t.Stop();
            Console.ForegroundColor = ConsoleColor.Green; //Note that the other tasks might manage to write their lines between these colour changes messing up the colours.
            // TotalSeconds avoids printing 5 ms as "0.5s", which Seconds + "." + Milliseconds would do.
            Console.WriteLine("Task " + i2 + " started after " + t.Elapsed.TotalSeconds.ToString("F3") + "s");
            await Task.Delay(5000);
            Console.ForegroundColor = ConsoleColor.Yellow;
            Console.WriteLine("Task " + i2 + " finished");
        }));
    }
    Task.WaitAll(tasks.ToArray()); // blocks the main thread until every task has finished
    Console.WriteLine("All tasks completed");
    Console.ReadKey();
}
Note: this code has not been tested
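Task.WaitAll in the listing above blocks the main thread while it waits. If your compiler supports an async Main (C# 7.1 or later), a variant that awaits Task.WhenAll instead might look like the sketch below. WhenAllSketch, WorkAsync and RunAllAsync are illustrative names, and the delay merely stands in for a real web request.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

static class WhenAllSketch
{
    // Simulates one unit of work; the delay stands in for a real web request.
    public static async Task<int> WorkAsync(int id, int ms)
    {
        await Task.Delay(ms);
        return id;
    }

    // Starts `count` units of work and completes when all of them have finished.
    public static async Task<int[]> RunAllAsync(int count, int ms)
    {
        var tasks = new List<Task<int>>();
        for (int i = 1; i <= count; i++)
            tasks.Add(WorkAsync(i, ms));   // the work starts as soon as the method is called
        return await Task.WhenAll(tasks);  // results arrive in the order the tasks were added
    }

    static async Task Main()
    {
        int[] ids = await RunAllAsync(10, 5000);
        Console.WriteLine("All " + ids.Length + " tasks completed");
    }
}
```

Because WorkAsync never blocks, there is no need to wrap it in Task.Run here; that would only matter if the work did something CPU-heavy or blocking before its first await.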
More info on Task.Factory.StartNew and why it should be avoided: http://blog.stephencleary.com/2013/08/startnew-is-dangerous.html.
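One of the pitfalls is easy to reproduce. When StartNew is given an async lambda it returns a Task<Task>, and the outer task completes as soon as the lambda reaches its first await, not when the work is done. Task.Run unwraps this for you; with StartNew you have to call Unwrap() yourself. The sketch below is only an illustration, and StartNewPitfall and its two methods are made-up names.

```csharp
using System;
using System.Threading.Tasks;

static class StartNewPitfall
{
    // Returns true if the inner async work was still pending at the moment
    // the task returned by StartNew reported completion.
    public static bool OuterCompletesBeforeInner(int ms)
    {
        Task<Task> outer = Task.Factory.StartNew(async () =>
        {
            await Task.Delay(ms);
        });
        outer.Wait();                            // returns once the lambda hits its first await
        Task inner = outer.Result;               // the task that tracks the actual work
        bool innerPending = !inner.IsCompleted;
        inner.Wait();                            // now the work really is finished
        return innerPending;
    }

    // Unwrap() turns the Task<Task> into a single task that finishes with the inner work.
    public static Task StartAndUnwrap(int ms)
    {
        return Task.Factory.StartNew(async () => { await Task.Delay(ms); }).Unwrap();
    }

    static void Main()
    {
        Console.WriteLine("Inner work pending when outer completed: " + OuterCompletesBeforeInner(1000));
        StartAndUnwrap(1000).Wait();
        Console.WriteLine("Unwrapped task completed");
    }
}
```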
I think this is occurring because you have exhausted all available threads in the thread pool. Try starting your tasks using TaskCreationOptions.LongRunning. More details here.
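As a rough sketch of that suggestion, the overload of StartNew that takes TaskCreationOptions can be used like this. LongRunningSketch and RunBlockingJobs are names invented for the example, and Thread.Sleep stands in for whatever blocking call your tasks make.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class LongRunningSketch
{
    // Runs `count` blocking jobs, each hinted as long-running,
    // and returns the total wall-clock time.
    public static TimeSpan RunBlockingJobs(int count, int ms)
    {
        var sw = Stopwatch.StartNew();
        Task[] tasks = Enumerable.Range(0, count)
            .Select(_ => Task.Factory.StartNew(
                () => Thread.Sleep(ms),              // stand-in for a blocking call
                TaskCreationOptions.LongRunning))    // hint to the scheduler: this job will block for a while
            .ToArray();
        Task.WaitAll(tasks);
        return sw.Elapsed;
    }

    static void Main()
    {
        Console.WriteLine("32 blocking jobs took " + RunBlockingJobs(32, 1000));
    }
}
```

LongRunning is only a hint to the scheduler, and it fits synchronous, blocking delegates. With an async lambda, the task that StartNew started ends at the first await, so the hint buys very little there.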
Another problem is that you are using Thread.Sleep, which blocks the current thread and is a waste of resources. Try waiting asynchronously using await Task.Delay. You may need to change your lambda to be async.
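// Note: with an async lambda, StartNew returns a Task<Task>; call .Unwrap() on it
// if you need a task that completes only when the inner work has finished.
Task.Factory.StartNew(async () => {
    t.Stop();
    Console.ForegroundColor = ConsoleColor.Green; //Note that the other tasks might manage to write their lines between these colour changes messing up the colours.
    // TotalSeconds avoids printing 5 ms as "0.5s", which Seconds + "." + Milliseconds would do.
    Console.WriteLine("Task " + i2 + " started after " + t.Elapsed.TotalSeconds.ToString("F3") + "s");
    await Task.Delay(5000);
    Console.ForegroundColor = ConsoleColor.Yellow;
    Console.WriteLine("Task " + i2 + " finished");
});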