I'm writing a C# console application that scrapes data from web pages.
This application will go to about 8000 web pages and scrape data (the same format of data on each page).
If you want to use the async and await keywords (you don't have to, but they do make things easier in .NET 4.5), you would first want to change your ScrapeData method to return a Task<DataSet> and mark it with the async keyword, like so:
async Task<DataSet> ScrapeDataAsync(Uri url)
{
    // Create the HttpClientHandler which will handle cookies.
    var handler = new HttpClientHandler();

    // Set cookies on handler.

    // Create the client; the async fetch below is awaited, then the
    // content is converted to a data set and returned.
    var client = new HttpClient(handler);

    // Wait for the HttpResponseMessage.
    HttpResponseMessage response = await client.GetAsync(url);

    // Get the content, await on the string content.
    string content = await response.Content.ReadAsStringAsync();

    // Process content variable here into a data set and return.
    DataSet ds = ...;

    // Return the DataSet; the caller receives a Task<DataSet>.
    return ds;
}
Note that you'll probably want to move away from the WebClient class: its async operations were designed around events rather than Task<T>, and the Task-returning methods such as DownloadStringTaskAsync were only layered on top in .NET 4.5. A better choice in .NET 4.5 is the HttpClient class, which I've used above. Also, take a look at the HttpClientHandler class, specifically the CookieContainer property, which you'll use to send cookies with each request.
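As a rough illustration of the CookieContainer point, here is a minimal sketch of setting a cookie on the handler before creating the client. The class name ScraperClientFactory, its two methods, and the cookie name and value passed to them are hypothetical and only for illustration; what your site actually needs will differ.

```csharp
using System;
using System.Net;
using System.Net.Http;

static class ScraperClientFactory
{
    // Hypothetical helper: builds a handler whose CookieContainer already
    // holds one cookie for the given site.
    public static HttpClientHandler CreateHandler(Uri site, string cookieName, string cookieValue)
    {
        var cookies = new CookieContainer();

        // Cookies added for a URI are sent with requests that match it.
        cookies.Add(site, new Cookie(cookieName, cookieValue));

        return new HttpClientHandler { CookieContainer = cookies };
    }

    // Wraps the handler in an HttpClient that can be used for the requests.
    public static HttpClient CreateClient(HttpClientHandler handler)
    {
        return new HttpClient(handler);
    }
}
```

With about 8000 requests, it is probably worth creating one client this way and reusing it for every page instead of building a new handler and client per call.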
Making the method async means that you will more than likely have to use the await keyword inside it to wait for another async operation, which in this case is most likely the download of the page. You'll have to tailor your calls that download data to use the asynchronous versions and await those.
Once that is complete, you would normally await the Task<DataSet> that ScrapeDataAsync returns, but you don't want to do that here: you are running a loop, and awaiting inside it would make each download wait for the previous one to finish. In this case, it's better to just store each Task<DataSet> in a list, like so:
DataSet allData = ...;
var tasks = new List<Task<DataSet>>();

foreach (var url in the8000urls)
{
    // ScrapeDataAsync downloads the HTML from the url,
    // scrapes the data into several DataTables and
    // returns them as a DataSet.
    tasks.Add(ScrapeDataAsync(url));
}
There is still the matter of merging the data into allData. To that end, you want to call the ContinueWith method on the Task<DataSet> instance returned and, in the continuation, add the data to allData:
DataSet allData = ...;

// ContinueWith returns a plain Task here, so the list holds Task.
var tasks = new List<Task>();

foreach (var url in the8000urls)
{
    // ScrapeDataAsync downloads the HTML from the url,
    // scrapes the data into several DataTables and
    // returns them as a DataSet.
    tasks.Add(ScrapeDataAsync(url).ContinueWith(t => {
        // Lock access to the data set, since this is
        // async now.
        lock (allData)
        {
            // Add the data, e.g. allData.Merge(t.Result);
        }
    }));
}
Then, you can wait on all the tasks using the WhenAll method on the Task class and await that:
// After your loop.
await Task.WhenAll(tasks);
// Process allData
However, note that you have a foreach, and WhenAll takes an IEnumerable<Task>; that is a good indicator that the loop can be replaced with a LINQ query:
DataSet allData = ...;

var tasks =
    from url in the8000urls
    select ScrapeDataAsync(url).ContinueWith(t => {
        // Lock access to the data set, since this is
        // async now.
        lock (allData)
        {
            // Add the data.
        }
    });

await Task.WhenAll(tasks);

// Process allData
You can also choose not to use query syntax if you wish; it doesn't matter in this case.
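For comparison, here is a sketch of the same thing in method syntax. To keep it runnable without the network, the scraping function is passed in as a parameter named scrape, which stands in for ScrapeDataAsync; that parameter, the class name ScrapeMerger, and the use of DataSet.Merge to add the data are my assumptions, not something your code has to follow.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;
using System.Threading.Tasks;

static class ScrapeMerger
{
    // 'scrape' stands in for ScrapeDataAsync; passing it in is an assumption
    // made so the sketch can run without touching the network.
    public static async Task<DataSet> ScrapeAllAsync(
        IEnumerable<Uri> urls, Func<Uri, Task<DataSet>> scrape)
    {
        var allData = new DataSet();

        // Method-syntax equivalent of the query: one continuation per URL.
        IEnumerable<Task> tasks = urls.Select(url =>
            scrape(url).ContinueWith(t =>
            {
                // Continuations may run concurrently, so guard the shared DataSet.
                lock (allData)
                {
                    // Assumed merge step; t.Result rethrows if the scrape failed.
                    allData.Merge(t.Result);
                }
            }));

        // WhenAll enumerates the sequence once, which starts every scrape.
        await Task.WhenAll(tasks);
        return allData;
    }
}
```

A call would look something like `DataSet allData = await ScrapeMerger.ScrapeAllAsync(the8000urls, ScrapeDataAsync);`.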
Note that if the containing method is not marked as async (because you are in a console application and have to wait for the results before the app terminates), then you can simply call the Wait method on the Task returned when you call WhenAll:
// This will block, waiting for all tasks to complete, all
// tasks will run asynchronously and when all are done, then the
// code will continue to execute.
Task.WhenAll(tasks).Wait();
// Process allData.
The point is that you want to collect your Task instances into a sequence and then wait on the entire sequence before you process allData.
However, I'd suggest trying to process the data before merging it into allData if you can. Unless the data processing requires the entire DataSet, you'll get even more performance gains by processing as much of the data as you can when you get it back, as opposed to waiting for all of it to come back.
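Here is one way that per-page processing could look, as a sketch only. The processing step here just counts rows, which is a made-up stand-in for whatever your real processing is; the names PerPageProcessing, CountRows, CountAllRowsAsync and the scrape parameter (standing in for ScrapeDataAsync) are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;
using System.Threading.Tasks;

static class PerPageProcessing
{
    // Hypothetical per-page step: counts rows across all tables of one page.
    static int CountRows(DataSet pageData)
    {
        return pageData.Tables.Cast<DataTable>().Sum(table => table.Rows.Count);
    }

    // 'scrape' stands in for ScrapeDataAsync so the sketch runs offline.
    public static async Task<int> CountAllRowsAsync(
        IEnumerable<Uri> urls, Func<Uri, Task<DataSet>> scrape)
    {
        // Each page is processed as soon as its own download completes,
        // without waiting for the other pages.
        IEnumerable<Task<int>> tasks = urls.Select(async url =>
        {
            DataSet pageData = await scrape(url);
            return CountRows(pageData);
        });

        // WhenAll over Task<int> yields the per-page results as an array.
        int[] counts = await Task.WhenAll(tasks);
        return counts.Sum();
    }
}
```

Because each page's result is reduced on its own, nothing shared is touched until the end and no lock is needed.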