I\'m writing a C# console application that scrapes data from web pages.
This application will go to about 8000 web pages and scrape data(same format of data on each page
I recommend reading my reasonably-complete introduction to async/await.
First, make everything asynchronous, starting at the lower-level stuff:
public static async Task ScrapeDataAsync(string pageid)
{
CookieAwareWebClient webClient = ...;
var dsPageData = new DataSet();
// DOWNLOAD HTML FOR THE REO PAGE AND LOAD IT INTO AN HTMLDOCUMENT
string url = @"https://domain.com?&id=" + pageid + @"restofurl";
string html = await webClient.DownloadStringTaskAsync(url).ConfigureAwait(false);
var doc = new HtmlDocument();
doc.LoadHtml(html);
// A BUNCH OF PARSING WITH HTMLAGILITY AND STORING IN dsPageData
return dsPageData;
}
Then you can consume it as follows (using async
with LINQ):
DataSet alldata;
var tasks = the8000urls.Select(async url =>
{
var dataForOnePage = await ScrapeDataAsync(url);
//merge each table in dataForOnePage into allData
});
await Task.WhenAll(tasks);
PushAllDataToSql(alldata);
And use AsyncContext from my AsyncEx library since this is a console app:
class Program
{
static int Main(string[] args)
{
try
{
return AsyncContext.Run(() => MainAsync(args));
}
catch (Exception ex)
{
Console.Error.WriteLine(ex);
return -1;
}
}
static async Task MainAsync(string[] args)
{
...
}
}
That's it. No need for locking or continuations or any of that.