I\'m writing a C# console application that scrapes data from web pages.
This application will go to about 8000 web pages and scrape data(same format of data on each page
I believe you don't need async
and await
stuff here. They can help in desktop application where you need to move your work to non-GUI thread. In my opinion, it will be better to use Parallel.ForEach
method in your case. Something like this:
DataSet alldata;
var bag = new ConcurrentBag();
Parallel.ForEach(the8000urls, url =>
{
// ScrapeData downloads the html from the url with WebClient.DownloadString
// and scrapes the data into several datatables which it returns as a dataset.
DataSet dataForOnePage = ScrapeData(url);
// Add data for one page to temp bag
bag.Add(dataForOnePage);
});
//merge each table in dataForOnePage into allData from bag
PushAllDataToSql(alldata);