I'm writing a C# console application that scrapes data from web pages.
This application will go to about 8000 web pages and scrape data (same format of data on each page).
You could also use TPL Dataflow, which is a good fit for this kind of problem.
In this case, you build a "dataflow mesh" and then your data flows through it.
This one is actually more like a pipeline than a "mesh". I'm putting in three steps: download the (string) data from the URL; parse the (string) data into HTML and then into a DataSet; and merge the DataSet into the master DataSet.
First, we create the blocks that will go in the mesh:
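// Needs: using System.Data; using System.Threading.Tasks.Dataflow; using HtmlAgilityPack;
var allData = new DataSet();

// Step 1: download the page for one id (pageid is assumed to be an int here).
var downloadData = new TransformBlock<int, string>(
    async pageid =>
    {
        using (var webClient = new System.Net.WebClient())
        {
            var url = "https://domain.com?&id=" + pageid + "restofurl";
            return await webClient.DownloadStringTaskAsync(url);
        }
    },
    new ExecutionDataflowBlockOptions
    {
        MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded,
    });

// Step 2: parse the downloaded HTML into a DataSet for that page.
var parseHtml = new TransformBlock<string, DataSet>(
    html =>
    {
        var dsPageData = new DataSet();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // HTML Agility parsing
        return dsPageData;
    },
    new ExecutionDataflowBlockOptions
    {
        MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded,
    });

// Step 3: merge each page's DataSet into the master DataSet.
var merge = new ActionBlock<DataSet>(
    dataForOnePage =>
    {
        // merge dataForOnePage into allData
    });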
Then we link the three blocks together to create the mesh:
downloadData.LinkTo(parseHtml);
parseHtml.LinkTo(merge);
Next, we start pumping data into the mesh:
foreach (var pageid in the8000urls)
    downloadData.Post(pageid);
And finally, we wait for each step in the mesh to complete (this will also cleanly propagate any errors):
downloadData.Complete();
await downloadData.Completion;
parseHtml.Complete();
await parseHtml.Completion;
merge.Complete();
await merge.Completion;
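If completing and awaiting each block in turn feels repetitive, the links themselves can forward completion. The following is only a minimal sketch of that idea, not the scraping code itself: the `Task.Delay` "download", the string-length "parse", and the `total` accumulator are placeholder stages made up for illustration. It assumes a C# version with top-level statements and a reference to System.Threading.Tasks.Dataflow.

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Placeholder download step: pretends to fetch a page for an id.
var download = new TransformBlock<int, string>(async pageId =>
{
    await Task.Delay(10); // stands in for the real HTTP request
    return "page " + pageId;
});

// Placeholder parse step: reduces the "page" to a number.
var parse = new TransformBlock<string, int>(html => html.Length);

// Placeholder merge step. The default MaxDegreeOfParallelism is 1,
// so the accumulator is only ever updated by one item at a time.
var total = 0;
var merge = new ActionBlock<int>(n => { total += n; });

// PropagateCompletion forwards completion (and faults) to the next block.
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
download.LinkTo(parse, linkOptions);
parse.LinkTo(merge, linkOptions);

for (var id = 0; id < 5; id++)
    download.Post(id);

// Complete only the first block and await only the last one.
download.Complete();
await merge.Completion;

Console.WriteLine(total);
```

With this option, a fault in an earlier block should also surface when the last block's Completion is awaited, typically wrapped in an AggregateException, so the step-by-step waiting shown above remains the simpler choice if you want to see which stage failed.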
The nice thing about TPL Dataflow is that you can easily control how parallel each part is. For now, I've set both the download and parsing blocks to be Unbounded, but you may want to restrict them. The merge block uses the default maximum parallelism of 1, so no locks are necessary when merging.
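As a rough illustration of restricting a block, here is a sketch that caps a stand-in download step at 4 concurrent operations and bounds how many items it will hold. The numbers 4 and 16, the `Task.Delay` placeholder, and the `running`/`maxObserved` counters are all assumptions chosen for the demo; suitable limits for a real site would need to be found by experiment.

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

var running = 0;      // placeholder downloads currently in flight
var maxObserved = 0;  // highest concurrency seen during the run
var gate = new object();

var download = new TransformBlock<int, string>(
    async pageId =>
    {
        lock (gate)
        {
            running++;
            maxObserved = Math.Max(maxObserved, running);
        }
        await Task.Delay(20); // stands in for the real HTTP request
        lock (gate) { running--; }
        return "page " + pageId;
    },
    new ExecutionDataflowBlockOptions
    {
        MaxDegreeOfParallelism = 4, // at most 4 downloads at a time
        BoundedCapacity = 16,       // at most 16 items held by this block
    });

// Single-threaded sink (default parallelism of 1), so no lock is needed here.
var processed = 0;
var sink = new ActionBlock<string>(html => { processed++; });

download.LinkTo(sink, new DataflowLinkOptions { PropagateCompletion = true });

for (var id = 0; id < 50; id++)
{
    // A bounded block can decline a Post when it is full;
    // SendAsync waits until there is room instead.
    await download.SendAsync(id);
}

download.Complete();
await sink.Completion;

Console.WriteLine($"processed={processed}, maxObserved={maxObserved}");
```

The point about `SendAsync` matters once `BoundedCapacity` is set: `Post` returns false rather than waiting when the block has no room, so a plain `Post` loop could silently drop page ids.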