I'm working on an image mining project, and I used a HashSet instead of an array to avoid adding duplicate URLs while gathering them. I've reached the point in the code where I need to iterate over the HashSet.
Thanks for all the help. I managed to solve this issue by using a Dictionary instead of a HashSet: the key holds the URL, and the value holds an int that is 1 if the URL has been scanned or 0 if it has not been processed yet. My code is below. I used a second Dictionary, URL_CanToSave, to hold the URLs that end with "jpg" (my target). The while loop keeps going until all of the website's URLs run out, depending on the filter string variables you use to parse the URLs.
To break the loop earlier, you can specify the number of image URLs to collect in URL_CanToSave.
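For context, the loop below assumes roughly the following fields; this is a minimal sketch, where the seed URL and filter strings are illustrative placeholders, not values from my project:

// Both dictionaries map URL -> scan flag (0 = not yet processed, 1 = scanned).
Dictionary<string, int> URL_Can = new Dictionary<string, int>
{
    { "http://example.com", 0 }    // illustrative seed URL
};
Dictionary<string, int> URL_CanToSave = new Dictionary<string, int>();
WebClient client = new WebClient();     // System.Net
string filterlink1 = "example.com";     // illustrative filter strings
string filterlink2 = "/gallery/";
bool ProcessCompleted;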
return Task.Factory.StartNew(() =>
{
    try
    {
        string tempURL;
        int i = 0;
        // Each dictionary value is 1 (scanned) or 0 (not yet scanned); iterate
        // until every key has been scanned, or break in the middle based on
        // how many image URLs you have collected in the other dictionary.
        while (URL_Can.Values.Any(value => value == 0))
        {
            // take one key and put it in the temp variable
            tempURL = URL_Can.ElementAt(i).Key;

            // check whether it ends with the target file extension,
            // an image file in this case
            if (tempURL.EndsWith("jpg"))
            {
                URL_CanToSave.Add(tempURL, 0);
                URL_Can.Remove(tempURL);
            }
            // if it is not an image, download the page behind the URL
            // and keep analyzing
            else
            {
                // if the URL has not been scanned before
                if (URL_Can[tempURL] != 1)
                {
                    // This may look a little complex: Add2Dic adds the extracted
                    // links to URL_Can without re-adding existing keys (solving
                    // the main problem!), and ExtractUrlsfromLink downloads the
                    // document behind the URL, analyzes it, and returns a
                    // dictionary of all the links it found. You can add or
                    // remove filter strings based on your analysis; sketches of
                    // both helpers follow after the snippet.
                    Add2Dic(ExtractUrlsfromLink(client, tempURL, filterlink1), URL_Can, false);
                    Add2Dic(ExtractUrlsfromLink(client, tempURL, filterlink2), URL_Can, false);
                    URL_Can[tempURL] = 1; // mark the link as scanned
                    richTextBox2.PerformSafely(() => richTextBox2.AppendText(tempURL + "\n"));
                }
            }
            statusStrip1.PerformSafely(() => toolStripProgressBar1.PerformStep());
            // here comes the other trick: wrap the index around so the
            // iteration keeps going until all gathered links are scanned
            i++;
            if (i >= URL_Can.Count) { i = 0; }
            if (URL_CanToSave.Count >= 150) { break; }
        }
        richTextBox2.PerformSafely(() => richTextBox2.Clear());
        textBox1.PerformSafely(() => textBox1.Text = URL_Can.Count.ToString());
        return ProcessCompleted = true;
    }
    catch (Exception aih)
    {
        MessageBox.Show(aih.Message);
        return ProcessCompleted = false;
    }
});
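Add2Dic and ExtractUrlsfromLink are my own helpers, so here is a minimal sketch of what they could look like. The bodies below are illustrative (a simple regex over the downloaded HTML), not my exact implementation, and they assume System.Net and System.Text.RegularExpressions are imported:

// Merge src into dest, skipping any key dest already contains; the bool
// decides whether newly added keys are marked as scanned (1) or not (0).
void Add2Dic(Dictionary<string, int> src, Dictionary<string, int> dest, bool scanned)
{
    foreach (var pair in src)
    {
        if (!dest.ContainsKey(pair.Key))
        {
            dest.Add(pair.Key, scanned ? 1 : 0);
        }
    }
}

// Download the page and return every href that contains the filter
// string, keyed by URL with 0 (not yet scanned) as the value.
Dictionary<string, int> ExtractUrlsfromLink(WebClient client, string url, string filter)
{
    var result = new Dictionary<string, int>();
    string html = client.DownloadString(url);
    foreach (Match m in Regex.Matches(html, "href=\"(?<u>[^\"]+)\""))
    {
        string link = m.Groups["u"].Value;
        if (link.Contains(filter) && !result.ContainsKey(link))
        {
            result.Add(link, 0);
        }
    }
    return result;
}

PerformSafely is the usual invoke-if-required extension for updating WinForms controls from a background task; a minimal sketch, assuming it lives in a static extension class:

// Run the UI update on the control's own thread when called from the task.
public static class ControlExtensions
{
    public static void PerformSafely(this Control control, Action action)
    {
        if (control.InvokeRequired)
            control.Invoke(action);
        else
            action();
    }
}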