HashSet handling to avoid getting stuck in a loop during iteration

Submitted by 折月煮酒 on 2019-12-02 15:08:45

Question


I'm working on an image-mining project, and I used a HashSet instead of an array to avoid adding duplicate URLs while gathering them. I reached the point in the code where I iterate the HashSet that contains the main URLs; within the iteration I download the page behind each main URL, add the URLs found there back into the HashSet, and so on. During the iteration I should exclude every scanned URL, and also exclude (remove) every URL that ends with "jpg", until the HashSet's count reaches 0. The problem is that I run into endless looping in this iteration. Say I get a URL (let's call it X):

1- I scan the page of URL X.
2- I get all the URLs of page X (by applying filters).
3- I add the URLs to the HashSet using UnionWith.
4- I remove the scanned URL X.

The problem comes when one of those URLs, Y, brings back X again when it is scanned.

Shall I use a Dictionary, with the value marking each key as "scanned"? I will try it and post the result here; sorry, it only came to my mind after I posted the question...

I managed to solve it for one URL, but it seems other URLs also generate loops, so how do I handle the HashSet to avoid duplicates even after removing the links? I hope my point is clear.

while (URL_Can.Count != 0)
{
    tempURL = URL_Can.First();

    if (tempURL.EndsWith("jpg"))
    {
        // Image URL: stash it for saving and drop it from the candidate set.
        URL_CanToSave.Add(tempURL);
        URL_Can.Remove(tempURL);
    }
    else
    {
        if (ExtractUrlsfromLink(client, tempURL, filterlink1).Contains(toAvoidLoopinLinks))
        {
            // The page links back to the previously scanned URL: drop both.
            URL_Can.Remove(tempURL);
            URL_Can.Remove(toAvoidLoopinLinks);
        }
        else
        {
            // Merge the page's URLs into the candidate set, then drop the scanned URL.
            URL_Can.UnionWith(ExtractUrlsfromLink(client, tempURL, filterlink1));
            URL_Can.UnionWith(ExtractUrlsfromLink(client, tempURL, filterlink2));
            URL_Can.Remove(tempURL);

            richTextBox2.PerformSafely(() => richTextBox2.AppendText(tempURL + "\n"));
        }
    }

    toAvoidLoopinLinks = tempURL;
}
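
For context on why this loops: toAvoidLoopinLinks remembers only the single most recently scanned URL, so any cycle longer than two pages (X links to Y, Y links to Z, Z links back to X) re-queues URLs that were already removed. A minimal sketch of the usual fix, assuming the same ExtractUrlsfromLink, client, and filter variables as above; the separate visited set is my addition, not code from the question:

    var visited = new HashSet<string>();      // every URL ever scanned (not in the original)

    while (URL_Can.Count != 0)
    {
        string tempURL = URL_Can.First();
        URL_Can.Remove(tempURL);              // always remove before processing

        if (!visited.Add(tempURL))            // Add returns false if already seen
            continue;

        if (tempURL.EndsWith("jpg"))
        {
            URL_CanToSave.Add(tempURL);
        }
        else
        {
            // Only enqueue URLs that were never scanned; a link cycling back to X is ignored.
            foreach (var url in ExtractUrlsfromLink(client, tempURL, filterlink1))
                if (!visited.Contains(url)) URL_Can.Add(url);

            foreach (var url in ExtractUrlsfromLink(client, tempURL, filterlink2))
                if (!visited.Contains(url)) URL_Can.Add(url);
        }
    }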

Answer 1:


Thanks all. I managed to solve this issue by using a Dictionary instead of a HashSet: the key holds the URL, and the value holds an int that is 1 if the URL has been scanned, or 0 if it has not yet been processed. My code is below. I used another Dictionary, URL_CanToSave, to hold the URLs that end with "jpg" (my target). The while loop can keep going until all the URLs of the website run out, based on the values you specify in the filter string variables used to parse the URLs.

To break the loop earlier, you can specify the number of image URLs to collect in URL_CanToSave.

return Task.Factory.StartNew(() =>
{
    try
    {
        string tempURL;
        int i = 0;

        // The Dictionary value marks each key: 1 means scanned, 0 means not yet.
        // Loop until every key is scanned, or break early once enough image
        // URLs have been collected in the other Dictionary.
        while (URL_Can.Values.Any(value => value == 0))
        {
            // Take one key and put it in a temp variable.
            tempURL = URL_Can.ElementAt(i).Key;

            // Check whether it ends with the target file extension (an image here).
            if (tempURL.EndsWith("jpg"))
            {
                // Guard against a jpg URL that was re-discovered on a later page.
                if (!URL_CanToSave.ContainsKey(tempURL))
                {
                    URL_CanToSave.Add(tempURL, 0);
                }

                URL_Can.Remove(tempURL);
            }
            // If it is not an image, download the page behind the URL and keep analyzing.
            else
            {
                // Only process the URL if it has not been scanned before.
                if (URL_Can[tempURL] != 1)
                {
                    // Add2Dic adds to the Dictionary without re-adding existing
                    // keys (solving the main problem!). ExtractUrlsfromLink
                    // downloads the document behind the URL, analyzes it, and
                    // returns a dictionary of the links it finds; add or remove
                    // filter strings depending on your analysis.
                    Add2Dic(ExtractUrlsfromLink(client, tempURL, filterlink1), URL_Can, false);
                    Add2Dic(ExtractUrlsfromLink(client, tempURL, filterlink2), URL_Can, false);

                    URL_Can[tempURL] = 1;  // mark the link as scanned

                    richTextBox2.PerformSafely(() => richTextBox2.AppendText(tempURL + "\n"));
                }
            }

            statusStrip1.PerformSafely(() => toolStripProgressBar1.PerformStep());

            // Wrap the index so the iteration keeps going until all gathered links are scanned.
            i++;
            if (i >= URL_Can.Count) { i = 0; }

            // Stop once enough image URLs have been collected.
            if (URL_CanToSave.Count >= 150) { break; }
        }

        richTextBox2.PerformSafely(() => richTextBox2.Clear());

        textBox1.PerformSafely(() => textBox1.Text = URL_Can.Count.ToString());

        return ProcessCompleted = true;
    }
    catch (Exception aih)
    {
        MessageBox.Show(aih.Message);

        return ProcessCompleted = false;
    }
});
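
The code above calls two helpers that the answer never shows. Below is a sketch of plausible implementations, assuming a WinForms project: Add2Dic merging new keys without touching existing ones (so a URL already marked 1 is never reset to 0), and PerformSafely marshalling a UI update onto the control's thread. Both bodies, and the handling of the bool parameter, are my assumptions inferred from the call sites, not the original author's code:

    using System;
    using System.Collections.Generic;
    using System.Windows.Forms;

    static class CrawlerHelpers
    {
        // Sketch of Add2Dic as the answer describes it: merge entries from src
        // into target, skipping keys that already exist so a scanned URL
        // (value 1) is never reset to 0. The bool parameter appears at the
        // call sites but its meaning isn't shown, so it is ignored here.
        public static void Add2Dic(Dictionary<string, int> src,
                                   Dictionary<string, int> target,
                                   bool overwrite)
        {
            foreach (var pair in src)
            {
                if (!target.ContainsKey(pair.Key))
                    target.Add(pair.Key, pair.Value);
            }
        }

        // Sketch of PerformSafely: a common WinForms extension method that
        // invokes the action on the control's UI thread when called from a
        // background task, and runs it directly otherwise.
        public static void PerformSafely(this Control control, Action action)
        {
            if (control.InvokeRequired)
                control.Invoke(action);
            else
                action();
        }
    }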


Source: https://stackoverflow.com/questions/42175435/hashset-handling-to-avoid-stuck-in-loop-during-iteration
