I'm working on an image mining project, and I used a HashSet instead of an array to avoid adding duplicate URLs while gathering them. I've reached the point in the code where I need to iterate over the HashSet.
Thanks for all the help. I managed to solve this issue by using a Dictionary instead of a HashSet: the key holds the URL, and the value holds an int that is 1 if the URL has been scanned or 0 if it has not been processed yet. My code is below. I used a second Dictionary, URL_CanToSave, to hold the URLs that end with "jpg" (my target). The while loop keeps going until all of the website's URLs run out, depending on the filter string variables you use to parse the URLs.
To break the loop earlier, you can specify the number of image URLs to collect in URL_CanToSave.
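For context, the loop below assumes roughly the following fields; this is a minimal sketch, where the seed URL and filter strings are illustrative placeholders, not values from my project:

// Both dictionaries map URL -> scan flag (0 = not yet processed, 1 = scanned).
Dictionary<string, int> URL_Can = new Dictionary<string, int>
{
    { "http://example.com", 0 }    // illustrative seed URL
};
Dictionary<string, int> URL_CanToSave = new Dictionary<string, int>();
WebClient client = new WebClient();     // System.Net
string filterlink1 = "example.com";     // illustrative filter strings
string filterlink2 = "/gallery/";
bool ProcessCompleted;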
return Task.Factory.StartNew(() =>
{
    try
    {
        string tempURL;
        int i = 0;
        // Each dictionary value is 1 (scanned) or 0 (not yet scanned); iterate
        // until every key has been scanned, or break in the middle based on
        // how many image URLs you have collected in the other dictionary.
        while (URL_Can.Values.Any(value => value == 0))
        {
            // take one key and put it in the temp variable
            tempURL = URL_Can.ElementAt(i).Key;

            // check whether it ends with the target file extension,
            // an image file in this case
            if (tempURL.EndsWith("jpg"))
            {
                URL_CanToSave.Add(tempURL, 0);
                URL_Can.Remove(tempURL);
            }
            // if it is not an image, download the page behind the URL
            // and keep analyzing
            else
            {
                // if the URL has not been scanned before
                if (URL_Can[tempURL] != 1)
                {
                    // This may look a little complex: Add2Dic adds the extracted
                    // links to URL_Can without re-adding existing keys (solving
                    // the main problem!), and ExtractUrlsfromLink downloads the
                    // document behind the URL, analyzes it, and returns a
                    // dictionary of all the links it found. You can add or
                    // remove filter strings based on your analysis; sketches of
                    // both helpers follow after the snippet.
                    Add2Dic(ExtractUrlsfromLink(client, tempURL, filterlink1), URL_Can, false);
                    Add2Dic(ExtractUrlsfromLink(client, tempURL, filterlink2), URL_Can, false);
                    URL_Can[tempURL] = 1; // mark the link as scanned
                    richTextBox2.PerformSafely(() => richTextBox2.AppendText(tempURL + "\n"));
                }
            }
            statusStrip1.PerformSafely(() => toolStripProgressBar1.PerformStep());
            // here comes the other trick: wrap the index around so the
            // iteration keeps going until all gathered links are scanned
            i++;
            if (i >= URL_Can.Count) { i = 0; }
            if (URL_CanToSave.Count >= 150) { break; }
        }
        richTextBox2.PerformSafely(() => richTextBox2.Clear());
        textBox1.PerformSafely(() => textBox1.Text = URL_Can.Count.ToString());
        return ProcessCompleted = true;
    }
    catch (Exception aih)
    {
        MessageBox.Show(aih.Message);
        return ProcessCompleted = false;
    }
});
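Add2Dic and ExtractUrlsfromLink are my own helpers, so here is a minimal sketch of what they could look like. The bodies below are illustrative (a simple regex over the downloaded HTML), not my exact implementation, and they assume System.Net and System.Text.RegularExpressions are imported:

// Merge src into dest, skipping any key dest already contains; the bool
// decides whether newly added keys are marked as scanned (1) or not (0).
void Add2Dic(Dictionary<string, int> src, Dictionary<string, int> dest, bool scanned)
{
    foreach (var pair in src)
    {
        if (!dest.ContainsKey(pair.Key))
        {
            dest.Add(pair.Key, scanned ? 1 : 0);
        }
    }
}

// Download the page and return every href that contains the filter
// string, keyed by URL with 0 (not yet scanned) as the value.
Dictionary<string, int> ExtractUrlsfromLink(WebClient client, string url, string filter)
{
    var result = new Dictionary<string, int>();
    string html = client.DownloadString(url);
    foreach (Match m in Regex.Matches(html, "href=\"(?<u>[^\"]+)\""))
    {
        string link = m.Groups["u"].Value;
        if (link.Contains(filter) && !result.ContainsKey(link))
        {
            result.Add(link, 0);
        }
    }
    return result;
}

PerformSafely is the usual invoke-if-required extension for updating WinForms controls from a background task; a minimal sketch, assuming it lives in a static extension class:

// Run the UI update on the control's own thread when called from the task.
public static class ControlExtensions
{
    public static void PerformSafely(this Control control, Action action)
    {
        if (control.InvokeRequired)
            control.Invoke(action);
        else
            action();
    }
}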