Best way to take the intersection of more than two HashSets in C# when we do not know beforehand how many HashSets there are

Question

I am building a Boolean retrieval system for a large number of documents. I have a dictionary of hash sets: the keys are the terms, and each hash set contains the IDs of the documents in which that term was found. To search for a single word, I look the word up in the dictionary and print the corresponding hash set. I also want to search for sentences. In that case I split the query into individual words and look each one up, which returns one hash set per word found. I then want to intersect these hash sets so that I return only the IDs of documents that contain every word in the query. What is the best way to take the intersection of these hash sets?

Currently I put the hash sets into a list and intersect them two at a time: I intersect the first two, then intersect that result with the third, and so on.

This is the code:

Dictionary<string, HashSet<string>> dt = new Dictionary<string, HashSet<string>>(); // assume it is filled with data...

while (true)
{
    Console.WriteLine("\n\n\nEnter the query you want to search");
    string inp = Console.ReadLine();
    string[] words = inp.Split(new Char[] { ' ', ',', '.', ':', '?', '!', '\t' });

    List<HashSet<string>> outparr = new List<HashSet<string>>();
    foreach (string w in words)
    {
        HashSet<string> outp = new HashSet<string>();
        if (dt.TryGetValue(w, out outp))
        {
            outparr.Add(outp);
            Console.WriteLine("Found {0} documents.", outp.Count);
            foreach (string s in outp)
            {
                Console.WriteLine(s);
            }
        }
    }

    HashSet<string> temp = outparr.First();
    foreach (HashSet<string> hs in outparr)
    {
        temp = new HashSet<string>(temp.Intersect(hs));
    }

    Console.WriteLine("Output After Intersection:");
    Console.WriteLine("Found {0} documents: ", temp.Count);
    foreach (string s in temp)
    {
        Console.WriteLine(s);
    }
}

Answer 1:

IntersectWith is a good approach. Like this:

HashSet<string> res = null;
HashSet<string> outdictionary = null;
foreach (string w in words)
{
    if (dt.TryGetValue(w, out outdictionary))
    {
        if (res == null)
        {
            // Copy the first set so that IntersectWith does not modify the set stored in dt.
            res = new HashSet<string>(outdictionary, outdictionary.Comparer);
        }
        else
        {
            if (res.Count == 0)
                break; // the intersection is already empty
            res.IntersectWith(outdictionary);
        }
    }
}
if (res == null) res = new HashSet<string>(); // none of the words was found
Console.WriteLine("Output After Intersection:");
Console.WriteLine("Found {0} documents: ", res.Count);
foreach (string s in res)
{
    Console.WriteLine(s);
}
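
The copy in the first branch matters because IntersectWith changes the set it is called on. A minimal sketch of that behaviour follows; the helper name IntersectCopy and the sample index entries are made up for illustration and are not part of the answer above.

```csharp
using System;
using System.Collections.Generic;

static class IntersectWithDemo
{
    // Hypothetical helper: intersects a copy of `first` with `second`,
    // leaving both input sets untouched.
    public static HashSet<string> IntersectCopy(HashSet<string> first, HashSet<string> second)
    {
        // Copy first (keeping its comparer) so IntersectWith mutates only the copy.
        var result = new HashSet<string>(first, first.Comparer);
        result.IntersectWith(second);
        return result;
    }

    static void Main()
    {
        // Made-up index entries, for illustration only.
        var index = new Dictionary<string, HashSet<string>>
        {
            { "red", new HashSet<string> { "doc1", "doc2", "doc3" } },
            { "fox", new HashSet<string> { "doc2", "doc3", "doc4" } }
        };

        HashSet<string> hits = IntersectCopy(index["red"], index["fox"]);

        Console.WriteLine(hits.Count);         // 2 (doc2 and doc3)
        Console.WriteLine(index["red"].Count); // 3 (the index entry is unchanged)
    }
}
```

Calling index["red"].IntersectWith(index["fox"]) directly would instead shrink the set stored in the index, which would corrupt later searches.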

Answer 2:

The principle that you are using is sound, but you can tweak it a bit.

By sorting the hash sets by size, you can start with the smallest one, which minimises the number of comparisons.

Instead of using the IEnumerable<T>.Intersect extension method, you can do the same thing in a loop that takes advantage of the fact that you already have hash sets. Checking whether a value exists in a hash set is very fast, so you can just loop through the items in the smallest set, look for matching values in the next set, and put the matches in a new set.

In the loop you can skip the first item as you start with that. You don't need to intersect it with itself.

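// Smallest set first, so every pass iterates over as few items as possible.
outparr = outparr.OrderBy(o => o.Count).ToList();

HashSet<string> combined = outparr[0];
foreach (HashSet<string> hs in outparr.Skip(1)) {
  // Keep only the items of the running result that also occur in the next set.
  HashSet<string> temp = new HashSet<string>();
  foreach (string s in combined) {
    if (hs.Contains(s)) {
      temp.Add(s);
    }
  }
  combined = temp;
}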
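
The ideas from the answers so far (smallest set first, in-place intersection on a copy, stopping once the result is empty) can be combined into one helper. This is only a sketch: the name IntersectAll, the empty-input behaviour and the sample data are assumptions, not something taken from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SetIntersection
{
    // Hypothetical helper: intersects any number of sets, smallest first.
    public static HashSet<string> IntersectAll(IEnumerable<HashSet<string>> sets)
    {
        List<HashSet<string>> ordered = sets.OrderBy(s => s.Count).ToList();
        if (ordered.Count == 0)
            return new HashSet<string>(); // assumption: no sets means no results

        // Copy the smallest set so the caller's sets are never modified.
        var result = new HashSet<string>(ordered[0], ordered[0].Comparer);
        foreach (HashSet<string> other in ordered.Skip(1))
        {
            if (result.Count == 0)
                break; // already empty, further intersections cannot add items
            result.IntersectWith(other);
        }
        return result;
    }

    static void Main()
    {
        // Made-up sets of document ids, for illustration only.
        var sets = new List<HashSet<string>>
        {
            new HashSet<string> { "doc1", "doc2", "doc3", "doc4" },
            new HashSet<string> { "doc2", "doc3" },
            new HashSet<string> { "doc3", "doc4", "doc5" }
        };

        foreach (string id in IntersectAll(sets))
            Console.WriteLine(id); // doc3
    }
}
```

Returning an empty set for an empty input also avoids the exception that outparr[0] or outparr.First() would throw when none of the query words is in the dictionary.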

Answer 3:

To answer your question: the intersection can become empty after a few iterations, for example when one set of documents contains words a, b and c while another contains only other words from your query. You can check for this and break out of the foreach early.

Now, IMHO it doesn't make sense to do that intersection, because usually a search result should contain multiple files ordered by relevance. That is also much easier, because you already have a list of files for each word. From the hash sets obtained for each word, count the occurrences of each file ID and return a limited number of IDs, ordered descending by the number of occurrences.
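
A rough sketch of that counting approach is shown here. The helper name RankByMatches, the limit parameter and the alphabetical tie-break are assumptions added for illustration; the answer itself only describes counting and ordering.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class Ranking
{
    // Hypothetical helper: ranks document ids by how many query terms they match.
    public static List<KeyValuePair<string, int>> RankByMatches(
        IEnumerable<HashSet<string>> sets, int limit)
    {
        var counts = new Dictionary<string, int>();
        foreach (HashSet<string> set in sets)
        {
            foreach (string id in set)
            {
                int n;
                counts.TryGetValue(id, out n); // n is 0 when the id has not been seen yet
                counts[id] = n + 1;
            }
        }

        return counts
            .OrderByDescending(p => p.Value)
            .ThenBy(p => p.Key, StringComparer.Ordinal) // assumption: ties broken by id
            .Take(limit)
            .ToList();
    }

    static void Main()
    {
        // Made-up per-term sets of document ids, for illustration only.
        var sets = new List<HashSet<string>>
        {
            new HashSet<string> { "doc1", "doc2" },
            new HashSet<string> { "doc2", "doc3" },
            new HashSet<string> { "doc2", "doc3" }
        };

        foreach (KeyValuePair<string, int> hit in RankByMatches(sets, 2))
            Console.WriteLine("{0}: {1}", hit.Key, hit.Value);
        // doc2: 3
        // doc3: 2
    }
}
```

A document that matches every term gets a count equal to the number of sets, so the strict intersection can still be recovered by keeping only the entries with that count.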

Source: https://stackoverflow.com/questions/28766387/best-way-to-take-an-intersection-of-more-than-two-hashsets-in-c-when-we-donot

Tags

c#-4.0

dictionary

hashset