Fast count() intersection of two string arrays

生来就可爱ヽ(ⅴ<●) 提交于 2020-01-02 03:17:34

问题


I need to count the number of elements corresponding to the intersection of two big arrays of strings and do it very fast.

I am using the following code:

arr1[i].Intersect(arr2[j]).Count()

For CPU Time, VS Profiler indicates

  • 85.1% in System.Linq.Enumerable.Count()
  • 0.3% in System.Linq.Enumerable.Intersect()

Unfortunately it might to take hours to do all work.

How to do it faster?


回答1:


You can use HashSet with arr2

HashSet<string> arr2Set = new HashSet<string>(arr2);
arr1.Where(x=>arr2Set.Contains(x)).Count();
              ------------------
                      |
                      |->HashSet's contains method executes quickly using hash-based lookup..

Not considering the conversion from arr2 to arr2Set ,this should be O(n)




回答2:


I suspect the reason why the profiler shows the time being consumed in Count, is that this is where the collection is actually enumerated (the Intersect is lazily evaluated and does not run before you need the result).

I believe Intersect should have some internal optimizations to make this reasonably fast, but you could try using a HashSet<string> so you are sure the intersect can be made without searching through the inner array for each element:

HashSet<string> set = new HashSet<string>(arr1);
set.IntersectWith(arr2);
int count = set.Count;



回答3:


Hmmm Intersect is probably N^2

to make it faster quicksort both arrays. and than traverse both arrays. counting intersections.

too lazy to test how fast it would be but should O(nlogn +n)

    using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace Test
{
    class Program
    {
        static void Main(string[] args)
        {
            const int arrsize = 1000000;
            Random rnd = new Random(42);
            string[] arr1 = new string[arrsize];
            string[] arr2 = new string[arrsize];
            for (int i = 0; i < arrsize; i++)
            {
                arr1[i] = rnd.Next().ToString();
                arr2[i] = rnd.Next().ToString();
            }
            {
                var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
                arr1.Intersect(arr2).Count();
                Console.WriteLine("array" + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
            }

        {

            HashSet<string> set = new HashSet<string>(arr1);
            var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
            set.IntersectWith(arr2);
            int count = set.Count;
            Console.WriteLine("HashSet" + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
        }
            {
               var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
                HashSet<string> set = new HashSet<string>(arr1);
                set.IntersectWith(arr2);
                int count = set.Count;
                Console.WriteLine("HashSet + new" + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
            }

            {
                var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
                SortedSet<string> set = new SortedSet<string>(arr1);
                set.IntersectWith(arr2);
                int count = set.Count;
                Console.WriteLine("SortedSet +new " + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
            }

            {

                SortedSet<string> set = new SortedSet<string>(arr1);
                var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
                set.IntersectWith(arr2);
                int count = set.Count;
                Console.WriteLine("SortedSet without new " + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
            }
        }
    }
}

results

array 914,637

HashSet 816,119

HashSet +new 1,150,978

SortedSet +new 16,173,836

SortedSet without new 7,946,709

so seems that best way is to keep a ready hash set.




回答4:


when you are working with sets, your complexity will be O((n log n)*(m log m)) or so,

i think this here should be faster, but i'm not sure if it is now O((n log n)+(m log m))

possible would be 
var Set1 = arr1[i].Distinct().ToArray(); // if necessary, if arr1 or arr2 could be not distinct
var Set2 = arr2[j].Distinct().ToArray();  

nCount = Set1.Count() + Set2.Count() - Set1.Append(Set2).Distinct().Count();



回答5:


Build a HashSet using the smaller array and then loop through the bigger one, incrementing a counter if the item it exists in the hashset.



来源:https://stackoverflow.com/questions/13900514/fast-count-intersection-of-two-string-arrays

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!