问题
I need to count the number of elements corresponding to the intersection of two big arrays of strings and do it very fast.
I am using the following code:
arr1[i].Intersect(arr2[j]).Count()
For CPU Time, VS Profiler indicates
- 85.1% in
System.Linq.Enumerable.Count()
- 0.3% in
System.Linq.Enumerable.Intersect()
Unfortunately it might to take hours to do all work.
How to do it faster?
回答1:
You can use HashSet
with arr2
HashSet<string> arr2Set = new HashSet<string>(arr2);
arr1.Where(x=>arr2Set.Contains(x)).Count();
------------------
|
|->HashSet's contains method executes quickly using hash-based lookup..
Not considering the conversion from arr2
to arr2Set
,this should be O(n)
回答2:
I suspect the reason why the profiler shows the time being consumed in Count
, is that this is where the collection is actually enumerated (the Intersect
is lazily evaluated and does not run before you need the result).
I believe Intersect
should have some internal optimizations to make this reasonably fast, but you could try using a HashSet<string>
so you are sure the intersect can be made without searching through the inner array for each element:
HashSet<string> set = new HashSet<string>(arr1);
set.IntersectWith(arr2);
int count = set.Count;
回答3:
Hmmm Intersect is probably N^2
to make it faster quicksort both arrays. and than traverse both arrays. counting intersections.
too lazy to test how fast it would be but should O(nlogn +n)
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Test
{
class Program
{
static void Main(string[] args)
{
const int arrsize = 1000000;
Random rnd = new Random(42);
string[] arr1 = new string[arrsize];
string[] arr2 = new string[arrsize];
for (int i = 0; i < arrsize; i++)
{
arr1[i] = rnd.Next().ToString();
arr2[i] = rnd.Next().ToString();
}
{
var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
arr1.Intersect(arr2).Count();
Console.WriteLine("array" + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
}
{
HashSet<string> set = new HashSet<string>(arr1);
var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
set.IntersectWith(arr2);
int count = set.Count;
Console.WriteLine("HashSet" + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
}
{
var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
HashSet<string> set = new HashSet<string>(arr1);
set.IntersectWith(arr2);
int count = set.Count;
Console.WriteLine("HashSet + new" + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
}
{
var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
SortedSet<string> set = new SortedSet<string>(arr1);
set.IntersectWith(arr2);
int count = set.Count;
Console.WriteLine("SortedSet +new " + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
}
{
SortedSet<string> set = new SortedSet<string>(arr1);
var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
set.IntersectWith(arr2);
int count = set.Count;
Console.WriteLine("SortedSet without new " + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
}
}
}
}
results
array 914,637
HashSet 816,119
HashSet +new 1,150,978
SortedSet +new 16,173,836
SortedSet without new 7,946,709
so seems that best way is to keep a ready hash set.
回答4:
when you are working with sets, your complexity will be O((n log n)*(m log m)) or so,
i think this here should be faster, but i'm not sure if it is now O((n log n)+(m log m))
possible would be
var Set1 = arr1[i].Distinct().ToArray(); // if necessary, if arr1 or arr2 could be not distinct
var Set2 = arr2[j].Distinct().ToArray();
nCount = Set1.Count() + Set2.Count() - Set1.Append(Set2).Distinct().Count();
回答5:
Build a HashSet using the smaller array and then loop through the bigger one, incrementing a counter if the item it exists in the hashset.
来源:https://stackoverflow.com/questions/13900514/fast-count-intersection-of-two-string-arrays