C# fastest intersection of 2 sets of sorted numbers

会有一股神秘感。 提交于 2019-11-30 02:12:45

If you have two sets which are both sorted, you can implement a faster intersection than anything provided out of the box with LINQ.

Basically, keep two IEnumerator<T> cursors open, one for each set. At any point, advance whichever has the smaller value. If they match at any point, advance them both, and so on until you reach the end of either iterator.

The nice thing about this is that you only need to iterate over each set once, and you can do it in O(1) memory.

Here's a sample implementation - untested, but it does compile :) It assumes that both of the incoming sequences are duplicate-free and sorted, both according to the comparer provided (pass in Comparer<T>.Default):

(There's more text at the end of the answer!)

static IEnumerable<T> IntersectSorted<T>(this IEnumerable<T> sequence1,
    IEnumerable<T> sequence2,
    IComparer<T> comparer)
{
    using (var cursor1 = sequence1.GetEnumerator())
    using (var cursor2 = sequence2.GetEnumerator())
    {
        if (!cursor1.MoveNext() || !cursor2.MoveNext())
        {
            yield break;
        }
        var value1 = cursor1.Current;
        var value2 = cursor2.Current;

        while (true)
        {
            int comparison = comparer.Compare(value1, value2);
            if (comparison < 0)
            {
                if (!cursor1.MoveNext())
                {
                    yield break;
                }
                value1 = cursor1.Current;
            }
            else if (comparison > 0)
            {
                if (!cursor2.MoveNext())
                {
                    yield break;
                }
                value2 = cursor2.Current;
            }
            else
            {
                yield return value1;
                if (!cursor1.MoveNext() || !cursor2.MoveNext())
                {
                    yield break;
                }
                value1 = cursor1.Current;
                value2 = cursor2.Current;
            }
        }
    }
}

EDIT: As noted in comments, in some cases you may have one input which is much larger than the other, in which case you could potentially save a lot of time using a binary search for each element from the smaller set within the larger set. This requires random access to the larger set, however (it's just a prerequisite of binary search). You can even make it slightly better than a naive binary search by using the match from the previous result to give a lower bound to the binary search. So suppose you were looking for values 1000, 2000 and 3000 in a set with every integer from 0 to 19,999. In the first iteration, you'd need to look across the whole set - your starting lower/upper indexes would be 0 and 19,999 respectively. After you'd found a match at index 1000, however, the next step (where you're looking for 2000) can start with a lower index of 2000. As you progress, the range in which you need to search gradually narrows. Whether or not this is worth the extra implementation cost or not is a different matter, however.

Since both lists are sorted, you can arrive at the solution by iterating over them at most once (you may also get to skip part of one list, depending on the actual values they contain).

This solution keeps a "pointer" to the part of list we have not yet examined, and compares the first not-examined number of each list between them. If one is smaller than the other, the pointer to the list it belongs to is incremented to point to the next number. If they are equal, the number is added to the intersection result and both pointers are incremented.

var firstCount = firstSet.Count;
var secondCount = secondSet.Count;
int firstIndex = 0, secondIndex = 0;
var intersection = new List<int>();

while (firstIndex < firstCount && secondIndex < secondCount)
{
    var comp = firstSet[firstIndex].CompareTo(secondSet[secondIndex]);
    if (comp < 0) {
        ++firstIndex;
    }
    else if (comp > 0) {
        ++secondIndex;
    }
    else {
        intersection.Add(firstSet[firstIndex]);
        ++firstIndex;
        ++secondIndex;
    }
}

The above is a textbook C-style approach of solving this particular problem, and given the simplicity of the code I would be surprised to see a faster solution.

You're using a rather inefficient Linq method for this sort of task, you should opt for Intersect as a starting point.

var intersection = firstSet.Intersect(secondSet);

Try this. If you measure it for performance and still find it unwieldy, cry for further help (or perhaps follow Jon Skeet's approach).

I was using Jon's approach but needed to execute this intersect hundreds of thousands of times for a bulk operation on very large sets and needed more performance. The case I was running in to was heavily imbalanced sizes of the lists (eg 5 and 80,000) and wanted to avoid iterating the entire large list.

I found that detecting the imbalance and changing to an alternate algorithm gave me huge benifits over specific data sets:

public static IEnumerable<T> IntersectSorted<T>(this List<T> sequence1,
        List<T> sequence2,
        IComparer<T> comparer)
{
    List<T> smallList = null;
    List<T> largeList = null;

    if (sequence1.Count() < Math.Log(sequence2.Count(), 2))
    {
        smallList = sequence1;
        largeList = sequence2;
    }
    else if (sequence2.Count() < Math.Log(sequence1.Count(), 2))
    {
        smallList = sequence2;
        largeList = sequence1;
    }

    if (smallList != null)
    {
        foreach (var item in smallList)
        {
            if (largeList.BinarySearch(item, comparer) >= 0)
            {
                yield return item;
            }
        }
    }
    else
    {
        //Use Jon's method
    }
}

I am still unsure about the point at which you break even, need to do some more testing

Yahia

try

firstSet.InterSect (secondSet).ToList ()

or

firstSet.Join(secondSet, o => o, id => id, (o, id) => o)

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!