Efficiently merge string arrays in .NET, keeping distinct values

迷失自我 2021-02-02 06:38

I'm using .NET 3.5. I have two string arrays, which may share one or more values:

string[] list1 = new string[] { "apple", "orange", "banana" };
string[]         


        
6 Answers
  •  小鲜肉 (original poster)
    2021-02-02 06:49

    Disclaimer: This is premature optimization. For your example arrays, use the 3.5 extension methods. Until you know you have a performance problem in this area, stick with library code.
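
    For reference, the library route might look like the minimal sketch below. It assumes that "the 3.5 extension methods" refers to LINQ's Enumerable.Union; the names UnionExample and MergeDistinct are made up for illustration.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class UnionExample
    {
        // Hypothetical helper name; it only wraps the built-in Enumerable.Union.
        public static string[] MergeDistinct(string[] first, string[] second)
        {
            // Union yields each distinct element once, in order of first
            // appearance: elements of 'first', then new elements of 'second'.
            return first.Union(second).ToArray();
        }

        public static void Main()
        {
            string[] list1 = { "apple", "orange", "apple", "banana" };
            string[] list2 = { "banana", "pear", "grape" };

            foreach (string item in MergeDistinct(list1, list2))
                Console.WriteLine(item);
            // prints apple, orange, banana, pear, grape (one per line)
        }
    }
    ```

    Union does not need sorted input, but it does keep a set of the elements it has seen so far, which is the bookkeeping the sorted-input methods avoid.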


    If you can sort the arrays, or they're sorted when you get to that point in the code, you can use the following methods.

    These pull one item from each source and yield the "lowest" of the two, then fetch the next item from the source it came from, and repeat until both sources are exhausted. When the current items from the two sources compare equal, the one from the first source is yielded and both sources advance past it.

    private static IEnumerable<T> Merge<T>(IEnumerable<T> source1,
        IEnumerable<T> source2)
    {
        return Merge(source1, source2, Comparer<T>.Default);
    }
    
    private static IEnumerable<T> Merge<T>(IEnumerable<T> source1,
        IEnumerable<T> source2, IComparer<T> comparer)
    {
        #region Parameter Validation
    
        // Note: this is an iterator method, so these checks run when
        // enumeration starts, not when Merge is called.
        if (Object.ReferenceEquals(null, source1))
            throw new ArgumentNullException("source1");
        if (Object.ReferenceEquals(null, source2))
            throw new ArgumentNullException("source2");
        if (Object.ReferenceEquals(null, comparer))
            throw new ArgumentNullException("comparer");
    
        #endregion
    
        using (IEnumerator<T>
            enumerator1 = source1.GetEnumerator(),
            enumerator2 = source2.GetEnumerator())
        {
            Boolean more1 = enumerator1.MoveNext();
            Boolean more2 = enumerator2.MoveNext();
    
            while (more1 && more2)
            {
                Int32 comparisonResult = comparer.Compare(
                    enumerator1.Current,
                    enumerator2.Current);
                if (comparisonResult < 0)
                {
                    // enumerator 1 has the "lowest" item
                    yield return enumerator1.Current;
                    more1 = enumerator1.MoveNext();
                }
                else if (comparisonResult > 0)
                {
                    // enumerator 2 has the "lowest" item
                    yield return enumerator2.Current;
                    more2 = enumerator2.MoveNext();
                }
                else
                {
                    // they're considered equivalent, only yield it once
                    yield return enumerator1.Current;
                    more1 = enumerator1.MoveNext();
                    more2 = enumerator2.MoveNext();
                }
            }
    
            // Yield rest of values from non-exhausted source
            while (more1)
            {
                yield return enumerator1.Current;
                more1 = enumerator1.MoveNext();
            }
            while (more2)
            {
                yield return enumerator2.Current;
                more2 = enumerator2.MoveNext();
            }
        }
    }
    

    Note that Merge only collapses values that appear in both sources. If either source contains duplicates of its own, they can still show up in the output. To remove such duplicates from an already sorted list, run each input through the following method before merging:

    private static IEnumerable<T> CheapDistinct<T>(IEnumerable<T> source)
    {
        return CheapDistinct(source, Comparer<T>.Default);
    }
    
    private static IEnumerable<T> CheapDistinct<T>(IEnumerable<T> source,
        IComparer<T> comparer)
    {
        #region Parameter Validation
    
        if (Object.ReferenceEquals(null, source))
            throw new ArgumentNullException("source");
        if (Object.ReferenceEquals(null, comparer))
            throw new ArgumentNullException("comparer");
    
        #endregion
    
        using (IEnumerator<T> enumerator = source.GetEnumerator())
        {
            if (enumerator.MoveNext())
            {
                T item = enumerator.Current;
    
                // scan until different item found, then produce
                // the previous distinct item
                while (enumerator.MoveNext())
                {
                    if (comparer.Compare(item, enumerator.Current) != 0)
                    {
                        yield return item;
                        item = enumerator.Current;
                    }
                }
    
                // produce last item that is left over from above loop
                yield return item;
            }
        }
    }
    

    Note that neither method keeps a copy of the data in an internal data structure, so both are cheap, provided the input is sorted. If you can't, or won't, guarantee that, you should use the 3.5 extension methods that you've already found.
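
    If you are unsure whether an input really is sorted, a check along these lines could be run first. This is only a sketch: IsSorted and SortedCheck are hypothetical names, not part of the framework, and the check costs one extra pass over the data. It should be given the same comparer that was used to sort the data and that is passed to Merge.

    ```csharp
    using System;
    using System.Collections.Generic;

    public static class SortedCheck
    {
        // Hypothetical helper: returns true when no item compares lower than
        // its predecessor under the given comparer. Enumerates the source once.
        public static bool IsSorted<T>(IEnumerable<T> source, IComparer<T> comparer)
        {
            if (source == null) throw new ArgumentNullException("source");
            if (comparer == null) throw new ArgumentNullException("comparer");

            using (IEnumerator<T> e = source.GetEnumerator())
            {
                if (!e.MoveNext())
                    return true; // an empty sequence is trivially sorted

                T previous = e.Current;
                while (e.MoveNext())
                {
                    if (comparer.Compare(previous, e.Current) > 0)
                        return false; // found an item out of order
                    previous = e.Current;
                }
                return true;
            }
        }

        public static void Main()
        {
            string[] fruits = { "apple", "orange", "banana" };
            Console.WriteLine(IsSorted<string>(fruits, StringComparer.Ordinal)); // False

            Array.Sort<string>(fruits, StringComparer.Ordinal);
            Console.WriteLine(IsSorted<string>(fruits, StringComparer.Ordinal)); // True
        }
    }
    ```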

    Here's example code that calls the above methods:

    String[] list_1 = { "apple", "orange", "apple", "banana" };
    String[] list_2 = { "banana", "pear", "grape" };
    
    Array.Sort(list_1);
    Array.Sort(list_2);
    
    IEnumerable<String> items = Merge(
        CheapDistinct(list_1),
        CheapDistinct(list_2));
    foreach (String item in items)
        Console.Out.WriteLine(item);
    
