Efficiently merge string arrays in .NET, keeping distinct values

前端 未结 6 1623
迷失自我
迷失自我 2021-02-02 06:38

I\'m using .NET 3.5. I have two string arrays, which may share one or more values:

string[] list1 = new string[] { \"apple\", \"orange\", \"banana\" };
string[]         


        
相关标签:
6条回答
  • 2021-02-02 06:48

    Probably creating a hashtable with your values as keys (only adding those not already present) and then converting the keys to an array could be a viable solution.

    0 讨论(0)
  • 2021-02-02 06:49

    Disclaimer This is premature optimization. For your example arrays, use the 3.5 extension methods. Until you know you have a performance problem in this region, you should use library code.


    If you can sort the arrays, or they're sorted when you get to that point in the code, you can use the following methods.

    These will pull one item from both, and produce the "lowest" item, then fetch a new item from the corresponding source, until both sources are exhausted. In the case where the current item fetched from the two sources are equal, it will produce the one from the first source, and skip them in both sources.

    private static IEnumerable<T> Merge<T>(IEnumerable<T> source1,
        IEnumerable<T> source2)
    {
        return Merge(source1, source2, Comparer<T>.Default);
    }
    
    private static IEnumerable<T> Merge<T>(IEnumerable<T> source1,
        IEnumerable<T> source2, IComparer<T> comparer)
    {
        #region Parameter Validation
    
        if (Object.ReferenceEquals(null, source1))
            throw new ArgumentNullException("source1");
        if (Object.ReferenceEquals(null, source2))
            throw new ArgumentNullException("source2");
        if (Object.ReferenceEquals(null, comparer))
            throw new ArgumentNullException("comparer");
    
        #endregion
    
        using (IEnumerator<T>
            enumerator1 = source1.GetEnumerator(),
            enumerator2 = source2.GetEnumerator())
        {
            Boolean more1 = enumerator1.MoveNext();
            Boolean more2 = enumerator2.MoveNext();
    
            while (more1 && more2)
            {
                Int32 comparisonResult = comparer.Compare(
                    enumerator1.Current,
                    enumerator2.Current);
                if (comparisonResult < 0)
                {
                    // enumerator 1 has the "lowest" item
                    yield return enumerator1.Current;
                    more1 = enumerator1.MoveNext();
                }
                else if (comparisonResult > 0)
                {
                    // enumerator 2 has the "lowest" item
                    yield return enumerator2.Current;
                    more2 = enumerator2.MoveNext();
                }
                else
                {
                    // they're considered equivalent, only yield it once
                    yield return enumerator1.Current;
                    more1 = enumerator1.MoveNext();
                    more2 = enumerator2.MoveNext();
                }
            }
    
            // Yield rest of values from non-exhausted source
            while (more1)
            {
                yield return enumerator1.Current;
                more1 = enumerator1.MoveNext();
            }
            while (more2)
            {
                yield return enumerator2.Current;
                more2 = enumerator2.MoveNext();
            }
        }
    }
    

    Note that if one of the sources contains duplicates, you might see duplicates in the output. If you want to remove these duplicates in the already sorted lists, use the following method:

    private static IEnumerable<T> CheapDistinct<T>(IEnumerable<T> source)
    {
        return CheapDistinct<T>(source, Comparer<T>.Default);
    }
    
    private static IEnumerable<T> CheapDistinct<T>(IEnumerable<T> source,
        IComparer<T> comparer)
    {
        #region Parameter Validation
    
        if (Object.ReferenceEquals(null, source))
            throw new ArgumentNullException("source");
        if (Object.ReferenceEquals(null, comparer))
            throw new ArgumentNullException("comparer");
    
        #endregion
    
        using (IEnumerator<T> enumerator = source.GetEnumerator())
        {
            if (enumerator.MoveNext())
            {
                T item = enumerator.Current;
    
                // scan until different item found, then produce
                // the previous distinct item
                while (enumerator.MoveNext())
                {
                    if (comparer.Compare(item, enumerator.Current) != 0)
                    {
                        yield return item;
                        item = enumerator.Current;
                    }
                }
    
                // produce last item that is left over from above loop
                yield return item;
            }
        }
    }
    

    Note that none of these will internally use a data structure to keep a copy of the data, so they will be cheap if the input is sorted. If you can't, or won't, guarantee that, you should use the 3.5 extension methods that you've already found.

    Here's example code that calls the above methods:

    String[] list_1 = { "apple", "orange", "apple", "banana" };
    String[] list_2 = { "banana", "pear", "grape" };
    
    Array.Sort(list_1);
    Array.Sort(list_2);
    
    IEnumerable<String> items = Merge(
        CheapDistinct(list_1),
        CheapDistinct(list_2));
    foreach (String item in items)
        Console.Out.WriteLine(item);
    
    0 讨论(0)
  • 2021-02-02 06:56

    .NET 3.5 introduced the HashSet class which could do this:

    IEnumerable<string> mergedDistinctList = new HashSet<string>(list1).Union(list2);
    

    Not sure of performance, but it should beat the Linq example you gave.

    EDIT: I stand corrected. The lazy implementation of Concat and Distinct have a key memory AND speed advantage. Concat/Distinct is about 10% faster, and saves multiple copies of data.

    I confirmed through code:

    Setting up arrays of 3000000 strings overlapping by 300000
    Starting Hashset...
    HashSet: 00:00:02.8237616
    Starting Concat/Distinct...
    Concat/Distinct: 00:00:02.5629681
    

    is the output of:

            int num = 3000000;
            int num10Pct = (int)(num / 10);
    
            Console.WriteLine(String.Format("Setting up arrays of {0} strings overlapping by {1}", num, num10Pct));
            string[] list1 = Enumerable.Range(1, num).Select((a) => a.ToString()).ToArray();
            string[] list2 = Enumerable.Range(num - num10Pct, num + num10Pct).Select((a) => a.ToString()).ToArray();
    
            Console.WriteLine("Starting Hashset...");
            Stopwatch sw = new Stopwatch();
            sw.Start();
            string[] merged = new HashSet<string>(list1).Union(list2).ToArray();
            sw.Stop();
            Console.WriteLine("HashSet: " + sw.Elapsed);
    
            Console.WriteLine("Starting Concat/Distinct...");
            sw.Reset();
            sw.Start();
            string[] merged2 = list1.Concat(list2).Distinct().ToArray();
            sw.Stop();
            Console.WriteLine("Concat/Distinct: " + sw.Elapsed);
    
    0 讨论(0)
  • 2021-02-02 07:01

    Why do you imagine that it would be inefficient? As far as I'm aware, both Concat and Distinct are evaluated lazily, using a HashSet behind the scenes for Distinct to keep track of the elements which have already been returned.

    I'm not sure how you'd manage to make it more efficient than that in a general way :)

    EDIT: Distinct actually uses Set (an internal class) instead of HashSet, but the gist is still correct. This is a really good example of just how neat LINQ is. The simplest answer is pretty much as efficient as you can achieve without more domain knowledge.

    The effect is the equivalent of:

    public static IEnumerable<T> DistinctConcat<T>(IEnumerable<T> first, IEnumerable<T> second)
    {
        HashSet<T> returned = new HashSet<T>();
        foreach (T element in first)
        {
            if (returned.Add(element))
            {
                yield return element;
            }
        }
        foreach (T element in second)
        {
            if (returned.Add(element))
            {
                yield return element;
            }
        }
    }
    
    0 讨论(0)
  • 2021-02-02 07:02

    You don't know which approach is faster until you measure it. The LINQ way is elegant and easy to understand.

    Another way is to implement an set as an hash array (Dictionary) and add all the elements of both the arrays to the set. Then use set.Keys.ToArray() method to create the resulting array.

    0 讨论(0)
  • 2021-02-02 07:03
    string[] result = list1.Union(list2).ToArray();
    

    from msdn: "This method excludes duplicates from the return set. This is different behavior to the Concat(TSource) method, which returns all the elements in the input sequences including duplicates."

    0 讨论(0)
提交回复
热议问题