Finding the difference between two lists of strings

挽巷 2021-01-21 07:02

I'm pretty sure this is a duplicate, but I have tried everything, and I still cannot seem to get the differences. I have two lists of strings: listA and listB. I'm trying to

3 answers
  •  被撕碎了的回忆
    2021-01-21 07:32

    All the code you posted should work fine, so the error must be somewhere else. However, you write "these take a really long time", so I suppose you have a performance issue.

    Let's do a quick and dirty comparison (as you know, a good performance test is a long process; self-promotion: the benchmark was done with this free tool). Assumptions:

    • Lists are unordered.
    • There may be duplicates in our inputs but we don't want duplicates in result.
    • The second list is always a subset of the first list (assumed because you're using SymmetricExceptWith; if it is not a subset, the result differs considerably from that of Except). If that was a mistake, just ignore the tests for SymmetricExceptWith.
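
    To illustrate the third assumption, here is a minimal sketch of how the two operations differ. The class and method names (DifferenceSemantics, OneSided, Symmetric) are made up for this example and are not from the question.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class DifferenceSemantics
    {
        // Items of a that are not in b (Except also removes duplicates).
        public static List<string> OneSided(IEnumerable<string> a, IEnumerable<string> b)
            => a.Except(b).ToList();

        // Items that are in exactly one of a and b.
        public static HashSet<string> Symmetric(IEnumerable<string> a, IEnumerable<string> b)
        {
            var set = new HashSet<string>(a);
            set.SymmetricExceptWith(b);   // modifies the set in place, returns void
            return set;
        }

        public static void Main()
        {
            var a = new List<string> { "x", "y", "z", "y" };

            // b is a subset of a: both approaches return the same items.
            var subset = new List<string> { "y" };
            Console.WriteLine(string.Join(",", OneSided(a, subset)));                // x,z
            Console.WriteLine(Symmetric(a, subset).SetEquals(OneSided(a, subset)));  // True

            // b contains an item that a lacks: only the symmetric difference has "w".
            var other = new List<string> { "y", "w" };
            Console.WriteLine(Symmetric(a, other).Contains("w"));   // True
            Console.WriteLine(OneSided(a, other).Contains("w"));    // False
        }
    }
    ```

    So SymmetricExceptWith can stand in for Except only while the subset assumption holds.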

    Two lists of 20,000 random items (test repeated 100 times then averaged, release mode).

    Method                  Time [ms]
    Contains *1                  49.4
    Contains *2                  49.0
    Except                        5.9
    SymmetricExceptWith *3        4.1
    SymmetricExceptWith *4        2.5
    

    Notes:

    *1 Loop with foreach
    *2 Loop with for
    *3 HashSet creation measured
    *4 HashSet creation not measured. I included this for reference, but if your first list is not already a HashSet you can't ignore the creation time.
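
    For reference, the three approaches in the table presumably look something like the sketch below. The question's own code is not shown here, so the class and method names are assumptions, and the Contains variant is only one plausible way of writing that loop.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class DifferenceApproaches
    {
        // "Contains": List<T>.Contains is a linear scan, so the loop is O(n * m).
        // Unlike Except, this version keeps duplicates that occur in a.
        public static List<string> WithContains(List<string> a, List<string> b)
        {
            var result = new List<string>();
            foreach (var item in a)
            {
                if (!b.Contains(item))
                    result.Add(item);
            }
            return result;
        }

        // "Except": LINQ hashes the items of b internally, so lookups are cheap.
        public static List<string> WithExcept(List<string> a, List<string> b)
            => a.Except(b).ToList();

        // "SymmetricExceptWith": equals the one-sided difference only if b is a subset of a.
        public static List<string> WithSymmetricExceptWith(List<string> a, List<string> b)
        {
            var set = new HashSet<string>(a);   // the creation cost that note 3 includes
            set.SymmetricExceptWith(b);
            return set.ToList();
        }

        public static void Main()
        {
            var a = new List<string> { "a", "b", "c", "d" };
            var b = new List<string> { "b", "d" };

            Console.WriteLine(string.Join(",", WithContains(a, b)));   // a,c
            Console.WriteLine(string.Join(",", WithExcept(a, b)));     // a,c
            // Same two items; a HashSet does not guarantee enumeration order.
            Console.WriteLine(string.Join(",", WithSymmetricExceptWith(a, b).OrderBy(s => s)));
        }
    }
    ```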

    We see that the Contains() approach is pretty slow, so we can drop it from the bigger tests (I checked anyway, and its performance does not get better or even comparable). Let's see what happens with lists of 1,000,000 items.

    Method                        Time [ms]
    Except                            244.4
    SymmetricExceptWith               259.0
    

    Let's try to make it parallel (please note that for this test I'm using an old Core 2 Duo 2 GHz):

    Method                        Time [ms]
    Except                            244.4
    SymmetricExceptWith               259.0
    Except (parallel partitions)      301.8
    SymmetricExceptWith (p. p.)       382.6
    Except (AsParallel)               274.4
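
    The code behind the "parallel partitions" rows is not shown, so the following is only a sketch of what such an implementation might look like, next to the PLINQ variant. The names ParallelDifference, ExceptPartitioned and ExceptPlinq are hypothetical.

    ```csharp
    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Linq;
    using System.Threading.Tasks;

    public static class ParallelDifference
    {
        // Assumed shape of a "parallel partitions" Except: filter chunks of a concurrently.
        public static List<string> ExceptPartitioned(List<string> a, List<string> b)
        {
            if (a.Count == 0)
                return new List<string>();   // Partitioner.Create rejects an empty range

            var exclude = new HashSet<string>(b);   // only read, never modified, inside the loop
            var partials = new ConcurrentBag<List<string>>();

            // Split the index range of a into chunks; each chunk fills its own local list.
            Parallel.ForEach(Partitioner.Create(0, a.Count), range =>
            {
                var local = new List<string>();
                for (int i = range.Item1; i < range.Item2; i++)
                {
                    if (!exclude.Contains(a[i]))
                        local.Add(a[i]);
                }
                partials.Add(local);
            });

            // Merge the partial lists; Distinct drops duplicates found in different chunks.
            // The order of the merged result is not deterministic.
            return partials.SelectMany(p => p).Distinct().ToList();
        }

        // The "Except (AsParallel)" row: PLINQ does the partitioning itself.
        public static List<string> ExceptPlinq(List<string> a, List<string> b)
            => a.AsParallel().Except(b.AsParallel()).ToList();
    }
    ```

    The partitioning and merging overhead in a version like this would be one plausible reason why the hand-partitioned variants do not beat plain Except on a dual-core machine.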
    

    Parallel performance is worse, and LINQ Except is now the best option. Let's see how it works on a better CPU (Xeon 2.8 GHz, quad core). Also note that with this much data, cache size won't affect the test much.

    Method                        Time [ms]
    Except                            127.4
    SymmetricExceptWith               149.2
    Except (parallel partitions)      208.0
    SymmetricExceptWith (p. p.)       170.0
    Except (AsParallel)                80.2
    

    To summarize: for relatively small lists SymmetricExceptWith() performs better; for big lists Except() was better in every test. If you're targeting a modern multi-core CPU, the parallel implementation will scale much better. In code:

    // Sequential (LINQ):
    var c = a.Except(b).ToList();
    // Parallel (PLINQ), as an alternative to the line above:
    var cParallel = a.AsParallel().Except(b.AsParallel()).ToList();
    

    Please note that if you don't need a List as the result and an IEnumerable is enough, performance will improve greatly (and the difference compared with parallel execution will be larger).
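
    One reason is that Except is evaluated lazily: without ToList() no result list is allocated, and nothing is computed until the sequence is enumerated. A small sketch follows; the DeferredDifference class and its Difference method are names invented for this example.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class DeferredDifference
    {
        // Returns the query itself; no work is done until someone enumerates it.
        public static IEnumerable<string> Difference(List<string> a, List<string> b)
            => a.Except(b);

        public static void Main()
        {
            var a = new List<string> { "x", "y", "z" };
            var b = new List<string> { "y" };

            IEnumerable<string> lazy = Difference(a, b);   // nothing computed yet
            a.Add("w");                                    // still picked up by the query

            // Enumeration can stop early; here only the first result is requested.
            foreach (var item in lazy.Take(1))
                Console.WriteLine(item);                   // x

            Console.WriteLine(lazy.Count());               // 3 (x, z, w)
        }
    }
    ```

    Keep in mind that the work is only postponed: every new enumeration of the query runs the difference again.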

    Of course, those two lines of code are not optimal and can be greatly improved (and if this is really performance critical, you may take the ParallelEnumerable.Except() implementation as a starting point for your own highly optimized routine).
