Fastest way to get sorted unique list in python?

后端 未结 5 2015
谎友^
谎友^ 2021-02-01 06:16

What is the fasted way to get a sorted, unique list in python? (I have a list of hashable things, and want to have something I can iterate over - doesn\'t matter whether the lis

相关标签:
5条回答
  • 2021-02-01 06:32

    I believe sorted(set(sequence)) is the fastest way of doing it. Yes, set iterates over the sequence but that's a C-level loop, which is a lot faster than any looping you would do at python level.

    Note that even with groupby you still have O(n) + O(nlogn) = O(nlogn) and what's worst is that groupby will require a python-level loop, which increases dramatically the constants in that O(n) thus in the end you obtain worst results.

    When speaking of CPython the way to optimize things is to do as much as you can at C-level (see this answer to have an other example of counter-intuitive performance). To have a faster solution you must reimplement a sort, in a C-extensions. And even then, good luck with obtaining something as fast as python's Timsort!

    A small comparison of the "canonical solution" versus the groupby solution:

    >>> import timeit
    >>> sequence = list(range(500)) + list(range(700)) + list(range(1000))
    >>> timeit.timeit('sorted(set(sequence))', 'from __main__ import sequence', number=1000)
    0.11532402038574219
    >>> import itertools
    >>> def my_sort(seq):
    ...     return list(k for k,_ in itertools.groupby(sorted(seq)))
    ... 
    >>> timeit.timeit('my_sort(sequence)', 'from __main__ import sequence, my_sort', number=1000)
    0.3162040710449219
    

    As you can see it's 3 times slower.

    The version provided by jdm is actually even worse:

    >>> def make_unique(lst):
    ...     if len(lst) <= 1:
    ...         return lst
    ...     last = lst[-1]
    ...     for i in range(len(lst) - 2, -1, -1):
    ...         item = lst[i]
    ...         if item == last:
    ...             del lst[i]
    ...         else:
    ...             last = item
    ... 
    >>> def my_sort2(seq):
    ...     make_unique(sorted(seq))
    ... 
    >>> timeit.timeit('my_sort2(sequence)', 'from __main__ import sequence, my_sort2', number=1000)
    0.46814608573913574
    

    Almost 5 times slower. Note that using seq.sort() and then make_unique(seq) and make_unique(sorted(seq)) are actually the same thing, since Timsort uses O(n) space you always have some reallocation, so using sorted(seq) does not actually change much the timings.

    The jdm's benchmarks give different results because the input he is using are way too small and thus all the time is taken by the time.clock() calls.

    0 讨论(0)
  • 2021-02-01 06:32

    This is just something I whipped up in a couple minutes. The function modifies a list in place, and removes consecutive repeats:

    def make_unique(lst):
        if len(lst) <= 1:
            return lst
        last = lst[-1]
        for i in range(len(lst) - 2, -1, -1):
            item = lst[i]
            if item == last:
                del lst[i]
            else:
                last = item
    

    Some representative input data:

    inp = [
    (u"Tomato", "de"), (u"Cherry", "en"), (u"Watermelon", None), (u"Apple", None),
    (u"Cucumber", "de"), (u"Lettuce", "de"), (u"Tomato", None), (u"Banana", None),
    (u"Squash", "en"), (u"Rubarb", "de"), (u"Lemon", None),
    ]
    

    Make sure both variants work as wanted:

    print inp
    print sorted(set(inp))
    # copy because we want to modify it in place
    inp1 = inp[:]
    inp1.sort()
    make_unique(inp1)
    print inp1
    

    Now to the testing. I'm not using timeit, since I don't want to time the copying of the list, only the sorting. time1 is sorted(set(...), time2 is list.sort() followed by make_unique, and time3 is the solution with itertools.groupby by Avinash Y.

    import time
    def time1(number):
        total = 0
        for i in range(number):
            start = time.clock()
            sorted(set(inp))
            total += time.clock() - start
        return total
    
    def time2(number):
        total = 0
        for i in range(number):
            inp1 = inp[:]
            start = time.clock()
            inp1.sort()
            make_unique(inp1)
            total += time.clock() - start
        return total
    
    import itertools 
    
    def time3(number): 
        total = 0 
        for i in range(number): 
            start = time.clock() 
            list(k for k,_ in itertools.groupby(sorted(inp))) 
            total += time.clock() - start 
        return total
    

    sort + make_unique is approximately as fast as sorted(set(...)). I'd have to do a couple more iterations to see which one is potentially faster, but within the variations they are very similar. The itertools version is a bit slower.

    # done each 3 times
    print time1(100000)
    # 2.38, 3.01, 2.59
    print time2(100000)
    # 2.88, 2.37, 2.6
    print time3(100000)
    # 4.18, 4.44, 4.67
    

    Now with a larger list (the + str(i) is to prevent duplicates):

    old_inp = inp[:]
    inp = []
    for i in range(100):
        for j in old_inp:
            inp.append((j[0] + str(i), j[1]))
    
    print time1(10000)
    # 40.37
    print time2(10000)
    # 35.09
    print time3(10000)
    # 40.0
    

    Note that if there are a lot of duplicates in the list, the first version is much faster (since it does less sorting).

    inp = []
    for i in range(100):
        for j in old_inp:
            #inp.append((j[0] + str(i), j[1]))
            inp.append((j[0], j[1]))
    
    print time1(10000)
    # 3.52
    print time2(10000)
    # 26.33
    print time3(10000)
    # 20.5
    
    0 讨论(0)
  • 2021-02-01 06:49

    Maybe this is not the answer you are searching for, but anyway, you should take this into your consideration.

    Basically, you have 2 operations on a list:

    unique_list = set(your_list)       # O(n) complexity
    sorted_list = sorted(unique_list)  # O(nlogn) complexity
    

    Now, you say "it seems to me that first checking for uniqueness and then sorting is wasteful", and you are right. But, how bad really is that redundant step? Take n = 1000000:

    # sorted(set(a_list))
    O(n) => 1000000
    o(nlogn) => 1000000 * 20 = 20000000
    Total => 21000000
    
    # Your fastest way
    O(nlogn) => 20000000
    Total: 20000000
    

    Speed gain: (1 - 20000000/21000000) * 100 = 4.76 %

    For n = 5000000, speed gain: ~1.6 %

    Now, is that optimization worth it?

    0 讨论(0)
  • 2021-02-01 06:54
    >>> import itertools
    >>> a=[2,3,4,1,2,7,8,3]
    >>> list(k for k,_ in itertools.groupby(sorted(a)))
    [1, 2, 3, 4, 7, 8]
    
    0 讨论(0)
  • 2021-02-01 06:56
    import numpy as np
    np.unique(...)
    

    The np.unique function returns an ndarray unique and sorted based on an array-like parameter. This will work with any numpy types, but also regular python values that are orderable.

    If you need a regular python list, use np.unique(...).tolist()

    0 讨论(0)
提交回复
热议问题