Fastest way to get sorted unique list in python?

谎友^ 2021-02-01 06:16

What is the fastest way to get a sorted, unique list in Python? (I have a list of hashable things, and want to have something I can iterate over - doesn't matter whether the lis

5 answers

太阳男子 (OP)

2021-02-01 06:32
I believe sorted(set(sequence)) is the fastest way of doing it. Yes, set iterates over the sequence, but that's a C-level loop, which is a lot faster than any loop you would write at the Python level.
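As a minimal sketch of that approach (the helper name sorted_unique and the sample data are illustrative assumptions, not part of the question):

```python
def sorted_unique(items):
    """Return a new sorted list containing each distinct item once.

    Items must be hashable (for set) and mutually orderable (for sorted).
    """
    return sorted(set(items))


words = ["pear", "apple", "fig", "apple", "pear"]
result = sorted_unique(words)
print(result)  # ['apple', 'fig', 'pear']
```

The result is a plain list, so it can be iterated over or indexed as usual; the input sequence is left unchanged.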

Note that even with groupby you still have O(n) + O(n log n) = O(n log n), and what's worse, groupby requires a Python-level loop, which dramatically increases the constant factor of that O(n) step, so in the end you get worse results.

When speaking of CPython, the way to optimize things is to do as much as you can at the C level (see this answer for another example of counter-intuitive performance). To get a faster solution you would have to reimplement a sort in a C extension. And even then, good luck obtaining something as fast as Python's Timsort!

A small comparison of the "canonical solution" versus the groupby solution:
```
>>> import timeit
>>> sequence = list(range(500)) + list(range(700)) + list(range(1000))
>>> timeit.timeit('sorted(set(sequence))', 'from __main__ import sequence', number=1000)
0.11532402038574219
>>> import itertools
>>> def my_sort(seq):
...     return list(k for k,_ in itertools.groupby(sorted(seq)))
... 
>>> timeit.timeit('my_sort(sequence)', 'from __main__ import sequence, my_sort', number=1000)
0.3162040710449219
```
As you can see, it's almost 3 times slower.

The version provided by jdm is actually even worse:
```
>>> def make_unique(lst):
...     if len(lst) <= 1:
...         return lst
...     last = lst[-1]
...     for i in range(len(lst) - 2, -1, -1):
...         item = lst[i]
...         if item == last:
...             del lst[i]
...         else:
...             last = item
... 
>>> def my_sort2(seq):
...     lst = sorted(seq)
...     make_unique(lst)  # removes adjacent duplicates in place
...     return lst
... 
>>> timeit.timeit('my_sort2(sequence)', 'from __main__ import sequence, my_sort2', number=1000)
0.46814608573913574
```
About 4 times slower. Note that calling seq.sort() followed by make_unique(seq) costs essentially the same as make_unique(sorted(seq)): Timsort uses O(n) extra space, so some allocation always happens, and using sorted(seq) barely changes the timings.

jdm's benchmarks give different results because the inputs he uses are far too small, so nearly all of the measured time is spent in the time.clock() calls.
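To keep timer overhead from dominating the measurement, one option is to let timeit run each call many times and keep the best run. The sketch below is an assumption about how such a comparison could be set up, not jdm's original benchmark; the names via_set, via_groupby and best_time and the repeat counts are illustrative. It also avoids time.clock(), which was removed in Python 3.8.

```python
import itertools
import timeit


def via_set(seq):
    # Deduplicate with a set first, then sort the distinct items.
    return sorted(set(seq))


def via_groupby(seq):
    # Sort first, then collapse runs of equal neighbours.
    return [k for k, _ in itertools.groupby(sorted(seq))]


def best_time(func, seq, number=200, repeat=5):
    """Best total time in seconds for `number` calls, over `repeat` runs."""
    return min(timeit.repeat(lambda: func(seq), number=number, repeat=repeat))


sequence = list(range(500)) + list(range(700)) + list(range(1000))

# Both approaches must agree before their timings are worth comparing.
assert via_set(sequence) == via_groupby(sequence)

for func in (via_set, via_groupby):
    print(f"{func.__name__}: {best_time(func, sequence):.4f} s")
```

Taking the minimum over several repeats is the usual way to reduce noise from other processes; the absolute numbers will differ from machine to machine.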