What is the fasted way to get a sorted, unique list in python? (I have a list of hashable things, and want to have something I can iterate over - doesn\'t matter whether the lis
This is just something I whipped up in a couple minutes. The function modifies a list in place, and removes consecutive repeats:
def make_unique(lst):
if len(lst) <= 1:
return lst
last = lst[-1]
for i in range(len(lst) - 2, -1, -1):
item = lst[i]
if item == last:
del lst[i]
else:
last = item
Some representative input data:
inp = [
(u"Tomato", "de"), (u"Cherry", "en"), (u"Watermelon", None), (u"Apple", None),
(u"Cucumber", "de"), (u"Lettuce", "de"), (u"Tomato", None), (u"Banana", None),
(u"Squash", "en"), (u"Rubarb", "de"), (u"Lemon", None),
]
Make sure both variants work as wanted:
print inp
print sorted(set(inp))
# copy because we want to modify it in place
inp1 = inp[:]
inp1.sort()
make_unique(inp1)
print inp1
Now to the testing. I'm not using timeit, since I don't want to time the copying of the list, only the sorting. time1
is sorted(set(...)
, time2
is list.sort()
followed by make_unique
, and time3
is the solution with itertools.groupby
by Avinash Y.
import time
def time1(number):
total = 0
for i in range(number):
start = time.clock()
sorted(set(inp))
total += time.clock() - start
return total
def time2(number):
total = 0
for i in range(number):
inp1 = inp[:]
start = time.clock()
inp1.sort()
make_unique(inp1)
total += time.clock() - start
return total
import itertools
def time3(number):
total = 0
for i in range(number):
start = time.clock()
list(k for k,_ in itertools.groupby(sorted(inp)))
total += time.clock() - start
return total
sort + make_unique
is approximately as fast as sorted(set(...))
. I'd have to do a couple more iterations to see which one is potentially faster, but within the variations they are very similar. The itertools
version is a bit slower.
# done each 3 times
print time1(100000)
# 2.38, 3.01, 2.59
print time2(100000)
# 2.88, 2.37, 2.6
print time3(100000)
# 4.18, 4.44, 4.67
Now with a larger list (the + str(i)
is to prevent duplicates):
old_inp = inp[:]
inp = []
for i in range(100):
for j in old_inp:
inp.append((j[0] + str(i), j[1]))
print time1(10000)
# 40.37
print time2(10000)
# 35.09
print time3(10000)
# 40.0
Note that if there are a lot of duplicates in the list, the first version is much faster (since it does less sorting).
inp = []
for i in range(100):
for j in old_inp:
#inp.append((j[0] + str(i), j[1]))
inp.append((j[0], j[1]))
print time1(10000)
# 3.52
print time2(10000)
# 26.33
print time3(10000)
# 20.5