Fastest way to get sorted unique list in Python?

谎友^ 2021-02-01 06:16

What is the fastest way to get a sorted, unique list in Python? (I have a list of hashable things, and want to have something I can iterate over - doesn't matter whether the lis

5 answers
  •  太阳男子
    2021-02-01 06:32

    This is just something I whipped up in a couple minutes. The function modifies a list in place, and removes consecutive repeats:

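    def make_unique(lst):
        """Remove consecutive repeats from lst, in place. Returns None."""
        if len(lst) <= 1:
            return
        last = lst[-1]
        # Iterate backwards so deleting an item does not shift unvisited indices.
        for i in range(len(lst) - 2, -1, -1):
            item = lst[i]
            if item == last:
                del lst[i]
            else:
                last = item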
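
To illustrate what "removes consecutive repeats" means in practice, here is a minimal sketch in Python 3 syntax. The sample lists are hypothetical and not taken from the post; the function body repeats the logic above. It suggests that the function only deduplicates fully when the list has been sorted first, so that equal items sit next to each other.

```python
def make_unique(lst):
    """Remove consecutive repeats from lst in place (same logic as above)."""
    if len(lst) <= 1:
        return
    last = lst[-1]
    # Walk backwards so deletions do not shift the indices still to be visited.
    for i in range(len(lst) - 2, -1, -1):
        if lst[i] == last:
            del lst[i]
        else:
            last = lst[i]


# Hypothetical sample data, not from the original post.
unsorted = ["b", "a", "b", "c", "a"]
make_unique(unsorted)  # no equal neighbours, so nothing is removed

ordered = sorted(["b", "a", "b", "c", "a"])  # ['a', 'a', 'b', 'b', 'c']
make_unique(ordered)  # duplicates are adjacent now and get dropped

print(unsorted)  # ['b', 'a', 'b', 'c', 'a']
print(ordered)   # ['a', 'b', 'c']
```

On sorted input the result should match sorted(set(...)) for the same items.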
    

    Some representative input data:

    inp = [
    (u"Tomato", "de"), (u"Cherry", "en"), (u"Watermelon", None), (u"Apple", None),
    (u"Cucumber", "de"), (u"Lettuce", "de"), (u"Tomato", None), (u"Banana", None),
    (u"Squash", "en"), (u"Rubarb", "de"), (u"Lemon", None),
    ]
    

    Make sure both variants work as wanted:

    print inp
    print sorted(set(inp))
    # copy because we want to modify it in place
    inp1 = inp[:]
    inp1.sort()
    make_unique(inp1)
    print inp1
    

    Now to the testing. I'm not using timeit, since I don't want to time the copying of the list, only the sorting. time1 is sorted(set(...)), time2 is list.sort() followed by make_unique, and time3 is the solution with itertools.groupby by Avinash Y.

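    import itertools
    import time

    # Note: time.clock() is Python 2 era; it was removed in Python 3.8.

    def time1(number):
        # sorted(set(...)): deduplicate first, then sort
        total = 0
        for i in range(number):
            start = time.clock()
            sorted(set(inp))
            total += time.clock() - start
        return total

    def time2(number):
        # list.sort() followed by make_unique; the copy is made outside the timed part
        total = 0
        for i in range(number):
            inp1 = inp[:]
            start = time.clock()
            inp1.sort()
            make_unique(inp1)
            total += time.clock() - start
        return total

    def time3(number):
        # sort everything, then collapse runs of equal items with itertools.groupby
        total = 0
        for i in range(number):
            start = time.clock()
            list(k for k, _ in itertools.groupby(sorted(inp)))
            total += time.clock() - start
        return total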
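
For comparison, the same "don't time the copy" idea can probably also be expressed with timeit, since its setup statement runs outside the timed region. The following is only a sketch in Python 3 under stated assumptions: the short inp list is a hypothetical stand-in for the post's data, and only the in-place sort is timed. With number=1, the setup (the copy) is re-run before every single timed statement.

```python
import timeit

# Hypothetical stand-in for the post's `inp`; any list of sortable items works.
inp = ["Tomato", "Cherry", "Apple", "Tomato", "Banana", "Cherry"]

# number=1 with a large repeat count: the copy in `setup` is made fresh for
# every timed call and is not included in the measured time.
times = timeit.repeat(
    stmt="data.sort()",
    setup="data = inp[:]",
    repeat=1000,
    number=1,
    globals={"inp": inp},
)

total = sum(times)  # total seconds spent in the 1000 timed sort calls
print(total)
```

For very small lists the per-call timer overhead may dominate a measurement like this, which is one reason a hand-written loop as in the post can still be a sensible choice.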
    

    sort + make_unique is approximately as fast as sorted(set(...)). I'd have to run more iterations to tell which one is actually faster; the difference between them is within the run-to-run variation. The itertools version is a bit slower.

    # each run 3 times; totals in seconds
    print time1(100000)
    # 2.38, 3.01, 2.59
    print time2(100000)
    # 2.88, 2.37, 2.6
    print time3(100000)
    # 4.18, 4.44, 4.67
    

    Now with a larger list (the + str(i) is to prevent duplicates):

    old_inp = inp[:]
    inp = []
    for i in range(100):
        for j in old_inp:
            inp.append((j[0] + str(i), j[1]))
    
    print time1(10000)
    # 40.37
    print time2(10000)
    # 35.09
    print time3(10000)
    # 40.0
    

    Note that if there are a lot of duplicates in the list, the first version (sorted(set(...))) is much faster: set() throws away the duplicates before sorting, so far fewer items have to be sorted.

    inp = []
    for i in range(100):
        for j in old_inp:
            #inp.append((j[0] + str(i), j[1]))
            inp.append((j[0], j[1]))
    
    print time1(10000)
    # 3.52
    print time2(10000)
    # 26.33
    print time3(10000)
    # 20.5
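
One way to see where the gap in the duplicate-heavy case likely comes from is to count how many items each approach has to sort. This is a rough sketch in Python 3 with hypothetical data (11 distinct strings repeated 100 times, mirroring the shape of the list above but not its contents).

```python
import itertools

# Hypothetical data: 11 distinct strings, repeated 100 times (1100 items).
base = ["item%d" % n for n in range(11)]
inp = base * 100

# set() removes duplicates first, so only 11 items reach the sort.
via_set = sorted(set(inp))

# Here all 1100 items are sorted before the duplicates are collapsed.
via_groupby = [k for k, _ in itertools.groupby(sorted(inp))]

print(len(inp), len(set(inp)))  # 1100 11
assert via_set == via_groupby
```

Both approaches give the same result; they differ only in how much data goes through the sort.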
    
