Efficiently remove duplicates, order-agnostic, from list of lists

感情败类 2020-12-21 00:16

The following list has some duplicated sublists, with elements in different order:

l1 = [
    ['The', 'quick', 'brown', 'fox'],
    ['hi', 'there'],
    ['jumps', 'over', 'the', 'lazy', 'dog'],
    ['there', 'hi'],
    ['jumps', 'dog', 'over', 'lazy', 'the'],
]

4 Answers
  • 2020-12-21 00:28

    This:

    l1 = [['The', 'quick', 'brown', 'fox'], ['hi', 'there'], ['jumps', 'over', 'the', 'lazy', 'dog'], ['there', 'hi'], ['jumps', 'dog', 'over','lazy', 'the']]
    s = {tuple(item) for item in map(sorted, l1)}
    l2 = [list(item) for item in s]
    

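    l2 is the list with order-agnostic duplicates removed. Note that each remaining sublist comes back sorted, and because a set is used, the original order of the sublists is not guaranteed. Compare with: Pythonic way of removing reversed duplicates in list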

  • 2020-12-21 00:37

    This one is a little tricky. You want to key a dict off of frozen counters, but counters are not hashable in Python. For a small degradation in the asymptotic complexity, you could use sorted tuples as a substitute for frozen counters:

    seen = set()
    result = []
    for x in l1:
        key = tuple(sorted(x))
        if key not in seen:
            result.append(x)
            seen.add(key)
    

    The same idea in a one-liner would look like this:

    [*{tuple(sorted(k)): k for k in reversed(l1)}.values()][::-1]
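
The double reversal works because a dict comprehension keeps the last value written to a repeated key, while the key keeps the position of its first insertion. One consequence is that the one-liner orders the result by where each group last appears, which matches the loop for the question's `l1` but not for every input. A minimal sketch of the difference, using made-up sample data and hypothetical helper names (`dedupe_loop`, `dedupe_one_liner`):

```python
def dedupe_loop(lists):
    """First occurrence wins, ordered by first occurrence."""
    first = {}
    for item in lists:
        # setdefault stores the value only if the key is not present yet
        first.setdefault(tuple(sorted(item)), item)
    return list(first.values())


def dedupe_one_liner(lists):
    """The one-liner from the answer, wrapped in a function."""
    return [*{tuple(sorted(k)): k for k in reversed(lists)}.values()][::-1]


# Sample data (an assumption): the duplicate group is split by another sublist.
sample = [['hi', 'there'], ['fox'], ['there', 'hi']]

loop_result = dedupe_loop(sample)
one_liner_result = dedupe_one_liner(sample)

print(loop_result)       # [['hi', 'there'], ['fox']]
print(one_liner_result)  # [['fox'], ['hi', 'there']]
```

Both keep `['hi', 'there']` as the representative of its group; only the position in the output differs.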
    
  • 2020-12-21 00:41

    I did a quick benchmark, comparing the various answers:

    l1 = [['The', 'quick', 'brown', 'fox'], ['hi', 'there'], ['jumps', 'over', 'the', 'lazy', 'dog'], ['there', 'hi'], ['jumps', 'dog', 'over','lazy', 'the']]
    
    from collections import Counter
    
    def method1():
        """manually construct set, keyed on sorted tuple"""
        seen = set()
        result = []
        for x in l1:
            key = tuple(sorted(x))
            if key not in seen:
                result.append(x)
                seen.add(key)
        return result
    
    def method2():
        """frozenset-of-Counter"""
        return list({frozenset(Counter(lst).items()): lst for lst in reversed(l1)}.values())[::-1]
    
    def method3():
        """wim"""
        return [*{tuple(sorted(k)): k for k in reversed(l1)}.values()][::-1]
    
    from timeit import timeit
    
    print(timeit(lambda: method1(), number=1000))
    print(timeit(lambda: method2(), number=1000))
    print(timeit(lambda: method3(), number=1000))
    

    Prints (total seconds for 1000 calls of each method):

    0.0025010189856402576
    0.016385524009820074
    0.0026451340527273715
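
The sample input has only five short sublists, so these numbers mostly reflect constant overhead rather than asymptotic behaviour. The following is a hedged sketch of how the comparison might be repeated on longer sublists. The sizes, the seed, and the names `dedupe`, `sorted_key`, and `counter_key` are assumptions for illustration, and no timings are claimed here:

```python
import random
from collections import Counter
from timeit import timeit


def dedupe(lists, key):
    """Keep the first sublist seen for each key, preserving input order."""
    seen = set()
    result = []
    for x in lists:
        k = key(x)
        if k not in seen:
            seen.add(k)
            result.append(x)
    return result


def sorted_key(x):
    # O(n log n) per sublist
    return tuple(sorted(x))


def counter_key(x):
    # O(n) per sublist on average; requires hashable items
    return frozenset(Counter(x).items())


rng = random.Random(0)  # fixed seed so runs are repeatable
base = [[rng.randrange(1000) for _ in range(500)] for _ in range(50)]
# Append a shuffled copy of every sublist so each one has a duplicate.
data = base + [rng.sample(x, len(x)) for x in base]

# Both keys should identify the same duplicates.
assert dedupe(data, sorted_key) == dedupe(data, counter_key)

for name, key in [("sorted tuple", sorted_key), ("frozenset of Counter", counter_key)]:
    print(name, timeit(lambda: dedupe(data, key), number=20))
```

Using the same loop for both keys keeps the comparison focused on the cost of building the key itself.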
    
  • 2020-12-21 00:50

    @wim's answer is inefficient because it sorts each sublist to build a key that identifies its item counts, which costs O(n log n) for a sublist of length n.

    To do the same in linear time, key on a frozenset of item counts built with the collections.Counter class instead. A dict comprehension keeps the last value assigned to a duplicate key, whereas the question wants the first occurrence kept, so construct the dict from the list in reverse order, then reverse the de-duplicated result again:

    from collections import Counter
    list({frozenset(Counter(lst).items()): lst for lst in reversed(l1)}.values())[::-1]
    

    This returns:

    [['The', 'quick', 'brown', 'fox'], ['hi', 'there'], ['jumps', 'over', 'the', 'lazy', 'dog']]
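
The Counter matters when a sublist can contain repeated items: a plain frozenset of the sublist would discard multiplicities and treat different sublists as duplicates. A small sketch with made-up sample data (`a`, `b`, `c` and the helper name `counter_key` are illustrative only):

```python
from collections import Counter


def counter_key(lst):
    """Hashable, order-agnostic key that still records how often each item occurs."""
    return frozenset(Counter(lst).items())


a = ['x', 'x', 'y']
b = ['x', 'y', 'y']  # same distinct items as a, different counts
c = ['y', 'x', 'x']  # a reordering of a

same_plain_set = frozenset(a) == frozenset(b)       # counts are lost
same_counts_ab = counter_key(a) == counter_key(b)   # counts differ
same_counts_ac = counter_key(a) == counter_key(c)   # same counts, different order

print(same_plain_set, same_counts_ab, same_counts_ac)  # True False True
```

Note that this key requires the items to be hashable, whereas the sorted-tuple key requires them to be mutually orderable.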
    