Removing duplicates from a list of lists

萌比男神i 2020-11-22 10:37

I have a list of lists in Python:

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

And I want to remove duplicate elements from it. Was if it

12 answers
  • 2020-11-22 11:07

    Another, probably more generic and simpler, solution is to create a dictionary keyed by the string version of the objects and get the values() at the end:

    >>> dict([(unicode(a),a) for a in [["A", "A"], ["A", "A"], ["A", "B"]]]).values()
    [['A', 'B'], ['A', 'A']]
    

    The catch is that this only works for objects whose string representation is a good-enough unique key (which is true for most native objects).
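
    The session above is Python 2 (unicode() no longer exists in Python 3). A minimal Python 3 sketch of the same idea follows; the helper name dedupe_by_repr is made up for illustration, and it assumes repr() is a good-enough unique key for the items:

```python
def dedupe_by_repr(items):
    """Drop duplicates by keying a dict on each item's repr()."""
    # Dicts keep insertion order in Python 3.7+, so each key stays at the
    # position of its first occurrence; a later duplicate only overwrites
    # the stored value, it does not move the key.
    return list({repr(item): item for item in items}.values())


k = [["A", "A"], ["A", "A"], ["A", "B"]]
print(dedupe_by_repr(k))  # [['A', 'A'], ['A', 'B']]
```

    Note that the key is the text, not the value: [1] and [1.0] compare equal but have different string forms, so both would survive.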

  • 2020-11-22 11:12
    >>> k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
    >>> import itertools
    >>> k.sort()
    >>> list(k for k,_ in itertools.groupby(k))
    [[1, 2], [3], [4], [5, 6, 2]]
    

    itertools often offers the fastest and most powerful solutions to this kind of problem, and is well worth getting intimately familiar with!

    Edit: as I mention in a comment, normal optimization efforts focus on large inputs (the big-O approach) because that is so much easier that it offers good returns on effort. But sometimes (essentially for "tragically crucial bottlenecks" in deep inner loops of code that's pushing the boundaries of performance limits) one may need to go into much more detail: providing probability distributions, deciding which performance measures to optimize (maybe the upper bound or the 90th percentile is more important than an average or median, depending on one's apps), performing possibly-heuristic checks at the start to pick different algorithms depending on input data characteristics, and so forth.

    Careful measurements of "point" performance (code A vs code B for a specific input) are a part of this extremely costly process, and the standard library module timeit helps here. However, it's easier to use it at a shell prompt. For example, here's a short module to showcase the general approach for this problem; save it as nodup.py:

    import itertools
    
    k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
    
    def doset(k, map=map, list=list, set=set, tuple=tuple):
      return map(list, set(map(tuple, k)))
    
    def dosort(k, sorted=sorted, xrange=xrange, len=len):
      ks = sorted(k)
      return [ks[i] for i in xrange(len(ks)) if i == 0 or ks[i] != ks[i-1]]
    
    def dogroupby(k, sorted=sorted, groupby=itertools.groupby):
      ks = sorted(k)
      # use the hoisted local name, not the itertools attribute lookup
      return [i for i, _ in groupby(ks)]
    
    def donewk(k):
      newk = []
      for i in k:
        if i not in newk:
          newk.append(i)
      return newk
    
    # sanity check that all functions compute the same result and don't alter k
    if __name__ == '__main__':
      savek = list(k)
      for f in doset, dosort, dogroupby, donewk:
        resk = f(k)
        assert k == savek
        print '%10s %s' % (f.__name__, sorted(resk))
    

    Note the sanity check (performed when you just do python nodup.py) and the basic hoisting technique (make constant global names local to each function for speed) to put things on equal footing.

    Now we can run checks on the tiny example list:

    $ python -mtimeit -s'import nodup' 'nodup.doset(nodup.k)'
    100000 loops, best of 3: 11.7 usec per loop
    $ python -mtimeit -s'import nodup' 'nodup.dosort(nodup.k)'
    100000 loops, best of 3: 9.68 usec per loop
    $ python -mtimeit -s'import nodup' 'nodup.dogroupby(nodup.k)'
    100000 loops, best of 3: 8.74 usec per loop
    $ python -mtimeit -s'import nodup' 'nodup.donewk(nodup.k)'
    100000 loops, best of 3: 4.44 usec per loop
    

    confirming that the quadratic approach has small-enough constants to make it attractive for tiny lists with few duplicated values. With a short list without duplicates:

    $ python -mtimeit -s'import nodup' 'nodup.donewk([[i] for i in range(12)])'
    10000 loops, best of 3: 25.4 usec per loop
    $ python -mtimeit -s'import nodup' 'nodup.dogroupby([[i] for i in range(12)])'
    10000 loops, best of 3: 23.7 usec per loop
    $ python -mtimeit -s'import nodup' 'nodup.doset([[i] for i in range(12)])'
    10000 loops, best of 3: 31.3 usec per loop
    $ python -mtimeit -s'import nodup' 'nodup.dosort([[i] for i in range(12)])'
    10000 loops, best of 3: 25 usec per loop
    

    the quadratic approach isn't bad, but the sort and groupby ones are better. Etc, etc.
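
    The command-line runs above report only the best timing. To look at other measures, such as the median or the 90th percentile mentioned earlier, timeit.repeat can be called from Python. This is a rough Python 3 sketch; the helper name timing_summary and the repeat/number values are arbitrary choices for illustration, not part of nodup.py:

```python
import statistics
import timeit


def dedupe_quadratic(k):
    """Same approach as donewk: keep items not seen so far."""
    newk = []
    for item in k:
        if item not in newk:
            newk.append(item)
    return newk


def timing_summary(func, data, repeat=20, number=1000):
    """Return (median, approximate 90th percentile, worst) seconds per call."""
    # Each entry of totals is the time for `number` calls in one repetition.
    totals = timeit.repeat(lambda: func(data), repeat=repeat, number=number)
    per_call = sorted(t / number for t in totals)
    # Nearest-rank style index, clamped to the last element.
    p90 = per_call[min(len(per_call) - 1, int(0.9 * len(per_call)))]
    return statistics.median(per_call), p90, per_call[-1]


k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
median, p90, worst = timing_summary(dedupe_quadratic, k)
print("median %.2e s, p90 %.2e s, worst %.2e s" % (median, p90, worst))
```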

    If (as the obsession with performance suggests) this operation is at a core inner loop of your pushing-the-boundaries application, it's worth trying the same set of tests on other representative input samples, possibly detecting some simple measure that could heuristically let you pick one or the other approach (but the measure must be fast, of course).
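
    As a rough illustration of such a heuristic, the sketch below switches on input length alone. The names and the cutoff value are hypothetical; a real cutoff would have to come from measurements on representative data:

```python
import itertools

# Hypothetical cutoff, to be tuned with timeit on representative inputs.
SMALL_INPUT_LIMIT = 8


def dedupe_small(k):
    """Quadratic scan; keeps first-occurrence order."""
    out = []
    for item in k:
        if item not in out:
            out.append(item)
    return out


def dedupe_sorted(k):
    """Sort, then keep one item per run of equal neighbours."""
    return [key for key, _ in itertools.groupby(sorted(k))]


def dedupe_auto(k, limit=SMALL_INPUT_LIMIT):
    """Pick an approach from a cheap measure: the input length."""
    if len(k) <= limit:
        return dedupe_small(k)
    return dedupe_sorted(k)


print(dedupe_auto([[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]))
# [[1, 2], [4], [5, 6, 2], [3]]
```

    The two branches return the items in different orders (first occurrence vs. sorted), so this only fits callers that don't depend on order.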

    It's also well worth considering keeping a different representation for k -- why does it have to be a list of lists rather than a set of tuples in the first place? If the duplicate removal task is frequent, and profiling shows it to be the program's performance bottleneck, keeping a set of tuples all the time and getting a list of lists from it only if and where needed, might be faster overall, for example.
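
    A minimal sketch of that alternative representation, assuming the rows only need to become lists at the point of use (the variable names are illustrative):

```python
# Keep the data as a set of tuples, so duplicates never accumulate.
rows = set()
for row in [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]:
    rows.add(tuple(row))  # adding an existing tuple is a no-op

# Convert back to a list of lists only where one is actually needed.
as_lists = sorted(list(t) for t in rows)
print(as_lists)  # [[1, 2], [3], [4], [5, 6, 2]]
```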

  • 2020-11-22 11:15

    Doing it manually, creating a new k list and adding entries not found so far:

    k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
    new_k = []
    for elem in k:
        if elem not in new_k:
            new_k.append(elem)
    k = new_k
    print(k)
    # prints [[1, 2], [4], [5, 6, 2], [3]]
    

    Simple to comprehend, and it preserves the order of the first occurrence of each element, should that be useful. It is quadratic in the worst case, though, since each element is searched for in the whole of new_k.

  • 2020-11-22 11:16

    All the set-related solutions to this problem thus far require creating an entire set before iteration.

    It is possible to make this lazy, and at the same time preserve order, by iterating the list of lists and adding to a "seen" set. Then only yield a list if it is not found in this tracker set.

    This unique_everseen recipe is available in the itertools docs. It's also available in the third-party toolz library:

    from toolz import unique
    
    k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
    
    # lazy iterator
    res = map(list, unique(map(tuple, k)))
    
    print(list(res))
    
    [[1, 2], [4], [5, 6, 2], [3]]
    

    Note that tuple conversion is necessary because lists are not hashable.
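
    For readers without toolz installed, the "seen" set idea described above can be sketched with the standard library alone. The generator name unique_lists is made up for this example; it is not the itertools recipe itself:

```python
def unique_lists(rows):
    """Lazily yield each list the first time its contents are seen."""
    seen = set()
    for row in rows:
        key = tuple(row)  # lists are unhashable, tuples are not
        if key not in seen:
            seen.add(key)
            yield row


k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
print(list(unique_lists(k)))  # [[1, 2], [4], [5, 6, 2], [3]]
```

    Because it is a generator, nothing is consumed until the caller iterates, and the input can itself be an iterator.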

  • 2020-11-22 11:16

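    This should work. Note that comparing via set(...) ignores element order and repetition, so [1, 2], [2, 1] and [1, 1, 2] are all treated as duplicates of one another.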

    k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
    
    k_cleaned = []
    for ele in k:
        if set(ele) not in [set(x) for x in k_cleaned]:
            k_cleaned.append(ele)
    print(k_cleaned)
    
    # output: [[1, 2], [4], [5, 6, 2], [3]]
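
    The loop above rebuilds a list of sets on every iteration. If order-insensitive matching is really what is wanted, a sketch along these lines keeps a set of frozensets instead; the function name is hypothetical, and it assumes the inner elements are hashable:

```python
def dedupe_ignoring_order(rows):
    """Keep the first list for each distinct set of elements."""
    seen = set()
    out = []
    for row in rows:
        key = frozenset(row)  # ignores order and repeated elements
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out


print(dedupe_ignoring_order([[1, 2], [2, 1], [1, 1, 2], [3]]))  # [[1, 2], [3]]
```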
    
  • 2020-11-22 11:19
    k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [5, 2], [3], [8], [9]]
    kl = []
    kl.extend(x for x in k if x not in kl)
    k = list(kl)
    print(k)
    

    which prints:

    [[1, 2], [4], [5, 6, 2], [3], [5, 2], [8], [9]]
    