How do I find the duplicates in a list and create another list with them?

梦谈多话 2020-11-22 00:56

How can I find the duplicates in a Python list and create another list of the duplicates? The list only contains integers.

30 answers
  • 2020-11-22 01:12

    Probably the simplest way, without using sets, is something like the code below: the first snippet builds the de-duplicated list, and the second separates unique values from duplicate values. This may be useful in an interview where you are asked not to use sets.

    a = [1, 2, 3, 3, 3]
    dup = []                 # note: this ends up holding each value once (a de-duplicated list)
    for each in a:
        if each not in dup:
            dup.append(each)
    print(dup)               # [1, 2, 3]
    

    Alternatively, to get two separate lists of unique values and duplicate values (a dict-based variant is sketched after this snippet):

    a = [1, 2, 3, 3, 3]
    uniques = []
    dups = []

    for each in a:
        if each not in uniques:
            uniques.append(each)
        else:
            dups.append(each)

    print("Unique values are below:")
    print(uniques)           # [1, 2, 3]
    print("Duplicate values are below:")
    print(dups)              # [3, 3]
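
    A variant of the same idea, still without sets: using a dict keyed by value keeps the membership checks O(1) instead of re-scanning the growing lists. This is only a sketch, not part of the original answer; the `seen` dict maps each value to how many times it has been encountered, and each duplicate is recorded once.

    a = [1, 2, 3, 3, 3]
    seen = {}        # value -> number of times encountered so far
    uniques = []
    dups = []

    for each in a:
        count = seen.get(each, 0)
        if count == 0:
            uniques.append(each)       # first time we see this value
        elif count == 1:
            dups.append(each)          # second time: record it as a duplicate once
        seen[each] = count + 1

    print(uniques)   # [1, 2, 3]
    print(dups)      # [3]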
    
  • 2020-11-22 01:12

    Here's a fast generator that uses a dict, storing each element as a key with a boolean value that tracks whether that duplicate has already been yielded.

    For lists with all elements that are hashable types:

    def gen_dupes(array):
        unique = {}
        for value in array:
            if value in unique and unique[value]:
                # second occurrence: mark it as already reported and yield it once
                unique[value] = False
                yield value
            elif value not in unique:
                # first occurrence: just remember it
                unique[value] = True
            # third and later occurrences match neither branch and are skipped
    
    array = [1, 2, 2, 3, 4, 1, 5, 2, 6, 6]
    print(list(gen_dupes(array)))
    # => [2, 1, 6]
    

    For lists that might contain lists:

    def gen_dupes(array):
        unique = {}
        for value in array:
            is_list = False
            if type(value) is list:
                # lists are unhashable, so use a tuple as the dict key
                value = tuple(value)
                is_list = True

            if value in unique and unique[value]:
                # second occurrence: mark as reported, convert back if needed, yield once
                unique[value] = False
                if is_list:
                    value = list(value)
                yield value
            elif value not in unique:
                # first occurrence: just remember it
                unique[value] = True
    
    array = [1, 2, 2, [1, 2], 3, 4, [1, 2], 5, 2, 6, 6]
    print(list(gen_dupes(array)))
    # => [2, [1, 2], 6]
    
  • 2020-11-22 01:13
    def removeduplicates(a):
        # builds and returns the set of unique values (i.e. removes duplicates);
        # it does not tell you which values were duplicated
        seen = set()
        for i in a:
            if i not in seen:
                seen.add(i)
        return seen

    print(removeduplicates([1, 1, 2, 2]))   # {1, 2}
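
    The function above returns the set of unique values; a minimal variant of the same set-based idea that instead collects the duplicates the question asks for (a sketch; the name find_duplicates is not from the original answer):

    def find_duplicates(a):
        seen = set()
        dups = set()
        for i in a:
            if i in seen:
                dups.add(i)      # seen before -> it is a duplicate
            else:
                seen.add(i)
        return list(dups)

    print(find_duplicates([1, 1, 2, 2, 3]))   # [1, 2]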
    
  • 2020-11-22 01:15

    A one-liner, for fun, and for when a single statement is required (Python 2; a Python 3 variant is sketched below).

    (lambda iterable: reduce(lambda (uniq, dup), item: (uniq, dup | {item}) if item in uniq else (uniq | {item}, dup), iterable, (set(), set())))(some_iterable)
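
    The tuple unpacking in the lambda parameters is Python 2-only syntax (and reduce is a builtin there). A rough Python 3 equivalent of the same fold, shown with example data since some_iterable is not defined above:

    from functools import reduce

    some_iterable = [1, 2, 3, 2, 1, 5, 6, 5, 5, 5]
    uniq, dup = reduce(
        lambda acc, item: (acc[0], acc[1] | {item}) if item in acc[0]
                          else (acc[0] | {item}, acc[1]),
        some_iterable,
        (set(), set()),
    )
    print(dup)   # {1, 2, 5}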
    
  • 2020-11-22 01:16

    You don't need the count, just whether or not the item was seen before. Adapted that answer to this problem:

    def list_duplicates(seq):
      seen = set()
      seen_add = seen.add
      # adds all elements it doesn't know yet to seen and all other to seen_twice
      seen_twice = set( x for x in seq if x in seen or seen_add(x) )
      # turn the set into a list (as requested)
      return list( seen_twice )
    
    a = [1,2,3,2,1,5,6,5,5,5]
    list_duplicates(a) # yields [1, 2, 5]
    

    Just in case speed matters, here are some timings:

    # file: test.py
    import collections
    
    def thg435(l):
        return [x for x, y in collections.Counter(l).items() if y > 1]
    
    def moooeeeep(l):
        seen = set()
        seen_add = seen.add
        # adds all elements it doesn't know yet to seen and all other to seen_twice
        seen_twice = set( x for x in l if x in seen or seen_add(x) )
        # turn the set into a list (as requested)
        return list( seen_twice )
    
    def RiteshKumar(l):
        return list(set([x for x in l if l.count(x) > 1]))
    
    def JohnLaRooy(L):
        seen = set()
        seen2 = set()
        seen_add = seen.add
        seen2_add = seen2.add
        for item in L:
            if item in seen:
                seen2_add(item)
            else:
                seen_add(item)
        return list(seen2)
    
    l = [1,2,3,2,1,5,6,5,5,5]*100
    

    Here are the results: (well done @JohnLaRooy!)

    $ python -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
    10000 loops, best of 3: 74.6 usec per loop
    $ python -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
    10000 loops, best of 3: 91.3 usec per loop
    $ python -mtimeit -s 'import test' 'test.thg435(test.l)'
    1000 loops, best of 3: 266 usec per loop
    $ python -mtimeit -s 'import test' 'test.RiteshKumar(test.l)'
    100 loops, best of 3: 8.35 msec per loop
    

    Interestingly, beyond the raw timings, the ranking also changes slightly when PyPy is used. Most interestingly, the Counter-based approach benefits hugely from PyPy's optimizations, whereas the method-caching approach I suggested seems to have almost no effect.

    $ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
    100000 loops, best of 3: 17.8 usec per loop
    $ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'
    10000 loops, best of 3: 23 usec per loop
    $ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
    10000 loops, best of 3: 39.3 usec per loop
    

    Apparently this effect is related to the "duplicatedness" of the input data. I set l = [random.randrange(1000000) for i in xrange(10000)] and got these results:

    $ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
    1000 loops, best of 3: 495 usec per loop
    $ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
    1000 loops, best of 3: 499 usec per loop
    $ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'
    1000 loops, best of 3: 1.68 msec per loop
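
    The commands above target Python 2 (test.py uses xrange). A rough, self-contained sketch of how the same comparison could be reproduced on Python 3 with the timeit module; absolute numbers will of course differ by machine:

    # bench.py -- approximate Python 3 reproduction of the benchmark above
    import collections
    import random
    import timeit

    def thg435(l):
        return [x for x, y in collections.Counter(l).items() if y > 1]

    def JohnLaRooy(L):
        seen, seen2 = set(), set()
        for item in L:
            if item in seen:
                seen2.add(item)
            else:
                seen.add(item)
        return list(seen2)

    l = [random.randrange(1000000) for i in range(10000)]

    for func in (JohnLaRooy, thg435):
        secs = timeit.timeit(lambda: func(l), number=1000)
        print(func.__name__, secs / 1000 * 1e6, "usec per call")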
    
  • 2020-11-22 01:18
    >>> l = [1,2,3,4,4,5,5,6,1]
    >>> set([x for x in l if l.count(x) > 1])
    set([1, 4, 5])
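
    Note that l.count(x) re-scans the whole list for every element, so this is O(n²). A Counter-based equivalent (a sketch, not part of the original answer) gives the same values in a single pass; on Python 3.7+ the order follows first appearance:

    >>> from collections import Counter
    >>> l = [1,2,3,4,4,5,5,6,1]
    >>> [x for x, n in Counter(l).items() if n > 1]
    [1, 4, 5]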
    