How do I find the duplicates in a list and create another list with them?

梦谈多话 2020-11-22 00:56

How can I find the duplicates in a Python list and create another list of the duplicates? The list only contains integers.

  • 2020-11-22 01:05

    I am joining this discussion very late, but I would still like to tackle the problem with one-liners, because that is the charm of Python. If we just want to collect the duplicates into a separate list (or any other collection), I suggest the following. Say we have a list containing duplicates, which we call 'target'.


    Now, if we want to get the duplicates, we can use the one-liner below:

        duplicates=dict(set((x,target.count(x)) for x in filter(lambda rec : target.count(rec)>1,target)))

    This code puts each duplicated record into the dictionary 'duplicates' as a key, with its count as the value. The 'duplicates' dictionary will look like this:

        {3: 3, 4: 4} # 3 is repeated 3 times and 4 is repeated 4 times

    If you just want every record that has duplicates in a list, the code is shorter still:

        duplicates=filter(lambda rec : target.count(rec)>1,target)

    Output will be:

        [3, 4, 4, 4, 3, 4, 3]

    This works as shown in Python 2.7.x. In Python 3, filter returns an iterator, so wrap the call in list() to get the list above.
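
    Each target.count(x) call rescans the whole list, so the one-liners above are quadratic in the list length. A minimal Python 3 sketch of the same two results using collections.Counter follows. The values in target are hypothetical (the original list is not shown) and were chosen only to be consistent with the outputs quoted above.

```python
from collections import Counter

# Hypothetical input; the original answer's list is not shown.
target = [1, 2, 3, 4, 4, 4, 3, 5, 6, 8, 4, 3]

counts = Counter(target)  # one pass: element -> number of occurrences

# Duplicated values mapped to how often they occur.
duplicates = {x: n for x, n in counts.items() if n > 1}
print(duplicates)  # {3: 3, 4: 4}

# Every occurrence of a duplicated value, in original order.
all_dupes = [x for x in target if counts[x] > 1]
print(all_dupes)  # [3, 4, 4, 4, 3, 4, 3]
```

    Counting once and then filtering keeps the work roughly linear, which matters once the list grows beyond a few thousand items.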

  • 2020-11-22 01:06

    One-line solution, where lst is the input list (named lst so that it does not shadow the built-in list):

    set([i for i in lst if sum([1 for a in lst if a == i]) > 1])
  • 2020-11-22 01:07

    The third example in the accepted answer gives an erroneous result and does not attempt to return duplicates. Here is the correct version:

    number_lst = [1, 1, 2, 3, 5, ...]
    seen_set = set()
    duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x))
    unique_set = seen_set - duplicate_set
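
    The comprehension works because set.add returns None. A worked sketch with a concrete list may make this clearer; the input values here are hypothetical, since the original list is elided.

```python
# Hypothetical input chosen for illustration.
number_lst = [1, 1, 2, 3, 5, 3, 1]

seen_set = set()
# `x in seen_set` is True from the second occurrence onward, so x is kept.
# On the first occurrence it is False, so `seen_set.add(x)` runs; add()
# returns None (falsy), so x is recorded as seen but not selected.
duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x))
unique_set = seen_set - duplicate_set  # values that occur exactly once

print(sorted(duplicate_set))  # [1, 3]
print(sorted(unique_set))     # [2, 5]
```

    Note that unique_set here means "values that appear exactly once", not "all distinct values"; the latter is seen_set itself.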
  • 2020-11-22 01:08

    I came across this question while looking into something related, and wonder why no one offered a generator-based solution. With a generator, solving the problem looks like this:

    >>> print list(getDupes_9([1,2,3,2,1,5,6,5,5,5]))
    [1, 2, 5]

    I was concerned with scalability, so I tested several approaches, including naive ones that work well on small lists but scale horribly as the lists get larger (note: it would have been better to use timeit, but this is illustrative).

    I included @moooeeeep's approach for comparison (it is impressively fast: fastest if the input list is completely random) and an itertools approach that is faster again for mostly sorted lists. The comparison now also includes the pandas approach from @firelynx, which is slow, but not horribly so, and simple. Note: the sort/tee/izip approach is consistently fastest on my machine for large, mostly ordered lists, and moooeeeep's is fastest for shuffled lists, but your mileage may vary.


    • The same code gives a very quick, simple test for 'any' duplicates


    • Duplicates should be reported once only
    • Duplicate order does not need to be preserved
    • Duplicate might be anywhere in the list

    Fastest solution, 1m entries:

    import itertools

    def getDupes(c):
        # Python 2 code: izip pairs each sorted element with its successor
        a, b = itertools.tee(sorted(c))
        next(b, None)
        r = None
        for k, g in itertools.izip(a, b):
            if k != g: continue
            if k != r:
                yield k
                r = k
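
    The code above relies on itertools.izip, which exists only in Python 2. A possible Python 3 port is sketched below. The name get_dupes and the sentinel object are my own choices for illustration, not part of the original answer.

```python
import itertools

def get_dupes(c):
    """Yield each duplicated value once, in sorted order (Python 3 sketch)."""
    a, b = itertools.tee(sorted(c))
    next(b, None)      # advance b so zip pairs each element with its successor
    last = object()    # sentinel that compares unequal to every real element
    for k, g in zip(a, b):
        if k != g:
            continue   # neighbours differ: not a duplicate pair
        if k != last:  # report each duplicated value only once
            yield k
            last = k

print(list(get_dupes([1, 2, 3, 2, 1, 5, 6, 5, 5, 5])))  # [1, 2, 5]
```

    In Python 3 the built-in zip is already lazy, so it plays the role izip played in Python 2.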

    Approaches tested

    import itertools
    import time
    import random
    def getDupes_1(c):
        for i in xrange(0, len(c)):
            if c[i] in c[:i]:
                yield c[i]
    def getDupes_2(c):
        '''set len change'''
        s = set()
        for i in c:
            l = len(s)
            s.add(i)
            if len(s) == l:
                yield i
    def getDupes_3(c):
        '''in dict'''
        d = {}
        for i in c:
            if i in d:
                if d[i]:
                    yield i
                    d[i] = False
            else:
                d[i] = True
    def getDupes_4(c):
        '''in set'''
        s,r = set(),set()
        for i in c:
            if i not in s:
                s.add(i)
            elif i not in r:
                r.add(i)
                yield i
    def getDupes_5(c):
        '''sort/adjacent'''
        c = sorted(c)
        r = None
        for i in xrange(1, len(c)):
            if c[i] == c[i - 1]:
                if c[i] != r:
                    yield c[i]
                    r = c[i]
    def getDupes_6(c):
        '''sort/groupby'''
        def multiple(x):
            # True when the group iterator yields at least two items
            try:
                next(x)
                next(x)
                return True
            except StopIteration:
                return False
        for k, g in itertools.ifilter(lambda x: multiple(x[1]), itertools.groupby(sorted(c))):
            yield k
    def getDupes_7(c):
        c = sorted(c)
        r = None
        for k, g in zip(c[:-1],c[1:]):
            if k == g:
                if k != r:
                    yield k
                    r = k
    def getDupes_8(c):
        '''sort/izip'''
        c = sorted(c)
        r = None
        for k, g in itertools.izip(c[:-1],c[1:]):
            if k == g:
                if k != r:
                    yield k
                    r = k
    def getDupes_9(c):
        '''sort/tee/izip'''
        a, b = itertools.tee(sorted(c))
        next(b, None)
        r = None
        for k, g in itertools.izip(a, b):
            if k != g: continue
            if k != r:
                yield k
                r = k
    def getDupes_a(l):
        '''moooeeeep'''
        seen = set()
        seen_add = seen.add
        # adds each element it doesn't know yet to seen and yields every repeat
        for x in l:
            if x in seen or seen_add(x):
                yield x
    def getDupes_b(x):
        '''iter*/sorted'''
        x = sorted(x)
        def _matches():
            for k,g in itertools.izip(x[:-1],x[1:]):
                if k == g:
                    yield k
        for k, n in itertools.groupby(_matches()):
            yield k
    def getDupes_c(a):
        '''pandas'''
        import pandas as pd
        vc = pd.Series(a).value_counts()
        i = vc[vc > 1].index
        for _ in i:
            yield _
    def hasDupes(fn,c):
        try:
            if fn(c).next(): return True    # Found a dupe
        except StopIteration:
            pass
        return False
    def getDupes(fn,c):
        return list(fn(c))
    STABLE = True
    if STABLE:
        print 'Finding FIRST then ALL duplicates, single dupe of "nth" placed element in 1m element array'
    else:
        print 'Finding FIRST then ALL duplicates, single dupe of "n" included in randomised 1m element array'
    for location in (50,250000,500000,750000,999999):
        for test in (getDupes_2, getDupes_3, getDupes_4, getDupes_5, getDupes_6,
                     getDupes_8, getDupes_9, getDupes_a, getDupes_b, getDupes_c):
            print 'Test %-15s:%10d - '%(test.__doc__ or test.__name__,location),
            deltas = []
            for FIRST in (True,False):
                for i in xrange(0, 5):
                    c = range(0,1000000)
                    if STABLE:
                        c[0] = location
                    else:
                        c.append(location)
                        random.shuffle(c)
                    start = time.time()
                    if FIRST:
                        print '.' if location == test(c).next() else '!',
                    else:
                        print '.' if [location] == list(test(c)) else '!',
                    deltas.append(time.time()-start)
                print ' -- %0.3f  '%(sum(deltas)/len(deltas)),
            print
        print

    The results for the 'all dupes' test were consistent. Timings for finding the "first" duplicate, then "all" duplicates, in this array:

    Finding FIRST then ALL duplicates, single dupe of "nth" placed element in 1m element array
    Test set len change :    500000 -  . . . . .  -- 0.264   . . . . .  -- 0.402  
    Test in dict        :    500000 -  . . . . .  -- 0.163   . . . . .  -- 0.250  
    Test in set         :    500000 -  . . . . .  -- 0.163   . . . . .  -- 0.249  
    Test sort/adjacent  :    500000 -  . . . . .  -- 0.159   . . . . .  -- 0.229  
    Test sort/groupby   :    500000 -  . . . . .  -- 0.860   . . . . .  -- 1.286  
    Test sort/izip      :    500000 -  . . . . .  -- 0.165   . . . . .  -- 0.229  
    Test sort/tee/izip  :    500000 -  . . . . .  -- 0.145   . . . . .  -- 0.206  *
    Test moooeeeep      :    500000 -  . . . . .  -- 0.149   . . . . .  -- 0.232  
    Test iter*/sorted   :    500000 -  . . . . .  -- 0.160   . . . . .  -- 0.221  
    Test pandas         :    500000 -  . . . . .  -- 0.493   . . . . .  -- 0.499  

    When the lists are shuffled first, the price of the sort becomes apparent: the efficiency drops noticeably and the @moooeeeep approach dominates, with the set and dict approaches being similar but lesser performers:

    Finding FIRST then ALL duplicates, single dupe of "n" included in randomised 1m element array
    Test set len change :    500000 -  . . . . .  -- 0.321   . . . . .  -- 0.473  
    Test in dict        :    500000 -  . . . . .  -- 0.285   . . . . .  -- 0.360  
    Test in set         :    500000 -  . . . . .  -- 0.309   . . . . .  -- 0.365  
    Test sort/adjacent  :    500000 -  . . . . .  -- 0.756   . . . . .  -- 0.823  
    Test sort/groupby   :    500000 -  . . . . .  -- 1.459   . . . . .  -- 1.896  
    Test sort/izip      :    500000 -  . . . . .  -- 0.786   . . . . .  -- 0.845  
    Test sort/tee/izip  :    500000 -  . . . . .  -- 0.743   . . . . .  -- 0.804  
    Test moooeeeep      :    500000 -  . . . . .  -- 0.234   . . . . .  -- 0.311  *
    Test iter*/sorted   :    500000 -  . . . . .  -- 0.776   . . . . .  -- 0.840  
    Test pandas         :    500000 -  . . . . .  -- 0.539   . . . . .  -- 0.540  
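
    As noted above, timeit would have been a better tool than time.time() for these measurements. A rough Python 3 sketch of how one approach could be timed that way follows. The function dupes_seen_set is a hypothetical stand-in for any of the generators under test, and the list size is reduced so the example finishes quickly.

```python
import random
import timeit

def dupes_seen_set(values):
    """Hypothetical stand-in for one of the approaches under test."""
    seen, reported = set(), set()
    for x in values:
        if x in seen and x not in reported:
            reported.add(x)
            yield x
        seen.add(x)

data = list(range(100_000))
data.append(500)        # one known duplicate
random.shuffle(data)

# repeat() returns one total per run; the minimum is the least noisy figure.
timings = timeit.repeat(lambda: list(dupes_seen_set(data)), number=3, repeat=5)
print("best of 5: %.4f s per call" % (min(timings) / 3))
```

    Taking the minimum over several repeats reduces the influence of other processes on the machine, which a single time.time() difference cannot do.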
  • 2020-11-22 01:09

    We can use itertools.groupby to find all the items that have duplicates:

    from itertools import groupby
    myList  = [2, 4, 6, 8, 4, 6, 12]
    # when the list is sorted, groupby groups consecutive elements that are equal
    for x, y in groupby(sorted(myList)):
        #  list(y) returns all the occurrences of item x
        if len(list(y)) > 1:
            print x  

    The output will be:

    4
    6
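
    Since the question asks for a list of the duplicates rather than printed output, the same groupby idea can collect them directly. This is a Python 3 sketch; the names my_list and dups are my own.

```python
from itertools import groupby

my_list = [2, 4, 6, 8, 4, 6, 12]

# sum(1 for _ in group) counts a group's items without building a list.
dups = [key for key, group in groupby(sorted(my_list))
        if sum(1 for _ in group) > 1]
print(dups)  # [4, 6]
```

    The result comes out in sorted order because groupby only merges adjacent equal items, which is why the input is sorted first.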

  • 2020-11-22 01:09

    I guess the most effective way to find duplicates in a list is:

    from collections import Counter
    def duplicates(values):
        dups = Counter(values) - Counter(set(values))
        return list(dups.keys())

    It uses Counter once on all the elements, and then once on the unique elements. Subtracting the second from the first leaves only the duplicates, because Counter subtraction drops every entry whose count falls to zero.
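
    A short Python 3 usage sketch of the function above, next to a variant that filters a single Counter. The second function, duplicates_single_pass, is a hypothetical alternative of mine and not part of the answer.

```python
from collections import Counter

def duplicates(values):
    # Counter(values) holds full counts; Counter(set(values)) holds 1 per
    # distinct value. Subtraction keeps only positive results, so only
    # values seen at least twice remain.
    dups = Counter(values) - Counter(set(values))
    return list(dups.keys())

def duplicates_single_pass(values):
    # Hypothetical alternative: build one Counter and filter it directly.
    return [x for x, n in Counter(values).items() if n > 1]

values = [2, 4, 6, 8, 4, 6, 12]
print(sorted(duplicates(values)))              # [4, 6]
print(sorted(duplicates_single_pass(values)))  # [4, 6]
```

    Both versions report each duplicated value once; the single-Counter variant avoids building the intermediate set and second Counter.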
