How do I find the duplicates in a list and create another list with them?

梦谈多话 2020-11-22 00:56

How can I find the duplicates in a Python list and create another list of the duplicates? The list only contains integers.

30 answers
  • 2020-11-22 01:05

    I am entering this discussion rather late, but I would like to tackle this problem with one-liners, because that is the charm of Python. If we just want to get the duplicates into a separate list (or any collection), I would suggest the following. Say we have a list with duplicates, which we can call 'target':

        target=[1,2,3,4,4,4,3,5,6,8,4,3]
    

    Now if we want to get the duplicates, we can use the one-liner below:

        duplicates = dict(set((x, target.count(x)) for x in filter(lambda rec: target.count(rec) > 1, target)))
    

    This code puts each duplicated record as a key and its count as a value into the dictionary 'duplicates', which will look like this:

        {3: 3, 4: 4}  # 3 is repeated 3 times and 4 is repeated 4 times
    

    If you just want all the records that have duplicates in a list, the code is even shorter:

        duplicates = filter(lambda rec: target.count(rec) > 1, target)
    

    Output will be:

        [3, 4, 4, 4, 3, 4, 3]
    

    This works as shown on Python 2.7.x. On Python 3, filter returns a lazy iterator, so wrap the call in list() to get the list shown above. Note also that target.count() makes both one-liners O(n²), which matters on large lists.
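
    A minimal Python 3 sketch of the same idea, using the same illustrative target list:

        target = [1, 2, 3, 4, 4, 4, 3, 5, 6, 8, 4, 3]

        # dict of duplicated value -> number of occurrences
        duplicates = {x: target.count(x) for x in target if target.count(x) > 1}
        print(duplicates)    # {3: 3, 4: 4}

        # flat list of every record that has a duplicate
        dup_records = [x for x in target if target.count(x) > 1]
        print(dup_records)   # [3, 4, 4, 4, 3, 4, 3]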

  • 2020-11-22 01:06

    One-line solution (using lst rather than list as the variable name, since list shadows the built-in):

    set(i for i in lst if sum(1 for a in lst if a == i) > 1)
    
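    For illustration, with a hypothetical input the result is the set of duplicated values:

    lst = [1, 2, 3, 2, 1, 5, 6, 5, 5, 5]
    dupes = set(i for i in lst if sum(1 for a in lst if a == i) > 1)
    print(dupes)  # {1, 2, 5} (set order may vary)
    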
  • 2020-11-22 01:07

    The third example of the accepted answer gives an erroneous result and does not attempt to return the duplicates. Here is a correct version:

    number_lst = [1, 1, 2, 3, 5, ...]
    
    seen_set = set()
    duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x))
    unique_set = seen_set - duplicate_set
    
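    The trick is that set.add returns None: the first time x appears, `x in seen_set` is False, so the `or` clause adds x to seen_set and evaluates falsy; every later appearance of x is caught by `x in seen_set`. A quick illustration with a concrete (made-up) list:

    number_lst = [1, 1, 2, 3, 5, 8, 8, 8]

    seen_set = set()
    duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x))
    unique_set = seen_set - duplicate_set

    print(duplicate_set)  # {8, 1}  (set order may vary)
    print(unique_set)     # {2, 3, 5}
    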
  • 2020-11-22 01:08

    I came across this question while looking into something related, and I wonder why no one offered a generator-based solution. Solving this problem would be:

    >>> print list(getDupes_9([1,2,3,2,1,5,6,5,5,5]))
    [1, 2, 5]
    

    I was concerned with scalability, so I tested several approaches, including naive ones that work well on small lists but scale horribly as lists get larger (note: it would have been better to use timeit, but this is illustrative).

    I included @moooeeeep's approach for comparison (it is impressively fast: fastest if the input list is completely random) and an itertools approach that is faster still for mostly sorted lists. This now includes the pandas approach from @firelynx (slow, but not horribly so, and simple). Note: the sort/tee/zip approach is consistently fastest on my machine for large, mostly ordered lists; moooeeeep's is fastest for shuffled lists; but your mileage may vary.

    Advantages

    • very quick and simple to test for 'any' duplicates using the same code (see the usage note after the fastest solution below)

    Assumptions

    • Duplicates should be reported once only
    • Duplicate order does not need to be preserved
    • Duplicate might be anywhere in the list

    Fastest solution, 1m entries:

    def getDupes(c):
        '''sort/tee/izip'''
        a, b = itertools.tee(sorted(c))
        next(b, None)
        r = None
        for k, g in itertools.izip(a, b):
            if k != g: continue
            if k != r:
                yield k
                r = k
    
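    Note that all the code in this answer targets Python 2 (xrange, itertools.izip, print statements). A minimal Python 3 sketch of the same sort/tee/zip generator (a port of the snippet above, under a hypothetical name):

    import itertools

    def get_dupes(c):
        '''sort/tee/zip (Python 3)'''
        a, b = itertools.tee(sorted(c))
        next(b, None)               # advance b so a and b are offset by one
        r = None
        for k, g in zip(a, b):      # adjacent pairs of the sorted sequence
            if k != g:
                continue            # unequal pair, so not a duplicate
            if k != r:
                yield k             # report each duplicate value once
                r = k

    print(list(get_dupes([1, 2, 3, 2, 1, 5, 6, 5, 5, 5])))  # [1, 2, 5]

    As noted under Advantages, the same generator makes an 'any duplicates?' test cheap, since it stops at the first hit: next(get_dupes(c), None) is not None.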

    Approaches tested

    import itertools
    import time
    import random
    
    def getDupes_1(c):
        '''naive'''
        for i in xrange(0, len(c)):
            if c[i] in c[:i]:
                yield c[i]
    
    def getDupes_2(c):
        '''set len change'''
        s = set()
        for i in c:
            l = len(s)
            s.add(i)
            if len(s) == l:
                yield i
    
    def getDupes_3(c):
        '''in dict'''
        d = {}
        for i in c:
            if i in d:
                if d[i]:
                    yield i
                    d[i] = False
            else:
                d[i] = True
    
    def getDupes_4(c):
        '''in set'''
        s,r = set(),set()
        for i in c:
            if i not in s:
                s.add(i)
            elif i not in r:
                r.add(i)
                yield i
    
    def getDupes_5(c):
        '''sort/adjacent'''
        c = sorted(c)
        r = None
        for i in xrange(1, len(c)):
            if c[i] == c[i - 1]:
                if c[i] != r:
                    yield c[i]
                    r = c[i]
    
    def getDupes_6(c):
        '''sort/groupby'''
        def multiple(x):
            try:
                x.next()
                x.next()
                return True
            except StopIteration:
                return False
        for k, g in itertools.ifilter(lambda x: multiple(x[1]), itertools.groupby(sorted(c))):
            yield k
    
    def getDupes_7(c):
        '''sort/zip'''
        c = sorted(c)
        r = None
        for k, g in zip(c[:-1],c[1:]):
            if k == g:
                if k != r:
                    yield k
                    r = k
    
    def getDupes_8(c):
        '''sort/izip'''
        c = sorted(c)
        r = None
        for k, g in itertools.izip(c[:-1],c[1:]):
            if k == g:
                if k != r:
                    yield k
                    r = k
    
    def getDupes_9(c):
        '''sort/tee/izip'''
        a, b = itertools.tee(sorted(c))
        next(b, None)
        r = None
        for k, g in itertools.izip(a, b):
            if k != g: continue
            if k != r:
                yield k
                r = k
    
    def getDupes_a(l):
        '''moooeeeep'''
        seen = set()
        seen_add = seen.add
        # seen_add(x) returns None, so unseen elements are added to seen;
        # elements already seen are yielded as duplicates
        for x in l:
            if x in seen or seen_add(x):
                yield x
    
    def getDupes_b(x):
        '''iter*/sorted'''
        x = sorted(x)
        def _matches():
            for k,g in itertools.izip(x[:-1],x[1:]):
                if k == g:
                    yield k
        for k, n in itertools.groupby(_matches()):
            yield k
    
    def getDupes_c(a):
        '''pandas'''
        import pandas as pd
        vc = pd.Series(a).value_counts()
        i = vc[vc > 1].index
        for _ in i:
            yield _
    
    def hasDupes(fn,c):
        try:
            if fn(c).next(): return True    # Found a dupe
        except StopIteration:
            pass
        return False
    
    def getDupes(fn,c):
        return list(fn(c))
    
    STABLE = True
    if STABLE:
        print 'Finding FIRST then ALL duplicates, single dupe of "nth" placed element in 1m element array'
    else:
        print 'Finding FIRST then ALL duplicates, single dupe of "n" included in randomised 1m element array'
    for location in (50,250000,500000,750000,999999):
        for test in (getDupes_2, getDupes_3, getDupes_4, getDupes_5, getDupes_6,
                     getDupes_8, getDupes_9, getDupes_a, getDupes_b, getDupes_c):
            print 'Test %-15s:%10d - '%(test.__doc__ or test.__name__,location),
            deltas = []
            for FIRST in (True,False):
                for i in xrange(0, 5):
                    c = range(0,1000000)
                    if STABLE:
                        c[0] = location
                    else:
                        c.append(location)
                        random.shuffle(c)
                    start = time.time()
                    if FIRST:
                        print '.' if location == test(c).next() else '!',
                    else:
                        print '.' if [location] == list(test(c)) else '!',
                    deltas.append(time.time()-start)
                print ' -- %0.3f  '%(sum(deltas)/len(deltas)),
            print
        print
    

    The results for the 'all dupes' test were consistent, finding the "first" duplicate and then "all" duplicates in this array:

    Finding FIRST then ALL duplicates, single dupe of "nth" placed element in 1m element array
    Test set len change :    500000 -  . . . . .  -- 0.264   . . . . .  -- 0.402  
    Test in dict        :    500000 -  . . . . .  -- 0.163   . . . . .  -- 0.250  
    Test in set         :    500000 -  . . . . .  -- 0.163   . . . . .  -- 0.249  
    Test sort/adjacent  :    500000 -  . . . . .  -- 0.159   . . . . .  -- 0.229  
    Test sort/groupby   :    500000 -  . . . . .  -- 0.860   . . . . .  -- 1.286  
    Test sort/izip      :    500000 -  . . . . .  -- 0.165   . . . . .  -- 0.229  
    Test sort/tee/izip  :    500000 -  . . . . .  -- 0.145   . . . . .  -- 0.206  *
    Test moooeeeep      :    500000 -  . . . . .  -- 0.149   . . . . .  -- 0.232  
    Test iter*/sorted   :    500000 -  . . . . .  -- 0.160   . . . . .  -- 0.221  
    Test pandas         :    500000 -  . . . . .  -- 0.493   . . . . .  -- 0.499  
    

    When the lists are shuffled first, the price of the sort becomes apparent: efficiency drops noticeably, the @moooeeeep approach dominates, and the set and dict approaches are similar but lesser performers:

    Finding FIRST then ALL duplicates, single dupe of "n" included in randomised 1m element array
    Test set len change :    500000 -  . . . . .  -- 0.321   . . . . .  -- 0.473  
    Test in dict        :    500000 -  . . . . .  -- 0.285   . . . . .  -- 0.360  
    Test in set         :    500000 -  . . . . .  -- 0.309   . . . . .  -- 0.365  
    Test sort/adjacent  :    500000 -  . . . . .  -- 0.756   . . . . .  -- 0.823  
    Test sort/groupby   :    500000 -  . . . . .  -- 1.459   . . . . .  -- 1.896  
    Test sort/izip      :    500000 -  . . . . .  -- 0.786   . . . . .  -- 0.845  
    Test sort/tee/izip  :    500000 -  . . . . .  -- 0.743   . . . . .  -- 0.804  
    Test moooeeeep      :    500000 -  . . . . .  -- 0.234   . . . . .  -- 0.311  *
    Test iter*/sorted   :    500000 -  . . . . .  -- 0.776   . . . . .  -- 0.840  
    Test pandas         :    500000 -  . . . . .  -- 0.539   . . . . .  -- 0.540  
    
  • 2020-11-22 01:09

    We can use itertools.groupby to find all the items that have duplicates:

    from itertools import groupby
    
    myList = [2, 4, 6, 8, 4, 6, 12]
    # when the list is sorted, groupby groups consecutive equal elements together
    for x, y in groupby(sorted(myList)):
        # list(y) returns all the occurrences of item x
        if len(list(y)) > 1:
            print(x)
    

    The output will be:

    4
    6
    
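    Since the question asks for the duplicates in another list, a sketch of the same groupby idea as a list comprehension:

    from itertools import groupby

    myList = [2, 4, 6, 8, 4, 6, 12]
    dupes = [x for x, y in groupby(sorted(myList)) if len(list(y)) > 1]
    print(dupes)  # [4, 6]
    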
  • 2020-11-22 01:09

    I think one of the most effective ways to find duplicates in a list is:

    from collections import Counter
    
    def duplicates(values):
        dups = Counter(values) - Counter(set(values))
        return list(dups.keys())
    
    print(duplicates([1,2,3,6,5,2]))
    

    It runs Counter once over all the elements and once over the unique elements. Subtracting the second from the first leaves only the duplicates, because Counter subtraction drops any entry whose count falls to zero or below.
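
    A minimal illustration of that subtraction, using the same example values:

    from collections import Counter

    values = [1, 2, 3, 6, 5, 2]
    print(Counter(values))       # Counter({2: 2, 1: 1, 3: 1, 6: 1, 5: 1})
    print(Counter(set(values)))  # each unique element counted once
    # entries whose count drops to 0 are discarded, leaving only duplicates
    print(Counter(values) - Counter(set(values)))  # Counter({2: 1})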
