How do I find the duplicates in a list and create another list with them?

梦谈多话 2020-11-22 00:56

How can I find the duplicates in a Python list and create another list of the duplicates? The list only contains integers.

30 Answers
  • 2020-11-22 01:24

    When using toolz:

>>> from toolz import frequencies, valfilter
>>> a = [1, 2, 2, 3, 4, 5, 4]
>>> list(valfilter(lambda count: count > 1, frequencies(a)).keys())
[2, 4]
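For comparison, the same filter-by-count pattern needs only the standard library if you'd rather avoid the toolz dependency; a sketch using collections.Counter:

```python
from collections import Counter

a = [1, 2, 2, 3, 4, 5, 4]
# Counter(a) is the stdlib equivalent of toolz's frequencies(a);
# the comprehension plays the role of valfilter
print([item for item, count in Counter(a).items() if count > 1])  # [2, 4]
```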
    
  • 2020-11-22 01:25

    Using pandas:

    >>> import pandas as pd
    >>> a = [1, 2, 1, 3, 3, 3, 0]
    >>> pd.Series(a)[pd.Series(a).duplicated()].values
    array([1, 3, 3])
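By default Series.duplicated() marks only the second and later occurrences, which is why 3 appears twice above. If you want every occurrence of a duplicated value, or just one entry per value, the keep parameter and .unique() cover both cases (a sketch):

```python
import pandas as pd

a = [1, 2, 1, 3, 3, 3, 0]
s = pd.Series(a)

# keep=False marks *every* occurrence of a duplicated value
print(s[s.duplicated(keep=False)].values)  # [1 1 3 3 3]

# .unique() collapses the duplicates to one entry each
print(s[s.duplicated()].unique())          # [1 3]
```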
    
  • 2020-11-22 01:26

Here's a neat and concise solution, but note that it destroys the original list:

for x in set(li):
    li.remove(x)       # remove one occurrence of each distinct value

li = list(set(li))     # what remains are the duplicates; dedupe them
    
  • 2020-11-22 01:26

Use the list.count() method to find the duplicate elements of a given list:

    arr=[]
    dup =[]
    for i in range(int(input("Enter range of list: "))):
        arr.append(int(input("Enter Element in a list: ")))
    for i in arr:
        if arr.count(i)>1 and i not in dup:
            dup.append(i)
    print(dup)
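Note that arr.count(i) rescans the whole list on every iteration, making this approach O(n²) overall. A single-pass tally with collections.Counter gives the same result in O(n); a sketch with a fixed example list in place of the interactive input:

```python
from collections import Counter

arr = [1, 2, 2, 3, 4, 5, 4]
counts = Counter(arr)                      # one pass over the list
dup = [x for x in counts if counts[x] > 1]
print(dup)  # [2, 4]
```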
    
  • 2020-11-22 01:27

    To remove duplicates use set(a). To print duplicates, something like:

    a = [1,2,3,2,1,5,6,5,5,5]
    
    import collections
    print([item for item, count in collections.Counter(a).items() if count > 1])
    
    ## [1, 2, 5]
    

    Note that Counter is not particularly efficient (timings) and probably overkill here. set will perform better. This code computes a list of unique elements in the source order:

    seen = set()
    uniq = []
    for x in a:
        if x not in seen:
            uniq.append(x)
            seen.add(x)
    

    or, more concisely:

    seen = set()
    uniq = [x for x in a if x not in seen and not seen.add(x)]    
    

    I don't recommend the latter style, because it is not obvious what not seen.add(x) is doing (the set add() method always returns None, hence the need for not).

    To compute the list of duplicated elements without libraries:

    seen = {}
    dupes = []
    
    for x in a:
        if x not in seen:
            seen[x] = 1
        else:
            if seen[x] == 1:
                dupes.append(x)
            seen[x] += 1
    

    If list elements are not hashable, you cannot use sets/dicts and have to resort to a quadratic time solution (compare each with each). For example:

    a = [[1], [2], [3], [1], [5], [3]]
    
no_dupes = [x for n, x in enumerate(a) if x not in a[:n]]
print(no_dupes)  # [[1], [2], [3], [5]]

dupes = [x for n, x in enumerate(a) if x in a[:n]]
print(dupes)  # [[1], [3]]
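If the unhashable elements are at least sortable (lists of numbers are), sorting first brings the quadratic scan down to O(n log n); a sketch using itertools.groupby:

```python
from itertools import groupby

a = [[1], [2], [3], [1], [5], [3]]
# sorting puts equal elements next to each other, so groupby can
# collect each run; runs longer than 1 are the duplicated values
dupes = [k for k, g in groupby(sorted(a)) if len(list(g)) > 1]
print(dupes)  # [[1], [3]]
```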
    
  • 2020-11-22 01:27

    You can use iteration_utilities.duplicates:

    >>> from iteration_utilities import duplicates
    
    >>> list(duplicates([1,1,2,1,2,3,4,2]))
    [1, 1, 2, 2]
    

    or if you only want one of each duplicate this can be combined with iteration_utilities.unique_everseen:

    >>> from iteration_utilities import unique_everseen
    
    >>> list(unique_everseen(duplicates([1,1,2,1,2,3,4,2])))
    [1, 2]
    

    It can also handle unhashable elements (however at the cost of performance):

    >>> list(duplicates([[1], [2], [1], [3], [1]]))
    [[1], [1]]
    
    >>> list(unique_everseen(duplicates([[1], [2], [1], [3], [1]])))
    [[1]]
    

    That's something that only a few of the other approaches here can handle.

    Benchmarks

    I did a quick benchmark containing most (but not all) of the approaches mentioned here.

    The first benchmark included only a small range of list-lengths because some approaches have O(n**2) behavior.

    In the graphs the y-axis represents the time, so a lower value means better. The plots are also log-log so the wide range of values can be visualized better:

    [benchmark plot omitted]

    Removing the O(n**2) approaches, I did another benchmark up to half a million elements in a list:

    [benchmark plot omitted]

    As you can see, the iteration_utilities.duplicates approach is faster than any of the other approaches, and even chaining unique_everseen(duplicates(...)) was faster than or as fast as the other approaches.

    One additional interesting thing to note here is that the pandas approaches are very slow for small lists but can easily compete for longer lists.

    However, as these benchmarks show, most of the approaches perform roughly equally, so it doesn't matter much which one is used (except for the 3 that had O(n**2) runtime).

    from iteration_utilities import duplicates, unique_everseen
    from collections import Counter
    import pandas as pd
    import itertools
    
    def georg_counter(it):
        return [item for item, count in Counter(it).items() if count > 1]
    
    def georg_set(it):
        seen = set()
        uniq = []
        for x in it:
            if x not in seen:
                uniq.append(x)
                seen.add(x)
        return uniq
    
    def georg_set2(it):
        seen = set()
        return [x for x in it if x not in seen and not seen.add(x)]   
    
    def georg_set3(it):
        seen = {}
        dupes = []

        for x in it:
            if x not in seen:
                seen[x] = 1
            else:
                if seen[x] == 1:
                    dupes.append(x)
                seen[x] += 1
        return dupes
    
    def RiteshKumar_count(l):
        return set([x for x in l if l.count(x) > 1])
    
    def moooeeeep(seq):
        seen = set()
        seen_add = seen.add
        # adds all elements it doesn't know yet to seen and all other to seen_twice
        seen_twice = set( x for x in seq if x in seen or seen_add(x) )
        # turn the set into a list (as requested)
        return list( seen_twice )
    
    def F1Rumors_implementation(c):
        a, b = itertools.tee(sorted(c))
        next(b, None)
        r = None
        for k, g in zip(a, b):
            if k != g: continue
            if k != r:
                yield k
                r = k
    
    def F1Rumors(c):
        return list(F1Rumors_implementation(c))
    
    def Edward(a):
        d = {}
        for elem in a:
            if elem in d:
                d[elem] += 1
            else:
                d[elem] = 1
        return [x for x, y in d.items() if y > 1]
    
    def wordsmith(a):
        return pd.Series(a)[pd.Series(a).duplicated()].values
    
    def NikhilPrabhu(li):
        li = li.copy()
        for x in set(li):
            li.remove(x)
    
        return list(set(li))
    
    def firelynx(a):
        vc = pd.Series(a).value_counts()
        return vc[vc > 1].index.tolist()
    
    def HenryDev(myList):
        newList = set()
    
        for i in myList:
            if myList.count(i) >= 2:
                newList.add(i)
    
        return list(newList)
    
    def yota(number_lst):
        seen_set = set()
        duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x))
        return seen_set - duplicate_set
    
    def IgorVishnevskiy(l):
        s=set(l)
        d=[]
        for x in l:
            if x in s:
                s.remove(x)
            else:
                d.append(x)
        return d
    
    def it_duplicates(l):
        return list(duplicates(l))
    
    def it_unique_duplicates(l):
        return list(unique_everseen(duplicates(l)))
    

    Benchmark 1

    from simple_benchmark import benchmark
    import random
    
    funcs = [
        georg_counter, georg_set, georg_set2, georg_set3, RiteshKumar_count, moooeeeep, 
        F1Rumors, Edward, wordsmith, NikhilPrabhu, firelynx,
        HenryDev, yota, IgorVishnevskiy, it_duplicates, it_unique_duplicates
    ]
    
    args = {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(2, 12)}
    
    b = benchmark(funcs, args, 'list size')
    
    b.plot()
    

    Benchmark 2

    funcs = [
        georg_counter, georg_set, georg_set2, georg_set3, moooeeeep, 
        F1Rumors, Edward, wordsmith, firelynx,
        yota, IgorVishnevskiy, it_duplicates, it_unique_duplicates
    ]
    
    args = {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(2, 20)}
    
    b = benchmark(funcs, args, 'list size')
    b.plot()
    

    Disclaimer

    1 This is from a third-party library I have written: iteration_utilities.
