How can I find the duplicates in a Python list and create another list of the duplicates? The list only contains integers.
Without converting to list and probably the simplest way would be something like below. This may be useful during a interview when they ask not to use sets
a=[1,2,3,3,3]
dup=[]
for each in a:
if each not in dup:
dup.append(each)
print(dup)
======= else to get 2 separate lists of unique values and duplicate values
a=[1,2,3,3,3]
uniques=[]
dups=[]
for each in a:
if each not in uniques:
uniques.append(each)
else:
dups.append(each)
print("Unique values are below:")
print(uniques)
print("Duplicate values are below:")
print(dups)
Here's a fast generator that uses a dict to store each element as a key with a boolean value for checking if the duplicate item has already been yielded.
For lists with all elements that are hashable types:
def gen_dupes(array):
unique = {}
for value in array:
if value in unique and unique[value]:
unique[value] = False
yield value
else:
unique[value] = True
array = [1, 2, 2, 3, 4, 1, 5, 2, 6, 6]
print(list(gen_dupes(array)))
# => [2, 1, 6]
For lists that might contain lists:
def gen_dupes(array):
unique = {}
for value in array:
is_list = False
if type(value) is list:
value = tuple(value)
is_list = True
if value in unique and unique[value]:
unique[value] = False
if is_list:
value = list(value)
yield value
else:
unique[value] = True
array = [1, 2, 2, [1, 2], 3, 4, [1, 2], 5, 2, 6, 6]
print(list(gen_dupes(array)))
# => [2, [1, 2], 6]
def removeduplicates(a):
seen = set()
for i in a:
if i not in seen:
seen.add(i)
return seen
print(removeduplicates([1,1,2,2]))
one-liner, for fun, and where a single statement is required.
(lambda iterable: reduce(lambda (uniq, dup), item: (uniq, dup | {item}) if item in uniq else (uniq | {item}, dup), iterable, (set(), set())))(some_iterable)
You don't need the count, just whether or not the item was seen before. Adapted that answer to this problem:
def list_duplicates(seq):
seen = set()
seen_add = seen.add
# adds all elements it doesn't know yet to seen and all other to seen_twice
seen_twice = set( x for x in seq if x in seen or seen_add(x) )
# turn the set into a list (as requested)
return list( seen_twice )
a = [1,2,3,2,1,5,6,5,5,5]
list_duplicates(a) # yields [1, 2, 5]
Just in case speed matters, here are some timings:
# file: test.py
import collections
def thg435(l):
return [x for x, y in collections.Counter(l).items() if y > 1]
def moooeeeep(l):
seen = set()
seen_add = seen.add
# adds all elements it doesn't know yet to seen and all other to seen_twice
seen_twice = set( x for x in l if x in seen or seen_add(x) )
# turn the set into a list (as requested)
return list( seen_twice )
def RiteshKumar(l):
return list(set([x for x in l if l.count(x) > 1]))
def JohnLaRooy(L):
seen = set()
seen2 = set()
seen_add = seen.add
seen2_add = seen2.add
for item in L:
if item in seen:
seen2_add(item)
else:
seen_add(item)
return list(seen2)
l = [1,2,3,2,1,5,6,5,5,5]*100
Here are the results: (well done @JohnLaRooy!)
$ python -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
10000 loops, best of 3: 74.6 usec per loop
$ python -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
10000 loops, best of 3: 91.3 usec per loop
$ python -mtimeit -s 'import test' 'test.thg435(test.l)'
1000 loops, best of 3: 266 usec per loop
$ python -mtimeit -s 'import test' 'test.RiteshKumar(test.l)'
100 loops, best of 3: 8.35 msec per loop
Interestingly, besides the timings itself, also the ranking slightly changes when pypy is used. Most interestingly, the Counter-based approach benefits hugely from pypy's optimizations, whereas the method caching approach I have suggested seems to have almost no effect.
$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
100000 loops, best of 3: 17.8 usec per loop
$ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'
10000 loops, best of 3: 23 usec per loop
$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
10000 loops, best of 3: 39.3 usec per loop
Apparantly this effect is related to the "duplicatedness" of the input data. I have set l = [random.randrange(1000000) for i in xrange(10000)]
and got these results:
$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
1000 loops, best of 3: 495 usec per loop
$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
1000 loops, best of 3: 499 usec per loop
$ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'
1000 loops, best of 3: 1.68 msec per loop
>>> l = [1,2,3,4,4,5,5,6,1]
>>> set([x for x in l if l.count(x) > 1])
set([1, 4, 5])