Question
I have a list of values. I want to count, in a loop, the number of elements in each class (i.e. 1, 2, 3, 4, 5).
mylist = [1,1,1,1,1,1,2,3,2,2,2,2,3,3,4,5,5,5,5]
mydict = dict()
for index in mylist:
    mydict[index] = +1
mydict
Out[344]: {1: 1, 2: 1, 3: 1, 4: 1, 5: 1}
I would like to get this result:
Out[344]: {1: 6, 2: 5, 3: 3, 4: 1, 5: 4}
Answer 1:
For your smaller example, with a limited diversity of elements, you can use a set and a dict comprehension:
>>> mylist = [1,1,1,1,1,1,2,3,2,2,2,2,3,3,4,5,5,5,5]
>>> {k:mylist.count(k) for k in set(mylist)}
{1: 6, 2: 5, 3: 3, 4: 1, 5: 4}
To break it down, set(mylist) removes duplicates from the list and makes it more compact:
>>> set(mylist)
set([1, 2, 3, 4, 5])
Then the dictionary comprehension steps through the unique values and sets the count from the list.
For a small list like this, it is also significantly faster than using Counter and faster than using setdefault:
from __future__ import print_function
from collections import Counter
from collections import defaultdict
import random
mylist=[1,1,1,1,1,1,2,3,2,2,2,2,3,3,4,5,5,5,5]*10
def s1(mylist):
    # set and dict comprehension: one list.count call per distinct value
    return {k:mylist.count(k) for k in set(mylist)}
def s2(mylist):
    # collections.Counter
    return Counter(mylist)
def s3(mylist):
    # plain dict with setdefault
    mydict=dict()
    for index in mylist:
        mydict[index] = mydict.setdefault(index, 0) + 1
    return mydict
def s4(mylist):
    # fromkeys to collect the keys, then list.count for each key
    mydict={}.fromkeys(mylist,0)
    for k in mydict:
        mydict[k]=mylist.count(k)
    return mydict
def s5(mylist):
    # plain dict with dict.get
    mydict={}
    for k in mylist:
        mydict[k]=mydict.get(k,0)+1
    return mydict
def s6(mylist):
    # defaultdict(int): missing keys start at 0
    mydict=defaultdict(int)
    for i in mylist:
        mydict[i] += 1
    return mydict
def s7(mylist):
    # fromkeys to pre-fill zeros, then a single counting loop
    mydict={}.fromkeys(mylist,0)
    for e in mylist:
        mydict[e]+=1
    return mydict
if __name__ == '__main__':
    import timeit
    n=1000000
    print(timeit.timeit("s1(mylist)", setup="from __main__ import s1, mylist",number=n))
    print(timeit.timeit("s2(mylist)", setup="from __main__ import s2, mylist, Counter",number=n))
    print(timeit.timeit("s3(mylist)", setup="from __main__ import s3, mylist",number=n))
    print(timeit.timeit("s4(mylist)", setup="from __main__ import s4, mylist",number=n))
    print(timeit.timeit("s5(mylist)", setup="from __main__ import s5, mylist",number=n))
    print(timeit.timeit("s6(mylist)", setup="from __main__ import s6, mylist, defaultdict",number=n))
    print(timeit.timeit("s7(mylist)", setup="from __main__ import s7, mylist",number=n))
On my machine that prints (Python 3):
18.123854104997008 # set and dict comprehension
78.54796334600542 # Counter
33.98185228800867 # setdefault
19.0563529439969 # fromkeys / count
34.54294775899325 # dict.get
21.134678319009254 # defaultdict
22.760544238000875 # fromkeys / loop
For larger lists, such as 10 million integers with more diverse elements (random ints between 0 and 1,500), use defaultdict or fromkeys in a loop:
from __future__ import print_function
from collections import Counter
from collections import defaultdict
import random
mylist = [random.randint(0,1500) for _ in range(10000000)]
def s1(mylist):
    # set and dict comprehension: one list.count call per distinct value
    return {k:mylist.count(k) for k in set(mylist)}
def s2(mylist):
    # collections.Counter
    return Counter(mylist)
def s3(mylist):
    # plain dict with setdefault
    mydict=dict()
    for index in mylist:
        mydict[index] = mydict.setdefault(index, 0) + 1
    return mydict
def s4(mylist):
    # fromkeys to collect the keys, then list.count for each key
    mydict={}.fromkeys(mylist,0)
    for k in mydict:
        mydict[k]=mylist.count(k)
    return mydict
def s5(mylist):
    # plain dict with dict.get
    mydict={}
    for k in mylist:
        mydict[k]=mydict.get(k,0)+1
    return mydict
def s6(mylist):
    # defaultdict(int): missing keys start at 0
    mydict=defaultdict(int)
    for i in mylist:
        mydict[i] += 1
    return mydict
def s7(mylist):
    # fromkeys to pre-fill zeros, then a single counting loop
    mydict={}.fromkeys(mylist,0)
    for e in mylist:
        mydict[e]+=1
    return mydict
if __name__ == '__main__':
    import timeit
    n=1
    print(timeit.timeit("s1(mylist)", setup="from __main__ import s1, mylist",number=n))
    print(timeit.timeit("s2(mylist)", setup="from __main__ import s2, mylist, Counter",number=n))
    print(timeit.timeit("s3(mylist)", setup="from __main__ import s3, mylist",number=n))
    print(timeit.timeit("s4(mylist)", setup="from __main__ import s4, mylist",number=n))
    print(timeit.timeit("s5(mylist)", setup="from __main__ import s5, mylist",number=n))
    print(timeit.timeit("s6(mylist)", setup="from __main__ import s6, mylist, defaultdict",number=n))
    print(timeit.timeit("s7(mylist)", setup="from __main__ import s7, mylist",number=n))
Prints:
2825.2697427899984 # set and dict comprehension
42.607481333994656 # Counter
22.77713537499949 # setdefault
2853.11187016801 # fromkeys / count
23.241977066005347 # dict.get
15.023175164998975 # defaultdict
18.28165417900891 # fromkeys / loop
You can see that solutions that rely on count, which makes one full pass through the large list for every distinct key, suffer catastrophically in comparison to the other solutions.
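The reason for the gap can be shown without timing anything. The following is a minimal sketch, not part of the benchmark above: the helper names count_with_list_count and count_single_pass and the reduced sample size are assumptions chosen for illustration. Both helpers return the same counts; the first rescans the whole list once per distinct key, while the second scans it once in total.

```python
from collections import defaultdict
import random

def count_with_list_count(values):
    # One full scan of `values` per distinct key:
    # roughly len(values) * len(set(values)) comparisons.
    return {k: values.count(k) for k in set(values)}

def count_single_pass(values):
    # One scan in total, however many distinct keys there are.
    counts = defaultdict(int)
    for v in values:
        counts[v] += 1
    return dict(counts)

random.seed(0)  # fixed seed so the run is repeatable
sample = [random.randint(0, 200) for _ in range(5000)]

# Both approaches agree on the result; only the amount of work differs.
same_result = count_with_list_count(sample) == count_single_pass(sample)
print(same_result)
```

With few distinct keys the repeated scans are cheap because list.count runs in C, which is why s1 wins on the small list; as the number of distinct keys grows, the repeated scans dominate.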
Answer 2:
Try collections.Counter:
>>> from collections import Counter
>>> Counter([1,1,1,1,1,1,2,3,2,2,2,2,3,3,4,5,5,5,5])
Counter({1: 6, 2: 5, 5: 4, 3: 3, 4: 1})
In your code you can basically replace mydict with a Counter and write
mydict[index] += 1
instead of
mydict[index] = +1
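Put together, the loop from the question with that one-line change might look like the sketch below. It reuses the question's variable names; the only assumption is that mydict starts as an empty Counter.

```python
from collections import Counter

mylist = [1,1,1,1,1,1,2,3,2,2,2,2,3,3,4,5,5,5,5]
mydict = Counter()
for index in mylist:
    # A Counter returns 0 for a missing key, so += works on first sight of a key.
    mydict[index] += 1

print(dict(mydict))  # on Python 3.7+: {1: 6, 2: 5, 3: 3, 4: 1, 5: 4}
```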
Answer 3:
A variation on the setdefault approach is collections.defaultdict. This is a bit faster.
from collections import defaultdict
def foo(mylist):
    # defaultdict(int) creates missing keys with a value of 0
    d=defaultdict(int)
    for i in mylist:
        d[i] += 1
    return d
itertools.groupby provides another option. Its speed is about the same as Counter (at least on 2.7):
{x[0]:len(list(x[1])) for x in itertools.groupby(sorted(mylist))}
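As a runnable sketch of the same idea, with the import it needs: the variable name counts is an assumption, and sum(1 for _ in group) is swapped in for len(list(...)) so that no temporary list is built per group.

```python
import itertools
from collections import Counter

mylist = [1,1,1,1,1,1,2,3,2,2,2,2,3,3,4,5,5,5,5]

# groupby only merges adjacent equal items, so the list has to be sorted first.
counts = {key: sum(1 for _ in group)
          for key, group in itertools.groupby(sorted(mylist))}

print(counts)
```

The sort adds an O(n log n) step that the single-pass dictionary approaches do not need.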
However, timings on this small test list might not carry over to the 32 GB of data that the OP mentions in a comment.
I ran several of these options on the word-count case in "python top N word count, why multiprocess slower then single process".
There the OP used Counter and was trying to speed things up with multiprocessing. With a 1.2 MB text file, the counter using defaultdict was fast, taking 0.2 sec. Sorting the output to get the top 40 words took as long as the counting itself.
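For reference, a count followed by a top-N selection could be sketched as follows. This is not the code from that question: the function name top_words, the whitespace split, and the use of heapq.nlargest in place of a full sort are all assumptions made for illustration.

```python
from collections import defaultdict
import heapq

def top_words(text, n):
    # Count words in a single pass with defaultdict.
    counts = defaultdict(int)
    for word in text.split():
        counts[word] += 1
    # nlargest returns the n highest-count pairs in descending order
    # without fully sorting every entry.
    return heapq.nlargest(n, counts.items(), key=lambda item: item[1])

sample = "the cat and the dog and the bird"
top = top_words(sample, 2)
print(top)  # [('the', 3), ('and', 2)]
```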
Counter was a bit slower on 3.2, and much slower on 2.7. That's because 3.2 uses a compiled version (.so file).
But the counter using mylist.count ground to a standstill when processing a large list, taking almost 200 sec. It has to search that large list many times: once to collect the keys, and then once for each key when it counts.
Answer 4:
To fix the code, the line:
mydict[index] = +1
should be:
mydict[index] = mydict.setdefault(index, 0) + 1
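To see why the original line fails and the replacement works, here is a small sketch using the question's data. The names broken and fixed are assumptions introduced only to show the two variants side by side.

```python
mylist = [1,1,1,1,1,1,2,3,2,2,2,2,3,3,4,5,5,5,5]

broken = dict()
for index in mylist:
    # "= +1" is plain assignment of positive one (unary plus), not an increment.
    broken[index] = +1

fixed = dict()
for index in mylist:
    # setdefault stores 0 the first time a key is seen and returns the stored value.
    fixed[index] = fixed.setdefault(index, 0) + 1

print(broken)  # every value is 1
print(fixed)   # the per-value counts
```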
Answer 5:
Your code assigns 1 as the value for each key, because = +1 is an assignment of positive one rather than the += increment. Replace mydict[index] = +1 with mydict[index] = mylist.count(index).
This should work:
mylist = [1,1,1,1,1,1,2,3,2,2,2,2,3,3,4,5,5,5,5]
mydict = dict()
for index in mylist:
    mydict[index] = mylist.count(index)
mydict
Source: https://stackoverflow.com/questions/18343472/efficient-way-to-count-the-element-in-a-dictionary-in-python-using-a-loop