I have a collections.defaultdict(int) that I'm building to keep count of how many times a key shows up in a set of data. I later want to be able to sort it (obviously by turning it into a list first) in descending order by number of occurrences. Is there a more idiomatic way to do this than the cmp-style lambda below?
from collections import defaultdict
adict = defaultdict(int)
adict['a'] += 1
adict['b'] += 3
adict['c'] += 5
adict['d'] += 2
for key, value in sorted(adict.items(), lambda a, b: cmp(a[1], b[1]), reverse=True):
    print "%r => %r" % (key, value)
>>>
'c' => 5
'b' => 3
'd' => 2
'a' => 1
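As an aside, the cmp-style comparison function passed above is Python 2 only (the cmp argument to sorted() was removed in Python 3); a minimal sketch of the same loop using a key function instead:
for key, value in sorted(adict.items(), key=lambda item: item[1], reverse=True):
    print "%r => %r" % (key, value)   # same output as above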
"Invert" a dictionary.
from collections import defaultdict
inv_dict = defaultdict(list)
for key, value in adict.items():
    inv_dict[value].append(key)
max_value = max(inv_dict.keys())
The set of keys with the maximum occurrence --
inv_dict[max_value]
The set of keys in descending order by occurrence --
for value, key_list in sorted(inv_dict.items(), reverse=True):
    print key_list, value
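For concreteness, tracing this with the adict built in the question (a worked illustration, not part of the original answer):
# adict == {'a': 1, 'b': 3, 'c': 5, 'd': 2}
# after the inversion loop:
# inv_dict == {1: ['a'], 2: ['d'], 3: ['b'], 5: ['c']}
# max_value == 5, so inv_dict[max_value] == ['c']
# and the descending loop prints:
# ['c'] 5
# ['b'] 3
# ['d'] 2
# ['a'] 1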
Note: I'm putting this in as an answer so that it gets seen. I don't want upvotes. If you want to upvote anyone, upvote Nadia.
The currently accepted answer gives timing results which are based on a trivially small dataset (size == 6 - (-5) == 11). The differences in cost of the various methods are masked by the overhead. A use case like what are the most frequent words in a text or most frequent names in a membership list or census involves much larger datasets.
Repeating the experiment with range(-n, n+1) (Windows box, Python 2.6.4, all times in microseconds; the three columns are the get idiom, itemgetter, and lambda, in the same order as the timeit runs quoted further down):
n=5: 11.5, 9.34, 11.3
n=50: 65.5, 46.2, 68.1
n=500: 612, 423, 614
These results are NOT "slightly" different. The itemgetter answer is a clear winner on speed.
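As a rough sketch, a scan like this can be scripted with the timeit module (the exact harness behind the numbers above isn't shown; this mirrors the shell invocations quoted further down, parameterised by n):
import timeit

setup = ("from operator import itemgetter; "
         "adict = dict((x, x**2) for x in range(-%d, %d + 1))")
stmts = [
    ("get idiom",  "[(k, adict[k]) for k in sorted(adict, key=adict.get, reverse=True)]"),
    ("itemgetter", "sorted(adict.iteritems(), key=itemgetter(1), reverse=True)"),
    ("lambda",     "sorted(adict.iteritems(), key=lambda (k, v): v, reverse=True)"),
]
for n in (5, 50, 500):
    for name, stmt in stmts:
        timer = timeit.Timer(stmt, setup % (n, n))
        # best of 3 repeats of 1000 loops each, reported as microseconds per loop
        best = min(timer.repeat(3, 1000)) / 1000.0
        print "n=%d  %-10s %.1f usec" % (n, name, best * 1e6)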
There was also mention of "the simplicity of the get idiom". Putting them side by side for ease of comparison:
[(k, adict[k]) for k in sorted(adict, key=adict.get, reverse=True)]
sorted(adict.iteritems(), key=itemgetter(1), reverse=True)
The get idiom not only looks up the dict twice (as J. F. Sebastian has pointed out), it also builds one list (the result of sorted()) and then iterates over that list to create the result list. I'd call that baroque, not simple. YMMV.
To get the dictionary sorted:
from operator import itemgetter
sorted(adict.iteritems(), key=itemgetter(1), reverse=True)
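Applied to the adict built in the question, for example:
sorted(adict.iteritems(), key=itemgetter(1), reverse=True)
# -> [('c', 5), ('b', 3), ('d', 2), ('a', 1)]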
If you're using the newest Python 2.7 alpha, you can use the Counter class in the collections module:
from collections import Counter
c = Counter()
c['someval'] += 1
c['anotherval'] += 1
c['someval'] += 1
print c.most_common()
prints in the correct order:
[('someval', 2), ('anotherval', 1)]
The 2.7 code is already available, and there's a version adapted to 2.5 as well. Perhaps you want to use it to stay forward compatible with the native stdlib version that's about to be released.
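As a side note, a Counter can also be built directly from any iterable of keys, which collapses the incrementing loop (a small illustration, not from the original answer):
from collections import Counter
data = ['someval', 'anotherval', 'someval']
print Counter(data).most_common()
# [('someval', 2), ('anotherval', 1)]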
A dict's keys, reverse-sorted by the corresponding values, can best be gotten as
sorted(adict, key=adict.get, reverse=True)
Since you want key/value pairs, you could work on the items as all the other answers suggest, or (to use the nifty adict.get bound method instead of itemgetters or weird lambdas ;-)
[(k, adict[k]) for k in sorted(adict, key=adict.get, reverse=True)]
Edit: in terms of performance, there isn't much in it either way:
$ python -mtimeit -s'adict=dict((x,x**2) for x in range(-5,6))' '[(k, adict[k]) for k in sorted(adict, key=adict.get, reverse=True)]'
100000 loops, best of 3: 10.8 usec per loop
$ python -mtimeit -s'adict=dict((x,x**2) for x in range(-5,6)); from operator import itemgetter' 'sorted(adict.iteritems(), key=itemgetter(1), reverse=True)'
100000 loops, best of 3: 9.66 usec per loop
$ python -mtimeit -s'adict=dict((x,x**2) for x in range(-5,6))' 'sorted(adict.iteritems(), key=lambda (k,v): v, reverse=True)'
100000 loops, best of 3: 11.5 usec per loop
So, the .get-based solution is smack midway in performance between the two items-based ones -- slightly slower than the itemgetter version, slightly faster than the lambda. In "bottleneck" cases, where those fractions of a microsecond are crucial to you, by all means focus on that. In normal cases, where this operation is only one step within some bigger task and a microsecond more or less matters little, focusing on the simplicity of the get idiom is, however, also a reasonable alternative.