I have about 10million values that I need to put in some type of look up table, so I was wondering which would be more efficient a list or dict?
I know
As a new set of tests to show @EriF89 is still right after all these years:
$ python -m timeit -s "l={k:k for k in xrange(5000)}" "[i for i in xrange(10000) if i in l]"
1000 loops, best of 3: 1.84 msec per loop
$ python -m timeit -s "l=[k for k in xrange(5000)]" "[i for i in xrange(10000) if i in l]"
10 loops, best of 3: 573 msec per loop
$ python -m timeit -s "l=tuple([k for k in xrange(5000)])" "[i for i in xrange(10000) if i in l]"
10 loops, best of 3: 587 msec per loop
$ python -m timeit -s "l=set([k for k in xrange(5000)])" "[i for i in xrange(10000) if i in l]"
1000 loops, best of 3: 1.88 msec per loop
Here we also compare a tuple
, which are known to be faster than lists
(and use less memory) in some use cases. In the case of lookup table, the tuple
faired no better .
Both the dict
and set
performed very well. This brings up an interesting point tying into @SilentGhost answer about uniqueness: if the OP has 10M values in a data set, and it's unknown if there are duplicates in them, then it would be worth keeping a set/dict of its elements in parallel with the actual data set, and testing for existence in that set/dict. It's possible the 10M data points only have 10 unique values, which is a much smaller space to search!
SilentGhost's mistake about dicts is actually illuminating because one could use a dict to correlate duplicated data (in values) into a nonduplicated set (keys), and thus keep one data object to hold all data, yet still be fast as a lookup table. For example, a dict key could be the value being looked up, and the value could be a list of indices in an imaginary list where that value occurred.
For example, if the source data list to be searched was l=[1,2,3,1,2,1,4]
, it could be optimized for both searching and memory by replacing it with this dict:
>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> l=[1,2,3,1,2,1,4]
>>> for i, e in enumerate(l):
... d[e].append(i)
>>> d
defaultdict(, {1: [0, 3, 5], 2: [1, 4], 3: [2], 4: [6]})
With this dict, one can know:
2 in d
returns True
)d[2]
returns list of indices where data was found in original data list: [1, 4]
)