Python: List vs Dict for look up table

爱一瞬间的悲伤 2020-11-22 10:41

I have about 10 million values that I need to put into some type of lookup table, so I was wondering which would be more efficient: a list or a dict?

I know

8 Answers
  •  难免孤独
    2020-11-22 11:00

    As a new set of tests to show that @EriF89 is still right after all these years:

    $ python -m timeit -s "l={k:k for k in xrange(5000)}"    "[i for i in xrange(10000) if i in l]"
    1000 loops, best of 3: 1.84 msec per loop
    $ python -m timeit -s "l=[k for k in xrange(5000)]"    "[i for i in xrange(10000) if i in l]"
    10 loops, best of 3: 573 msec per loop
    $ python -m timeit -s "l=tuple([k for k in xrange(5000)])"    "[i for i in xrange(10000) if i in l]"
    10 loops, best of 3: 587 msec per loop
    $ python -m timeit -s "l=set([k for k in xrange(5000)])"    "[i for i in xrange(10000) if i in l]"
    1000 loops, best of 3: 1.88 msec per loop
    

    Here we also compare a tuple, which is known to be faster than a list (and to use less memory) in some use cases. For a lookup table, though, the tuple fared no better.
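
    The commands above use Python 2 (note xrange). Below is a minimal sketch of the same comparison on Python 3 with the standard-library timeit module; the variant labels are just for illustration, and the actual timings will of course depend on the machine:

    import timeit

    # Membership test repeated over 10,000 candidate values, as in the
    # shell commands above.
    stmt = "[i for i in range(10000) if i in container]"

    setups = {
        "dict":  "container = {k: k for k in range(5000)}",
        "list":  "container = [k for k in range(5000)]",
        "tuple": "container = tuple(range(5000))",
        "set":   "container = set(range(5000))",
    }

    for name, setup in setups.items():
        # timeit.repeat returns the total time in seconds for `number` runs;
        # take the best repeat, as `python -m timeit` does.
        best = min(timeit.repeat(stmt, setup=setup, number=100, repeat=3))
        print(f"{name:>5}: {best / 100 * 1000:.2f} ms per loop")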

    In the timings above, both the dict and the set performed very well. This brings up an interesting point tying into @SilentGhost's answer about uniqueness: if the OP has 10M values in a data set, and it's unknown whether there are duplicates among them, then it would be worth keeping a set/dict of its elements in parallel with the actual data set, and testing for existence in that set/dict. It's possible the 10M data points contain only 10 unique values, which is a much smaller space to search!
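
    A minimal sketch of that idea (the names and sample data here are illustrative, not from the question):

    data = [7, 3, 7, 1, 3, 7]   # stand-in for the ~10M values, possibly with duplicates
    unique = set(data)          # built once; may be far smaller than the data itself

    def was_observed(value):
        # O(1) average-case membership test instead of an O(n) scan of data
        return value in unique

    print(was_observed(3))      # True
    print(was_observed(42))     # False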

    SilentGhost's mistake about dicts is actually illuminating, because one could use a dict to correlate duplicated data (in the values) with a non-duplicated set (the keys), and thus keep a single data object that holds all the data yet is still fast as a lookup table. For example, a dict key could be the value being looked up, and the value could be a list of indices in an imaginary list where that value occurred.

    For example, if the source data list to be searched was l=[1,2,3,1,2,1,4], it could be optimized for both searching and memory by replacing it with this dict:

    >>> from collections import defaultdict
    >>> d = defaultdict(list)
    >>> l=[1,2,3,1,2,1,4]
    >>> for i, e in enumerate(l):
    ...     d[e].append(i)
    ...
    >>> d
    defaultdict(<type 'list'>, {1: [0, 3, 5], 2: [1, 4], 3: [2], 4: [6]})
    

    With this dict, one can know:

    1. Whether a value was in the original dataset (i.e. 2 in d returns True)
    2. Where the value was in the original dataset (i.e. d[2] returns the list of indices where it was found in the original data list: [1, 4])
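
    For instance, continuing the interpreter session above:

    >>> 2 in d
    True
    >>> d[2]
    [1, 4]
    >>> 9 in d
    False

    One caveat: with a defaultdict, test membership with in rather than by indexing, because merely evaluating d[9] would silently insert an empty list for the missing key.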
