Python: List vs Dict for look up table

爱一瞬间的悲伤 2020-11-22 10:41

I have about 10 million values that I need to put into some type of lookup table, so I was wondering which would be more efficient: a list or a dict?

I know

8 Answers
  •  难免孤独
    2020-11-22 11:00

    As a new set of tests to show that @EriF89 is still right after all these years:

    $ python -m timeit -s "l={k:k for k in xrange(5000)}"    "[i for i in xrange(10000) if i in l]"
    1000 loops, best of 3: 1.84 msec per loop
    $ python -m timeit -s "l=[k for k in xrange(5000)]"    "[i for i in xrange(10000) if i in l]"
    10 loops, best of 3: 573 msec per loop
    $ python -m timeit -s "l=tuple([k for k in xrange(5000)])"    "[i for i in xrange(10000) if i in l]"
    10 loops, best of 3: 587 msec per loop
    $ python -m timeit -s "l=set([k for k in xrange(5000)])"    "[i for i in xrange(10000) if i in l]"
    1000 loops, best of 3: 1.88 msec per loop
    

    Here we also compare a tuple, which is known to be faster than a list (and to use less memory) in some use cases. For a lookup table, though, the tuple fared no better.
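
    The commands above use Python 2 (note xrange). Below is a minimal sketch of the same comparison on Python 3 with the standard-library timeit module; the variant labels are just for illustration, and the actual timings will of course depend on the machine:

    import timeit

    # Membership test repeated over 10,000 candidate values, as in the
    # shell commands above.
    stmt = "[i for i in range(10000) if i in container]"

    setups = {
        "dict":  "container = {k: k for k in range(5000)}",
        "list":  "container = [k for k in range(5000)]",
        "tuple": "container = tuple(range(5000))",
        "set":   "container = set(range(5000))",
    }

    for name, setup in setups.items():
        # timeit.repeat returns the total time in seconds for `number` runs;
        # take the best repeat, as `python -m timeit` does.
        best = min(timeit.repeat(stmt, setup=setup, number=100, repeat=3))
        print(f"{name:>5}: {best / 100 * 1000:.2f} ms per loop")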

    In the timings above, both the dict and the set performed very well. This brings up an interesting point tying into @SilentGhost's answer about uniqueness: if the OP has 10M values in a data set, and it's unknown whether there are duplicates among them, then it would be worth keeping a set/dict of its elements in parallel with the actual data set, and testing for existence in that set/dict. It's possible the 10M data points contain only 10 unique values, which is a much smaller space to search!
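
    A minimal sketch of that idea (the names and sample data here are illustrative, not from the question):

    data = [7, 3, 7, 1, 3, 7]   # stand-in for the ~10M values, possibly with duplicates
    unique = set(data)          # built once; may be far smaller than the data itself

    def was_observed(value):
        # O(1) average-case membership test instead of an O(n) scan of data
        return value in unique

    print(was_observed(3))      # True
    print(was_observed(42))     # False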

    SilentGhost's mistake about dicts is actually illuminating, because one could use a dict to correlate duplicated data (in the values) with a non-duplicated set (the keys), and thus keep a single data object that holds all the data yet is still fast as a lookup table. For example, a dict key could be the value being looked up, and the value could be a list of indices in an imaginary list where that value occurred.

    For example, if the source data list to be searched was l=[1,2,3,1,2,1,4], it could be optimized for both searching and memory by replacing it with this dict:

    >>> from collections import defaultdict
    >>> d = defaultdict(list)
    >>> l=[1,2,3,1,2,1,4]
    >>> for i, e in enumerate(l):
    ...     d[e].append(i)
    ...
    >>> d
    defaultdict(<type 'list'>, {1: [0, 3, 5], 2: [1, 4], 3: [2], 4: [6]})
    

    With this dict, one can know:

    1. Whether a value was in the original dataset (i.e. 2 in d returns True)
    2. Where the value was in the original dataset (i.e. d[2] returns the list of indices where it was found in the original data list: [1, 4])
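
    For instance, continuing the interpreter session above:

    >>> 2 in d
    True
    >>> d[2]
    [1, 4]
    >>> 9 in d
    False

    One caveat: with a defaultdict, test membership with in rather than by indexing, because merely evaluating d[9] would silently insert an empty list for the missing key.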
