nested dictionaries or tuples for key?

自闭症患者 2020-12-15 04:49

Suppose there is a structure like this:

{'key1' : { 'key2' : { .... { 'keyn' : 'value' } ... } } }

Using Python, I'm trying to determine which is more efficient: a nested dictionary, or a single dictionary keyed by tuples of the component keys.

4 Answers
  • 2020-12-15 04:56

    Without going into details (which are highly implementation-dependent anyway and may be invalidated by the next genius to come along and tweak the dictionary implementation):

    • For memory overhead: Every object has some overhead (e.g. a refcount and a type pointer; an empty object is 8 bytes and an empty tuple is 28 bytes), but a hash table needs to store hash, key, and value for each entry, and usually uses more buckets than currently needed to avoid collisions. A tuple, on the other hand, can't be resized and has no collisions: an N-tuple can simply allocate N pointers to the contained objects and be done. This leads to noticeable differences in memory consumption.
    • For lookup and insertion complexity (the two are identical in this regard): Be it a string or a tuple, collisions are rather unlikely in CPython's dict implementation, and are resolved very efficiently. More keys (because you flatten the key space by combining the keys into tuples) may seem to increase the likelihood of collisions, but more keys also lead to more buckets (AFAIK the current implementation tries to keep the load factor below 2/3), which in turn makes collisions less likely. Also, you don't need more hashing (well, one more function call and some C-level XOR-ing for the tuple hash, but that's negligible) to get to a value.

    You see, there shouldn't be any noticeable difference in performance, though there will be some difference in memory. The latter won't be notable, I think. A one-element dict is 140 bytes, and a ten-element tuple is 140 bytes as well (according to sys.getsizeof on Python 3.2). So even with the (already unrealistic, says my gut feeling) ten-level nesting, you'd have slightly more than one kB of difference - possibly less if the nested dicts have multiple items (it depends on the exact load factor). That could matter for a data-crunching application that holds hundreds of such data structures in memory, but most objects aren't created that frequently.
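The sizes quoted above can be checked directly with sys.getsizeof. A minimal sketch (the exact byte counts are CPython implementation details and vary by version and platform):

```python
import sys

# Exact numbers differ between Python versions; what matters here is
# the relative footprint of a dict entry versus a small tuple.
one_item_dict = {"key1": "value"}
ten_item_tuple = tuple(range(10))

print(sys.getsizeof(one_item_dict))
print(sys.getsizeof(ten_item_tuple))
```

Note that sys.getsizeof is shallow: it does not count the objects referenced by the container, only the container itself.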

    You should simply ask yourself which model is more appropriate for your problem. Consider that the tuple-key approach requires you to have all the keys for a value available at once, while the nested approach lets you get there incrementally.

  • 2020-12-15 05:00

    Performance Testing

    I performed tests for looping, retrieval, and insertion for a nested dictionary and for a dictionary keyed by tuples. They are one level deep, with 2,000,000 values. I also tested retrieval and insertion for the tuple-keyed dict with the tuples already created.

    These are the results. I don't think you can really attach firm conclusions to the standard deviations.


    keydictinsertion: Mean +- std dev: 615 ms +- 42 ms  
    keydictretrieval: Mean +- std dev: 475 ms +- 77 ms  
    keydictlooping: Mean +- std dev: 76.2 ms +- 7.4 ms  
    
    nesteddictinsertion: Mean +- std dev: 200 ms +- 7 ms  
    nesteddictretrieval: Mean +- std dev: 256 ms +- 32 ms  
    nesteddictlooping: Mean +- std dev: 88.5 ms +- 14.4 ms  
    
    Tests where the tuple was already created for the keydict  
    keydictinsertionprepared: Mean +- std dev: 432 ms +- 26 ms  
    keydictretrievalprepared: Mean +- std dev: 267 ms +- 14 ms
    


    As you can see, the nested dict is often faster than the dict with a single tuple key. Even when the keydict was given a tuple directly, without the tuple-creation step, insertion was still much slower. It seems that the additional creation of inner dicts does not cost that much; defaultdict probably has a fast implementation for it. Retrieval was actually almost equal when no tuple had to be created, and the same goes for looping.

    Testing is done with perf from the command line. Scripts are below.

    >>>>>>> nesteddictinsertion
    python -m perf timeit -v -s "
    from collections import defaultdict
    " " 
    d = defaultdict(dict)
    for i in range(2000):
        for j in range(1000):
            d[i][j] = 1
    "
    >>>>>>> nesteddictlooping
    python -m perf timeit -v -s "
    from collections import defaultdict
    d = defaultdict(dict)
    for i in range(2000):
        for j in range(1000):
            d[i][j] = 1
    " "
    for i, inner_dict in d.items():
        for j, val in inner_dict.items():
            i
            j
            val
    "
    >>>>>>> nesteddictretrieval
    python -m perf timeit -v -s "
    from collections import defaultdict
    d = defaultdict(dict)
    for i in range(2000):
        for j in range(1000):
            d[i][j] = 1
    " "
    for i in range(2000):
        for j in range(1000):
            d[i][j]
    "
    >>>>>>> keydictinsertion
    python -m perf timeit -v -s "
    from collections import defaultdict
    " " 
    d = {}
    for i in range(2000):
        for j in range(1000):
            d[i, j] = 1
    "
    >>>>>>> keydictinsertionprepared
    python -m perf timeit -v -s "
    from collections import defaultdict
    keys = [(i, j) for i in range(2000) for j in range(1000)]
    " " 
    d = {}
    for key in keys:
        d[key] = 1
    "
    >>>>>>> keydictlooping
    python -m perf timeit -v -s "
    from collections import defaultdict
    d = {}
    for i in range(2000):
        for j in range(1000):
            d[i, j] = 1
    " "
    for key, val in d.items():
        key
        val
    "
    >>>>>>> keydictretrieval
    python -m perf timeit -v -s "
    from collections import defaultdict
    d = {}
    for i in range(2000):
        for j in range(1000):
            d[i, j] = 1
    " "
    for i in range(2000):
        for j in range(1000):
            d[i, j]
    "
    >>>>>>> keydictretrievalprepared
    python -m perf timeit -v -s "
    from collections import defaultdict
    d = {}
    keys = [(i, j) for i in range(2000) for j in range(1000)]
    for key in keys:
        d[key] = 1
    " "
    for key in keys:
        d[key]
    "
    
  • 2020-12-15 05:09

    If you need to use the whole combination of key1 to keyn to get value, you can flip the dict like I suggested below for O(nk*nv) (number of keys * number of values) or use the tuple method above.

    Assuming you need to build the tuple on insertion and again when you need to get a value, both will be O(nk) where nk is the number of keys.

    The nested dict version might be more space-efficient if your structure is nested fairly deeply (i.e. there are lots of values that share a partial, ordered list of keys), and getting a value will still be O(nk), though probably slower than the tuple version.

    However, insertion will be slower, though I can't quantify its speed. You will have to construct at least one layer of dict for each insertion, and test for the existence of dicts at the previous levels.

    There are many recipes for recursive defaultdicts that would simplify insertion from a coding perspective, but it wouldn't actually speed things up.
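One common version of that recipe is a self-referential defaultdict (a sketch of the general trick, not code from the original answer):

```python
from collections import defaultdict

def tree():
    """A dict that creates nested dicts on demand for missing keys."""
    return defaultdict(tree)

d = tree()
d["key1"]["key2"]["key3"] = "value"  # no explicit existence checks needed
assert d["key1"]["key2"]["key3"] == "value"
```

As the answer notes, this only simplifies the code; each missing level still costs a dict construction on first access.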

    The tuple method is both simpler and faster for insertion, but may take more memory depending on your nesting.


    Original answer from before I knew the requirements

    Why not

    {'key1' : 'value', 'key2' : 'value', .... 'keyn' : 'value' } 
    

    It's just storing a reference to value at each location, not value itself, so the memory usage would be less than the nested dict version, and not much larger than the tuple version, unless you have an extremely large number of values.

    For time complexity of Python standard type operations, see the Python wiki.

    Basically, insertion or getting one item on average is O(1).

    Getting all the keys for a value would on average be O(n):

    [key for key in mydict if mydict[key] == value]
    

    However, if adding keys, or searching for all keys, is the usual operation, your dict is flipped. You want:

    {'value': [key1, key2, ... keyn]}
    

    This way you can add keys with just mydict[value].append(newkey) and get all the keys with just mydict[value] and both will be on average O(1).
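For the append to work the first time a value is seen, the list has to exist already; a defaultdict(list) (or dict.setdefault) takes care of that. A minimal sketch:

```python
from collections import defaultdict

# value -> list of keys; defaultdict creates the list on first access
flipped = defaultdict(list)
flipped["value"].append("key1")
flipped["value"].append("key2")

assert flipped["value"] == ["key1", "key2"]  # average O(1) append and lookup
```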

  • 2020-12-15 05:13

    Memory Consumption Testing

    I've written a small script to test it. It has some limitations, though: the keys are integers distributed linearly (i.e. taken from range(N)). My findings are the following.

    With a 3-level nesting, i.e. dict[a][b][c] vs dict[a,b,c] where each sub index goes from 0 to 99, I find the following:

    With large values (list(x for x in range(100))):

    > memory.py nested 
    Memory usage: 1049.0 MB
    > memory.py flat  
    Memory usage: 1149.7 MB
    

    and with small values ([]):

    > memory.py nested
    Memory usage: 134.1 MB
    > memory.py flat
    Memory usage: 234.8 MB
    

    Open questions

    • Why is this happening?
    • Would this change with different indices, e.g. non-consecutive ones?

    Script

    #!/usr/bin/env python3
    import resource
    import itertools
    import sys
    import copy
    from os.path import basename
    from collections import defaultdict
     
    # constants
    index_levels = [100, 100, 100]
    value_size   = 100 # try values like 0
    
    def memory_usage():
        # note: ru_maxrss is reported in bytes on macOS but in kilobytes on Linux
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    
    _object_mold = list(x for x in range(value_size)) # performance hack
    def create_object():
        return copy.copy(_object_mold)
    
    # automatically create nested dict
    # http://code.activestate.com/recipes/578057-recursive-defaultdict/
    f = lambda: defaultdict(f)
    my_dict = defaultdict(f)
    
    # options & usage
    try:
        dict_mode = sys.argv[1]
        if dict_mode not in ['flat', 'nested']:
            raise ValueError()
    except (IndexError, ValueError):
        print("Usage: {} [nested | flat]".format(basename(sys.argv[0])))
        sys.exit(1)
     
    index_generator = [range(level) for level in index_levels]
    
    if dict_mode == "flat":
        for index in itertools.product(*index_generator):
            my_dict[index] = create_object()
    elif dict_mode == "nested":
        for index in itertools.product(*index_generator):
            sub_dict = my_dict
            for sub_index in index[:-1]:          # iterate into lowest dict
                sub_dict = sub_dict[sub_index]
            sub_dict[index[-1]] = create_object() # finally assign value
    
    print("Memory usage: {:.1f} MB".format(memory_usage() / 1024**2))
    