What is the best way to implement nested dictionaries?

[愿得一人] 2020-11-22 00:29

I have a data structure which essentially amounts to a nested dictionary. Let's say it looks like this:

{'new jersey': {'mercer county': {'plumbers':


        
21 Answers
  • 2020-11-22 00:49

    If the number of nesting levels is small, I use collections.defaultdict for this:

    from collections import defaultdict
    
    def nested_dict_factory(): 
      return defaultdict(int)
    def nested_dict_factory2(): 
      return defaultdict(nested_dict_factory)
    db = defaultdict(nested_dict_factory2)
    
    db['new jersey']['mercer county']['plumbers'] = 3
    db['new jersey']['mercer county']['programmers'] = 81
    

    Using defaultdict like this avoids a lot of messy setdefault(), get(), etc.
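
    The two factory functions above hard-code three levels. As a minimal sketch, assuming you want to pick the depth at construction time, the same idea can be generalized with functools.partial. The helper name nested_defaultdict is made up for this illustration and is not part of the standard library:

```python
from collections import defaultdict
from functools import partial

def nested_defaultdict(depth, leaf=int):
    """Return a defaultdict nested `depth` levels deep; leaves default to leaf()."""
    if depth <= 1:
        return defaultdict(leaf)
    # partial() defers building the next level until a key is first touched
    return defaultdict(partial(nested_defaultdict, depth - 1, leaf))

db = nested_defaultdict(3)
db['new jersey']['mercer county']['plumbers'] = 3
db['new jersey']['mercer county']['programmers'] += 81  # missing leaf starts at 0
print(db['new jersey']['mercer county']['programmers'])
```

    Keep in mind that merely reading a missing key with [] still creates it, exactly as with the hand-written factories.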

  • 2020-11-22 00:49
    class JobDb(object):
        def __init__(self):
            self.data = []      # slot -> (key, value), or None for a freed slot
            self.all = set()    # slots currently in use
            self.free = []      # freed slots available for reuse
            self.index1 = {}    # first key part -> set of slots
            self.index2 = {}    # second key part -> set of slots
            self.index3 = {}    # third key part -> set of slots
    
        def _indices(self, key):
            key1, key2, key3 = key  # Python 3 has no tuple parameters
            indices = self.all.copy()
            wild = False
            for index, part in ((self.index1, key1), (self.index2, key2),
                                (self.index3, key3)):
                if part is not None:
                    indices &= index.get(part, set())
                else:
                    wild = True  # None acts as a wildcard for this position
            return indices, wild
    
        def __getitem__(self, key):
            indices, wild = self._indices(key)
            if wild:
                return dict(self.data[i] for i in indices)
            else:
                values = [self.data[i][-1] for i in indices]
                if values:
                    return values[0]
    
        def __setitem__(self, key, value):
            indices, wild = self._indices(key)
            if indices:
                for i in indices:
                    # keep the stored key so a wildcard update cannot overwrite it
                    self.data[i] = self.data[i][0], value
            elif wild:
                raise KeyError(key)
            else:
                if self.free:
                    index = self.free.pop(0)
                    self.data[index] = key, value
                else:
                    index = len(self.data)
                    self.data.append((key, value))
                self.all.add(index)  # also needed when a freed slot is reused
                self.index1.setdefault(key[0], set()).add(index)
                self.index2.setdefault(key[1], set()).add(index)
                self.index3.setdefault(key[2], set()).add(index)
    
        def __delitem__(self, key):
            indices, wild = self._indices(key)
            if not indices:
                raise KeyError(key)
            for i in indices:
                # use each record's own key so wildcard deletes clean every index
                key1, key2, key3 = self.data[i][0]
                self.index1[key1].discard(i)
                self.index2[key2].discard(i)
                self.index3[key3].discard(i)
                self.data[i] = None
            self.all -= indices
            self.free.extend(indices)
    
        def __len__(self):
            return len(self.all)
    
        def __iter__(self):
            for item in self.data:
                if item is not None:  # skip freed slots
                    yield item[0]
    

    Example:

    >>> db = JobDb()
    >>> db['new jersey', 'mercer county', 'plumbers'] = 3
    >>> db['new jersey', 'mercer county', 'programmers'] = 81
    >>> db['new jersey', 'middlesex county', 'programmers'] = 81
    >>> db['new jersey', 'middlesex county', 'salesmen'] = 62
    >>> db['new york', 'queens county', 'plumbers'] = 9
    >>> db['new york', 'queens county', 'salesmen'] = 36
    
    >>> db['new york', None, None]
    {('new york', 'queens county', 'plumbers'): 9,
     ('new york', 'queens county', 'salesmen'): 36}
    
    >>> db[None, None, 'plumbers']
    {('new jersey', 'mercer county', 'plumbers'): 3,
     ('new york', 'queens county', 'plumbers'): 9}
    
    >>> db['new jersey', 'mercer county', None]
    {('new jersey', 'mercer county', 'plumbers'): 3,
     ('new jersey', 'mercer county', 'programmers'): 81}
    
    >>> db['new jersey', 'middlesex county', 'programmers']
    81
    
    >>>
    

    Edit: Now returning dictionaries when querying with wild cards (None), and single values otherwise.
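
    For small data sets, the same wildcard queries can be had without any indices by scanning a flat dict keyed by tuples. This is only a sketch of that trade-off, not a replacement for the class above: every query is a full scan, and the function name query is invented for the example.

```python
def query(data, pattern):
    """Return the entries of data whose tuple key matches pattern (None = wildcard)."""
    return {
        key: value
        for key, value in data.items()
        if len(key) == len(pattern)
        and all(want is None or want == part for want, part in zip(pattern, key))
    }

jobs = {
    ('new jersey', 'mercer county', 'plumbers'): 3,
    ('new jersey', 'mercer county', 'programmers'): 81,
    ('new york', 'queens county', 'plumbers'): 9,
}
plumbers = query(jobs, (None, None, 'plumbers'))
print(plumbers)
```

    The indexed class pays for itself once the data is large enough that scanning every key per query becomes noticeable.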

  • 2020-11-22 00:49

    I have a similar thing going. I have a lot of cases where I do:

    thedict = {}
    for item in ('foo', 'bar', 'baz'):
      mydict = thedict.get(item, {})
      mydict.update(get_value_for(item))  # get_value_for: placeholder, assumed to return a dict
      thedict[item] = mydict
    

    But going many levels deep. It's the ".get(item, {})" that's the key as it'll make another dictionary if there isn't one already. Meanwhile, I've been thinking of ways to deal with this better. Right now, there's a lot of

    value = mydict.get('foo', {}).get('bar', {}).get('baz', 0)
    

    So instead, I made:

    def dictgetter(thedict, default, *args):
      totalargs = len(args)
      for i,arg in enumerate(args):
        if i+1 == totalargs:
          thedict = thedict.get(arg, default)
        else:
          thedict = thedict.get(arg, {})
      return thedict
    

    Which has the same effect if you do:

    value = dictgetter(mydict, 0, 'foo', 'bar', 'baz')
    

    Better? I think so.
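
    A possible variant, sketched here under the assumption that you would rather get the default back than an AttributeError when an intermediate value turns out not to be a dict. The name get_nested is made up for this example:

```python
def get_nested(mapping, default, *keys):
    """Walk mapping along keys; return default as soon as the path breaks."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

mydict = {'foo': {'bar': {'baz': 7}}}
found = get_nested(mydict, 0, 'foo', 'bar', 'baz')
missing = get_nested(mydict, 0, 'foo', 'nope', 'baz')
too_deep = get_nested(mydict, 0, 'foo', 'bar', 'baz', 'deeper')
print(found, missing, too_deep)  # 7 0 0
```

    Unlike chained .get(..., {}) calls, this never builds throwaway dicts along the way.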

  • 2020-11-22 00:50

    What is the best way to implement nested dictionaries in Python?

    This is a bad idea, don't do it. Instead, use a regular dictionary and use dict.setdefault where apropos, so when keys are missing under normal usage you get the expected KeyError. If you insist on getting this behavior, here's how to shoot yourself in the foot:

    Implement __missing__ on a dict subclass to set and return a new instance.

    This approach has been available (and documented) since Python 2.5, and (particularly valuable to me) it pretty prints just like a normal dict, instead of the ugly printing of an autovivified defaultdict:

    class Vividict(dict):
        def __missing__(self, key):
            value = self[key] = type(self)() # retain local pointer to value
            return value                     # faster to return than dict lookup
    

    (Note self[key] is on the left-hand side of assignment, so there's no recursion here.)

    and say you have some data:

    data = {('new jersey', 'mercer county', 'plumbers'): 3,
            ('new jersey', 'mercer county', 'programmers'): 81,
            ('new jersey', 'middlesex county', 'programmers'): 81,
            ('new jersey', 'middlesex county', 'salesmen'): 62,
            ('new york', 'queens county', 'plumbers'): 9,
            ('new york', 'queens county', 'salesmen'): 36}
    

    Here's our usage code:

    vividict = Vividict()
    for (state, county, occupation), number in data.items():
        vividict[state][county][occupation] = number
    

    And now:

    >>> import pprint
    >>> pprint.pprint(vividict, width=40)
    {'new jersey': {'mercer county': {'plumbers': 3,
                                      'programmers': 81},
                    'middlesex county': {'programmers': 81,
                                         'salesmen': 62}},
     'new york': {'queens county': {'plumbers': 9,
                                    'salesmen': 36}}}
    

    Criticism

    A criticism of this type of container is that if the user misspells a key, our code could fail silently:

    >>> vividict['new york']['queens counyt']
    {}
    

    And additionally now we'd have a misspelled county in our data:

    >>> pprint.pprint(vividict, width=40)
    {'new jersey': {'mercer county': {'plumbers': 3,
                                      'programmers': 81},
                    'middlesex county': {'programmers': 81,
                                         'salesmen': 62}},
     'new york': {'queens county': {'plumbers': 9,
                                    'salesmen': 36},
                  'queens counyt': {}}}
    

    Explanation:

    We're just providing another nested instance of our class Vividict whenever a key is accessed but missing. (Binding the new instance to a local name is useful because it lets us return it without a second lookup on the dict; unfortunately, assignment is a statement in Python, so we can't return the value in the same expression that sets it.)
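
    One detail worth knowing when working with such a class: __missing__ is only called for subscript access (d[key]). Membership tests and dict.get bypass it, so they can be used to probe for a key without creating it. A small sketch, with variable names chosen just for this illustration:

```python
class Vividict(dict):
    def __missing__(self, key):
        value = self[key] = type(self)()
        return value

d = Vividict()
d['new york']['queens county']['plumbers'] = 9

# Membership tests and .get() do not call __missing__, so they add no keys.
has_typo = 'queens counyt' in d['new york']
looked_up = d['new york'].get('queens counyt')
keys_after_safe_reads = list(d['new york'])

# Subscripting a missing key does call __missing__ and stores an empty Vividict.
d['new york']['queens counyt']
keys_after_subscript = list(d['new york'])

print(keys_after_safe_reads)  # ['queens county']
print(keys_after_subscript)   # ['queens county', 'queens counyt']
```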

    Note, these are the same semantics as the most upvoted answer but in half the lines of code - nosklo's implementation:

    class AutoVivification(dict):
        """Implementation of perl's autovivification feature."""
        def __getitem__(self, item):
            try:
                return dict.__getitem__(self, item)
            except KeyError:
                value = self[item] = type(self)()
                return value
    

    Demonstration of Usage

    Below is just an example of how this dict could be easily used to create a nested dict structure on the fly. This can quickly create a hierarchical tree structure as deeply as you might want to go.

    import pprint
    
    class Vividict(dict):
        def __missing__(self, key):
            value = self[key] = type(self)()
            return value
    
    d = Vividict()
    
    d['foo']['bar']
    d['foo']['baz']
    d['fizz']['buzz']
    d['primary']['secondary']['tertiary']['quaternary']
    pprint.pprint(d)
    

    Which outputs:

    {'fizz': {'buzz': {}},
     'foo': {'bar': {}, 'baz': {}},
     'primary': {'secondary': {'tertiary': {'quaternary': {}}}}}
    

    And as the last line shows, it pretty prints beautifully and in sorted key order for manual inspection. So if you want to visually inspect your data, implementing __missing__ to set a new instance of its class to the key and return it is a far better solution than an auto-vivified defaultdict (compared further below).

    Other alternatives, for contrast:

    dict.setdefault

    Although the asker thinks this isn't clean, I find it preferable to the Vividict myself.

    d = {} # or dict()
    for (state, county, occupation), number in data.items():
        d.setdefault(state, {}).setdefault(county, {})[occupation] = number
    

    and now:

    >>> pprint.pprint(d, width=40)
    {'new jersey': {'mercer county': {'plumbers': 3,
                                      'programmers': 81},
                    'middlesex county': {'programmers': 81,
                                         'salesmen': 62}},
     'new york': {'queens county': {'plumbers': 9,
                                    'salesmen': 36}}}
    

    A misspelling would fail noisily, and not clutter our data with bad information:

    >>> d['new york']['queens counyt']
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    KeyError: 'queens counyt'
    

    Additionally, I think setdefault works great when used in loops and you don't know what you're going to get for keys, but repetitive usage becomes quite burdensome, and I don't think anyone would want to keep up the following:

    d = dict()
    
    d.setdefault('foo', {}).setdefault('bar', {})
    d.setdefault('foo', {}).setdefault('baz', {})
    d.setdefault('fizz', {}).setdefault('buzz', {})
    d.setdefault('primary', {}).setdefault('secondary', {}).setdefault('tertiary', {}).setdefault('quaternary', {})
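
    If the repetition is the main objection, the chain of setdefault calls can be wrapped in a small helper. This is a sketch only, and set_nested is a name invented here rather than anything from the standard library:

```python
def set_nested(d, keys, value):
    """Assign value at the nested path given by keys, creating plain dicts as needed."""
    *parents, last = keys
    for key in parents:
        d = d.setdefault(key, {})  # descend, creating the level if it is missing
    d[last] = value

d = {}
set_nested(d, ['foo', 'bar'], {})
set_nested(d, ['foo', 'baz'], {})
set_nested(d, ['primary', 'secondary', 'tertiary', 'quaternary'], {})
print(d)
```

    Because the result is built from plain dicts, a misspelled key on lookup still raises KeyError.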
    

    Another criticism is that setdefault requires a new instance whether it is used or not. However, Python (or at least CPython) is rather smart about handling unused and unreferenced new instances, for example, it reuses the location in memory:

    >>> id({}), id({}), id({})
    (523575344, 523575344, 523575344)
    

    An auto-vivified defaultdict

    This is a neat-looking implementation, and in a script where you never inspect the data it would be just as useful as implementing __missing__:

    from collections import defaultdict
    
    def vivdict():
        return defaultdict(vivdict)
    

    But if you need to inspect your data, an auto-vivified defaultdict populated in the same way looks like this:

    >>> d = vivdict(); d['foo']['bar']; d['foo']['baz']; d['fizz']['buzz']; d['primary']['secondary']['tertiary']['quaternary']; import pprint; 
    >>> pprint.pprint(d)
    defaultdict(<function vivdict at 0x17B01870>, {'foo': defaultdict(<function vivdict 
    at 0x17B01870>, {'baz': defaultdict(<function vivdict at 0x17B01870>, {}), 'bar': 
    defaultdict(<function vivdict at 0x17B01870>, {})}), 'primary': defaultdict(<function 
    vivdict at 0x17B01870>, {'secondary': defaultdict(<function vivdict at 0x17B01870>, 
    {'tertiary': defaultdict(<function vivdict at 0x17B01870>, {'quaternary': defaultdict(
    <function vivdict at 0x17B01870>, {})})})}), 'fizz': defaultdict(<function vivdict at 
    0x17B01870>, {'buzz': defaultdict(<function vivdict at 0x17B01870>, {})})})
    

    This output is quite inelegant, and the results are quite unreadable. The solution typically given is to recursively convert back to a dict for manual inspection. This non-trivial solution is left as an exercise for the reader.
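
    For readers who want a starting point for that conversion, here is one minimal sketch. It assumes the nesting consists only of dict-like levels (no defaultdicts hidden inside lists or tuples) and that the depth stays well below Python's recursion limit; to_plain_dict is a name chosen for this example:

```python
from collections import defaultdict

def vivdict():
    return defaultdict(vivdict)

def to_plain_dict(d):
    """Recursively copy nested mappings into plain dicts; other values pass through."""
    if isinstance(d, dict):
        return {key: to_plain_dict(value) for key, value in d.items()}
    return d

d = vivdict()
d['foo']['bar'] = 1
d['primary']['secondary']['tertiary'] = 2
plain = to_plain_dict(d)
print(plain)  # {'foo': {'bar': 1}, 'primary': {'secondary': {'tertiary': 2}}}
```

    The copy no longer autovivifies, so later misspelled lookups raise KeyError.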

    Performance

    Finally, let's look at performance. I'm subtracting the costs of instantiation.

    >>> import timeit
    >>> min(timeit.repeat(lambda: {}.setdefault('foo', {}))) - min(timeit.repeat(lambda: {}))
    0.13612580299377441
    >>> min(timeit.repeat(lambda: vivdict()['foo'])) - min(timeit.repeat(lambda: vivdict()))
    0.2936999797821045
    >>> min(timeit.repeat(lambda: Vividict()['foo'])) - min(timeit.repeat(lambda: Vividict()))
    0.5354437828063965
    >>> min(timeit.repeat(lambda: AutoVivification()['foo'])) - min(timeit.repeat(lambda: AutoVivification()))
    2.138362169265747
    

    Based on performance, dict.setdefault works the best. I'd highly recommend it for production code, in cases where you care about execution speed.

    If you need this for interactive use (in an IPython notebook, perhaps) then performance doesn't really matter, in which case I'd go with Vividict for readability of the output. Compared to the AutoVivification object (which overrides __getitem__ instead of __missing__, the hook that was made for this purpose), Vividict is far superior.

    Conclusion

    Implementing __missing__ on a subclassed dict to set and return a new instance is slightly more difficult than alternatives but has the benefits of

    • easy instantiation
    • easy data population
    • easy data viewing

    and because it is less complicated and more performant than modifying __getitem__, it should be preferred to that method.

    Nevertheless, it has drawbacks:

    • Bad lookups will fail silently.
    • The bad lookup will remain in the dictionary.

    Thus I personally prefer setdefault to the other solutions, and have used it in every situation where I have needed this sort of behavior.

  • 2020-11-22 00:52

    I like the idea of wrapping this in a class and implementing __getitem__ and __setitem__ so that they implement a simple query language:

    >>> d['new jersey/mercer county/plumbers'] = 3
    >>> d['new jersey/mercer county/programmers'] = 81
    >>> d['new jersey/mercer county/programmers']
    81
    >>> d['new jersey/mercer county']
    <view which implicitly adds 'new jersey/mercer county' to queries/mutations>
    

    If you wanted to get fancy you could also implement something like:

    >>> d['*/*/programmers']
    <view which would contain 'programmers' entries>
    

    but mostly I think such a thing would be really fun to implement :D
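
    A rough sketch of how the non-wildcard part might look, under a few assumptions that are mine rather than the answer's: keys never contain the separator, intermediate levels are plain dicts, and a partial path returns the underlying nested dict instead of a view object. The class name PathDict is hypothetical:

```python
class PathDict(dict):
    """Nested dict addressed by 'a/b/c' path strings (illustrative sketch only)."""
    sep = '/'

    def __setitem__(self, path, value):
        *parents, last = path.split(self.sep)
        node = self
        for part in parents:
            # dict.setdefault bypasses our overridden __setitem__
            node = dict.setdefault(node, part, {})
        dict.__setitem__(node, last, value)

    def __getitem__(self, path):
        node = self
        for part in path.split(self.sep):
            node = dict.__getitem__(node, part)  # KeyError if a part is missing
        return node

d = PathDict()
d['new jersey/mercer county/plumbers'] = 3
d['new jersey/mercer county/programmers'] = 81
print(d['new jersey/mercer county/programmers'])  # 81
print(d['new jersey/mercer county'])
```

    Wildcard patterns such as '*/*/programmers' are left out here; they would need a traversal over every branch rather than a single walk down one path.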

  • 2020-11-22 00:53
    class AutoVivification(dict):
        """Implementation of perl's autovivification feature."""
        def __getitem__(self, item):
            try:
                return dict.__getitem__(self, item)
            except KeyError:
                value = self[item] = type(self)()
                return value
    

    Testing:

    a = AutoVivification()
    
    a[1][2][3] = 4
    a[1][3][3] = 5
    a[1][2]['test'] = 6
    
    print(a)
    

    Output:

    {1: {2: {3: 4, 'test': 6}, 3: {3: 5}}}
    