Why don't Python sets preserve insertion order?

独自空忆成欢 提交于 2020-05-26 10:25:42

问题


I was surprised to discover recently that while dicts are guaranteed to preserve insertion order in Python 3.7+, sets are not:

>>> d = {'a': 1, 'b': 2, 'c': 3}
>>> d
{'a': 1, 'b': 2, 'c': 3}
>>> d['d'] = 4
>>> d
{'a': 1, 'b': 2, 'c': 3, 'd': 4}
>>> s = {'a', 'b', 'c'}
>>> s
{'b', 'a', 'c'}
>>> s.add('d')
>>> s
{'d', 'b', 'a', 'c'}

What is the rationale for this difference? Do the same efficiency improvements that led the Python team to change the dict implementation not apply to sets as well?

I'm not looking for pointers to ordered-set implementations or ways to use dicts as stand-ins for sets. I'm just wondering why the Python team didn't make built-in sets preserve order at the same time they did so for dicts.


回答1:


Sets and dicts are optimized for different use-cases. The primary use of a set is fast membership testing, which is order agnostic. For dicts, cost of the lookup is the most critical operation, and the key is more likely to be present. With sets, the presence or absence of an element is not known in advance, and so the set implementation needs to optimize for both the found and not-found case. Also, some optimizations for common set operations such as union and intersection make it difficult to retain set ordering without degrading performance.

While both data structures are hash based, it's a common misconception that sets are just implemented as dicts with null values. Even before the compact dict implementation in CPython 3.6, the set and dict implementations already differed significantly, with little code reuse. For example, dicts use randomized probing, but sets use a combination of linear probing and open addressing, to improve cache locality. The initial linear probe (default 9 steps in CPython) will check a series of adjacent key/hash pairs, improving performance by reducing the cost of hash collision handling - consecutive memory access is cheaper than scattered probes.

  • dictobject.c - master, v3.5.9
  • setobject.c - master, v3.5.9
  • issue18771 - changeset to reduce the cost of hash collisions for set objects in Python 3.4.

It would be possible in theory to change CPython's set implementation to be similar to the compact dict, but in practice there are drawbacks, and notable core developers were opposed to making such a change.

Sets remain unordered. (Why? The usage patterns are different. Also, different implementation.)

– Guido van Rossum

Sets use a different algorithm that isn't as amendable to retaining insertion order. Set-to-set operations lose their flexibility and optimizations if order is required. Set mathematics are defined in terms of unordered sets. In short, set ordering isn't in the immediate future.

– Raymond Hettinger

A detailed discussion of whether to compactify sets for 3.7, and answers about why it was decided against, can be found in the python-dev mailing lists.

In summary, the main points are that the usage patterns are different (insertion ordering dicts such as **kwargs is useful, less so for sets), the space savings for compacting sets are less significant (because there are only key and hash array to densify, as opposed to keys, hashes and values), and the aforementioned linear probing optimization in sets is incompatible with a compact implementation.

I will reproduce Raymond's post below which covers the most important points.

On Sep 14, 2016, at 3:50 PM, Eric Snow wrote:

Then, I'll do same to sets.

Unless I've misunderstood, Raymond was opposed to making a similar change to set.

That's right. Here are a few thoughts on the subject before people starting running wild.

  • For the compact dict, the space savings was a net win with the additional space consumed by the indices and the overallocation for the key/value/hash arrays being more than offset by the improved density of key/value/hash arrays. However for sets, the net was much less favorable because we still need the indices and overallocation but can only offset the space cost by densifying only two of the three arrays. In other words, compacting makes more sense when you have wasted space for keys, values, and hashes. If you lose one of those three, it stops being compelling.

  • The use pattern for sets is different from dicts. The former has more hit or miss lookups. The latter tends to have fewer missing key lookups. Also, some of the optimizations for the set-to-set operations make it difficult to retain set ordering without impacting performance.

  • I pursued alternative path to improve set performance. Instead of compacting (which wasn't much of space win and incurred the cost of an additional indirection), I added linear probing to reduce the cost of collisions and improve cache performance. This improvement is incompatible with the compacting approach I advocated for dictionaries.

  • For now, the ordering side-effect on dictionaries is non-guaranteed, so it is premature to start insisting the sets become ordered as well. The docs already link to a recipe for creating an OrderedSet ( https://code.activestate.com/recipes/576694/ ) but it seems like the uptake has been nearly zero. Also, now that Eric Snow has given us a fast OrderedDict, it is easier than ever to build an OrderedSet from MutableSet and OrderedDict, but again I haven't observed any real interest because typical set-to-set data analytics don't really need or care about ordering. Likewise, the primary use of fast membership testings is order agnostic.

  • That said, I do think there is room to add alternative set implementations to PyPI. In particular, there are some interesting special cases for orderable data where set-to-set operations can be sped-up by comparing entire ranges of keys (see https://code.activestate.com/recipes/230113-implementation-of-sets-using-sorted-lists for a starting point). IIRC, PyPI already has code for set-like bloom filters and cuckoo hashing.

  • I understanding that it is exciting to have a major block of code accepted into the Python core but that shouldn't open to floodgates to engaging in more major rewrites of other datatypes unless we're sure that it is warranted.

– Raymond Hettinger

From [Python-Dev] Python 3.6 dict becomes compact and gets a private version; and keywords become ordered, Sept 2016.




回答2:


Discussions

Your question is germane and has already been heavily discussed on python-devs not long ago. R. Hettinger shared a list of rationales in that thread. The state of the issue appears open-ended now, shortly after this detailed reply from T. Peters.

In short, the implementation of modern dicts that preserves insertion order is unique and not considered appropriate with sets. In particular, dicts are used everywhere to run Python (e.g. __dict__ in namespaces of objects). A major motivation behind the modern dict was to reduce size, making Python more memory-efficient overall. In contrast, sets are less prevalent than dicts within Python's core and thus dissuade such a refactoring. See also R. Hettinger's talk on the modern dict implementation.


Perspectives

The unordered nature of sets in Python parallels the behavior of mathematical sets. Order is not guaranteed.

The corresponding mathematical concept is unordered and it would be weird to impose such as order - R. Hettinger

If order of any kind were introduced to sets in Python, then this behavior would comply with a completely separate mathematical structure, namely an ordered set (or Oset). Osets play a separate roll in mathematics, particularly in combinatorics. One practical application of Osets is observed in changing of bells.

Having unordered sets are consistent with a very generic and ubiquitous data structure that unpins most modern math, i.e. Set Theory. I submit, unordered sets in Python are good to have.

See also related posts that expand on this topic:

  • Converting a list to a set changes element order
  • Get unique values from a list in python


来源:https://stackoverflow.com/questions/61414947/why-dont-python-sets-preserve-insertion-order

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!