cPickle - different results pickling the same object

Posted by 限于喜欢 on 2021-02-07 05:52:00

Question


Is anyone able to explain the comment under testLookups() in this code snippet?

I've run the code, and what the comment says is indeed true. However, I'd like to understand why it's true, i.e. why cPickle outputs different values for the same object depending on how it is referenced.

Does it have anything to do with the reference count? If so, isn't that a kind of bug, i.e. the pickled and deserialized object would have an abnormally high reference count and in effect would never get garbage collected?


Answer 1:


It is looking at the reference counts. From the cPickle source:

if (Py_REFCNT(args) > 1) {
    if (!( py_ob_id = PyLong_FromVoidPtr(args)))
        goto finally;

    if (PyDict_GetItem(self->memo, py_ob_id)) {
        if (get(self, py_ob_id) < 0)
            goto finally;

        res = 0;
        goto finally;
    }
}

The pickle protocol has to deal with pickling multiple references to the same object. To avoid duplicating the object when it is unpickled, it uses a memo. The memo maps indexes to the objects pickled so far. The PUT (p) opcode in the pickle stores the current object in this memo dictionary.
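As a rough illustration of what the memo buys you, here is a minimal sketch. It assumes Python 3's `pickle` module rather than `cPickle` (so the exact bytes may differ from what cPickle emits), and the names `shared` and `container` are made up for the example. It pickles a list holding two references to the same inner list and checks that the sharing survives the round trip.

```python
import pickle
import pickletools

shared = [1, 2, 3]
container = [shared, shared]  # two references to one and the same list

# Protocol 0 is the text protocol that uses the PUT (p) and GET (g) opcodes.
data = pickle.dumps(container, protocol=0)
restored = pickle.loads(data)

# List the opcodes of the pickle "program" to see the memo being used.
opcode_names = [op.name for op, arg, pos in pickletools.genops(data)]
print(opcode_names)

# Both slots refer to a single list again, not to two equal copies.
print(restored[0] is restored[1])
```

The second occurrence of `shared` is written as a GET that reads the memo slot filled by the earlier PUT, which is why the unpickled container holds one list twice instead of two copies.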

However, if there is only a single reference to an object, there is no reason to store it in the memo: with only one reference, it can never need to be referenced again. So the cPickle code checks the reference count at this point as a small optimization.

So yes, it's the reference counts. But that's not a problem. The unpickled objects will have the correct reference counts; cPickle just produces a slightly shorter pickle when the reference count is 1.

Now, I don't know what you are doing that makes you care about this, but you really shouldn't assume that pickling the same object will always give you the same result. If nothing else, I'd expect dictionaries to give you problems because the order of the keys is undefined. Unless you have Python documentation that guarantees the pickle is the same each time, I highly recommend you don't depend on it.




Answer 2:


There is no guarantee that seemingly identical objects will produce identical pickle strings.

The pickle protocol is a virtual machine, and a pickle string is a program for that virtual machine. For a given object there exist multiple pickle strings (=programs) that will reconstruct that object exactly.
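One way to see "several programs, same object" for yourself is the sketch below. It assumes Python 3's `pickle` and the standard `pickletools` module instead of `cPickle`; `obj` and the helper `opcode_names` are hypothetical names for the example. `pickletools.optimize` rewrites a pickle without the PUT opcodes whose memo slot is never read, giving a different byte string that rebuilds the same value.

```python
import pickle
import pickletools

obj = ({1: 1, 2: 4}, 'Hello World', (1, 2, 3), [1, 2, 3])

original = pickle.dumps(obj, protocol=0)
# Drop PUT opcodes that no GET ever refers to.
shorter = pickletools.optimize(original)


def opcode_names(data):
    """Return the opcode names of a pickle, in program order."""
    return [op.name for op, arg, pos in pickletools.genops(data)]


print(opcode_names(original))
print(opcode_names(shorter))

# Two different programs, one reconstructed value.
print(original != shorter)
print(pickle.loads(original) == pickle.loads(shorter))
```

Comparing the unpickled values, as in the last line, is the comparison that holds; comparing the byte strings does not.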

To take one of your examples:

>>> from cPickle import dumps
>>> t = ({1: 1, 2: 4, 3: 6, 4: 8, 5: 10}, 'Hello World', (1, 2, 3, 4, 5), [1, 2, 3, 4, 5])
>>> dumps(({1: 1, 2: 4, 3: 6, 4: 8, 5: 10}, 'Hello World', (1, 2, 3, 4, 5), [1, 2, 3, 4, 5]))
"((dp1\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\nI10\nsS'Hello World'\np2\n(I1\nI2\nI3\nI4\nI5\ntp3\n(lp4\nI1\naI2\naI3\naI4\naI5\nat."
>>> dumps(t)
"((dp1\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\nI10\nsS'Hello World'\n(I1\nI2\nI3\nI4\nI5\nt(lp2\nI1\naI2\naI3\naI4\naI5\natp3\n."

The two pickle strings differ in their use of the p opcode. The opcode takes one integer argument and its function is as follows:

  name='PUT'    code='p'   arg=decimalnl_short

  Store the stack top into the memo.  The stack is not popped.

  The index of the memo location to write into is given by the newline-
  terminated decimal string following.  BINPUT and LONG_BINPUT are
  space-optimized versions.

To cut a long story short, the two pickle strings are basically equivalent.
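If you want to check the equivalence rather than take it on faith, the following sketch loads both strings from the session above. It assumes Python 3's `pickle`, which can read these protocol 0 pickles when they are given as bytes; the names `from_literal`, `from_name` and `count_puts` are made up for the example.

```python
import pickle
import pickletools

# The two pickle strings from the session above, as bytes literals.
from_literal = b"((dp1\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\nI10\nsS'Hello World'\np2\n(I1\nI2\nI3\nI4\nI5\ntp3\n(lp4\nI1\naI2\naI3\naI4\naI5\nat."
from_name = b"((dp1\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\nI10\nsS'Hello World'\n(I1\nI2\nI3\nI4\nI5\nt(lp2\nI1\naI2\naI3\naI4\naI5\natp3\n."


def count_puts(data):
    """Count the PUT opcodes in a pickle."""
    return sum(1 for op, arg, pos in pickletools.genops(data) if op.name == "PUT")


a = pickle.loads(from_literal)
b = pickle.loads(from_name)

print(count_puts(from_literal), count_puts(from_name))
print(a == b)
```

The first string contains four PUT opcodes and the second three, yet both load to the same tuple.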

I haven't tried to nail down the exact cause of the differences in the generated opcodes. This could well have to do with the reference counts of the objects being serialized. What is clear, however, is that discrepancies like this have no effect on the reconstructed object.



Source: https://stackoverflow.com/questions/7501577/cpickle-different-results-pickling-the-same-object
