I recently asked a question regarding how to save large python objects to file. I had previously run into problems converting massive Python dictionaries into string and writing
Python code would be extremely slow when it comes to implementing data serialization. If you try to create an equivalent to Pickle in pure Python, you'll see that it will be super slow. Fortunately the built-in modules which perform that are quite good.
Apart from cPickle
, you will find the marshal
module, which is a lot faster.
But it needs a real file handle (not from a file-like object).
You can import marshal as Pickle
and see the difference.
I don't think you can make a custom serializer which is a lot faster than this...
Here's an actual (not so old) serious benchmark of Python serializers
You can compress the data with bzip2:
from __future__ import with_statement # Only for Python 2.5
import bz2,json,contextlib
hugeData = {'key': {'x': 1, 'y':2}}
with contextlib.closing(bz2.BZ2File('data.json.bz2', 'wb')) as f:
json.dump(hugeData, f)
Load it like this:
from __future__ import with_statement # Only for Python 2.5
import bz2,json,contextlib
with contextlib.closing(bz2.BZ2File('data.json.bz2', 'rb')) as f:
hugeData = json.load(f)
You can also compress the data using zlib or gzip with pretty much the same interface. However, both zlib and gzip's compression rates will be lower than the one achieved with bzip2 (or lzma).
faster, or even possible, to zip this pickle file prior to [writing]
Of course it's possible, but there's no reason to try to make an explicit zipped copy in memory (it might not fit!) before writing it, when you can automatically cause it to be zipped as it is written, with built-in standard library functionality ;)
See http://docs.python.org/library/gzip.html . Basically, you create a special kind of stream with
gzip.GzipFile("output file name", "wb")
and then use it exactly like an ordinary file
created with open(...)
(or file(...)
for that matter).
I'd just expand on phihag's answer.
When trying to serialize an object approaching the size of RAM, pickle/cPickle should be avoided, since it requires additional memory of 1-2 times the size of the object in order to serialize. That's true even when streaming it to BZ2File. In my case I was even running out of swap space.
But the problem with JSON (and similarly with HDF files as mentioned in the linked article) is that it cannot serialize tuples, which in my data are used as keys to dicts. There is no great solution for this; the best I could find was to convert tuples to strings, which requires some memory of its own, but much less than pickle. Nowadays, you can also use the ujson library, which is much faster than the json library.
For tuples composed of strings (requires strings to contain no commas):
import ujson as json
from bz2 import BZ2File
bigdata = { ('a','b','c') : 25, ('d','e') : 13 }
bigdata = dict([(','.join(k), v) for k, v in bigdata.viewitems()])
f = BZ2File('filename.json.bz2',mode='wb')
json.dump(bigdata,f)
f.close()
To re-compose the tuples:
bigdata = dict([(tuple(k.split(',')),v) for k,v in bigdata.viewitems()])
Alternatively if e.g. your keys are 2-tuples of integers:
bigdata2 = { (1,2): 1.2, (2,3): 3.4}
bigdata2 = dict([('%d,%d' % k, v) for k, v in bigdata2.viewitems()])
# ... save, load ...
bigdata2 = dict([(tuple(map(int,k.split(','))),v) for k,v in bigdata2.viewitems()])
Another advantage of this approach over pickle is that json appears to compress a significantly better than pickles when using bzip2 compression.
Look at Google's ProtoBuffers. Although they are not designed for large files out-of-the box, like audio-video files, they do well with object serialization as in your case, because they were designed for it. Practice shows that some day you may need to update structure of your files, and ProtoBuffers will handle it. Also, they are highly optimized for compression and speed. And you're not tied to Python, Java and C++ are well supported.