I am trying to serialize a large (~10**6 rows, each with ~20 values) list, to be used later by myself (so pickle's lack of safety isn't a concern).
Each row of the list is a tuple of values, derived from some SQL database. So far, I have seen datetime.datetime, strings, integers, and NoneType, but I might eventually have to support additional data types.
For serialization, I've considered pickle (cPickle), json, and plain text - but only pickle saves the type information: json can't serialize datetime.datetime, and plain text has its obvious disadvantages.
However, cPickle is pretty slow for data this large, and I'm looking for a faster alternative.
I think you should give PyTables a look. It should be ridiculously fast, at least faster than using an RDBMS, since it's very lax and doesn't impose any read/write restrictions, plus you get a better interface for managing your data, at least compared to pickling it.
Pickle is actually quite fast so long as you aren't using the ASCII protocol (protocol 0, the default in Python 2). Just make sure to dump using protocol=pickle.HIGHEST_PROTOCOL.
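A minimal sketch of that, assuming rows shaped like the ones in the question. The sample values and the helper names dump_rows and load_rows are made up for illustration, and an in-memory buffer stands in for a real file opened in binary mode:

```python
import datetime
import io
import pickle

# Hypothetical sample rows standing in for the SQL result set.
rows = [
    (1, "alice", datetime.datetime(2020, 1, 2, 3, 4, 5), None),
    (2, "bob", datetime.datetime(2021, 6, 7, 8, 9, 10), 42),
]


def dump_rows(rows, fileobj):
    """Pickle the whole list with the newest (binary) protocol available."""
    pickle.dump(rows, fileobj, protocol=pickle.HIGHEST_PROTOCOL)


def load_rows(fileobj):
    """Read back a list written by dump_rows."""
    return pickle.load(fileobj)


# Round trip through an in-memory buffer; a file opened with 'wb'/'rb'
# would be used the same way.
buffer = io.BytesIO()
dump_rows(rows, buffer)
buffer.seek(0)
restored = load_rows(buffer)
```

On Python 3 the default protocol is already a binary one, so passing HIGHEST_PROTOCOL explicitly matters most on Python 2, where the default is the ASCII protocol 0.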
Protocol buffers are a flexible, efficient, automated mechanism for serializing structured data – think XML, but smaller, faster, and simpler.
Their advantages over XML are that they:
- are simpler
- are 3 to 10 times smaller
- are 20 to 100 times faster
- are less ambiguous
- generate data access classes that are easier to use programmatically
- Protocol Buffers: used in Caffe, for example; they maintain type information, but take considerably more effort to set up than pickle
- MessagePack: See python package - supports streaming (source)
- BSON: see python package docs
For hundreds of thousands of simple (up to JSON-compatible) complexity Python objects, I've found the best combination of simplicity, speed, and size by combining:
- py-ubjson
- gzip
It beats the pickle and cPickle options by orders of magnitude.
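import gzip
import ubjson  # provided by the py-ubjson package

def save(items, filename):
    # Write the items as gzip-compressed UBJSON.
    with gzip.open(filename, 'wb') as f:
        ubjson.dump(items, f)

def load(filename):
    # Read the items back from a file written by save().
    with gzip.open(filename, 'rb') as f:
        return ubjson.load(f)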
I usually serialize to plain text (*.csv) because I found it to be fastest. The csv module works quite well. See http://docs.python.org/library/csv.html
If you have to deal with Unicode in your strings, check out the UnicodeReader and UnicodeWriter examples at the end of that page.
If you serialize for your own future use, I guess it would suffice to know that each csv column always holds the same data type (e.g., strings are always in column 2).
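As a rough sketch of that idea, the reader can apply one converter per column to restore the types. The three-column schema, the DATETIME_FORMAT string, and the helper names below are assumptions for illustration, not part of the csv module:

```python
import csv
import datetime
import io

# Assumed text format for datetime columns (drops microseconds).
DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_datetime(text):
    return datetime.datetime.strptime(text, DATETIME_FORMAT)


# One converter per column, in column order (hypothetical schema).
CONVERTERS = [int, str, parse_datetime]


def write_rows(rows, fileobj):
    """Write rows as csv; datetimes are formatted, None becomes an empty field."""
    writer = csv.writer(fileobj)
    for row in rows:
        writer.writerow([
            value.strftime(DATETIME_FORMAT)
            if isinstance(value, datetime.datetime) else value
            for value in row
        ])


def read_rows(fileobj):
    """Read rows back, converting each field by its column's converter."""
    rows = []
    for record in csv.reader(fileobj):
        rows.append(tuple(
            None if text == "" else convert(text)
            for convert, text in zip(CONVERTERS, record)
        ))
    return rows


# Hypothetical sample rows; an in-memory buffer stands in for a file
# opened with newline="".
rows = [
    (1, "alice", datetime.datetime(2020, 1, 2, 3, 4, 5)),
    (2, "bob", None),
]
buffer = io.StringIO(newline="")
write_rows(rows, buffer)
buffer.seek(0)
restored = read_rows(buffer)
```

One caveat of this scheme is that an empty string and None both end up as an empty field, so they cannot be told apart when reading back.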
Avro seems to be a promising and well-designed solution, though not yet a popular one.
Just for the sake of completeness, there is also the dill library, which extends pickle.
See: How to dill (pickle) to file?